2020-05-20 10:57:21 UTC - Patrik Kleindl: Two things I have noticed: • Pulsar Functions can run outside the broker; this is not stated correctly • The table service of BookKeeper requires maintaining RocksDB, but so does Kafka Streams; this is not mentioned ---- 2020-05-20 11:04:36 UTC - Thiago: Hi everyone. If you happen to know:
(1) What's included in the standalone Pulsar we get with Docker? I checked the image on Docker Hub and GitHub but it's not obvious what's in it. I would assume it's a cluster with ZooKeeper (3 nodes), BookKeeper (3 nodes), and Broker (3 nodes). I'm thinking of writing a docker-compose yml with those, so it would be great to know the baseline config for comparison. (2) What would be a simple way to monitor and assess the performance, to know if or for how long a given Pulsar config (standalone image or a docker-compose yml) is good for me in terms of capacity, etc? Thanks in advance! :slightly_smiling_face: Greatly appreciated. ---- 2020-05-20 13:57:43 UTC - Huanli Meng: @Huanli Meng set the channel topic: - Pulsar 2.5.2 released <http://pulsar.apache.org/blog/2020/05/19/Apache-Pulsar-2-5-2/> - Pulsar Summit Virtual Conference 2020 will happen on June 17-18: <https://pulsar-summit.org/registration> - 2020 Pulsar User Survey Report: <https://bit.ly/3d1KsGG> +1 : Shivji Kumar Jha, Sijie Guo, Gilles Barbier, Penghui Li tada : Gilles Barbier, Sijie Guo, Raman Gupta, Penghui Li clap : Gilles Barbier, Sijie Guo, Penghui Li ---- 2020-05-20 14:50:40 UTC - Matteo Merli: > Can anyone explain this please? Metadata is stored in ZK, though it's not fine-grained and in general it's not the limiting factor for retention capabilities. And, finally, Pulsar also supports tiering old data to cloud storage like S3 or GCS, so that removes the issue completely. +1 : Konstantinos Papalias, Patrik Kleindl ---- 2020-05-20 14:52:43 UTC - Matteo Merli: > > "replicationBacklog" : 3, > "connected" : false, Messages are being accumulated (3 of them) because it's not able to connect to the other cluster. Check the broker logs for errors. Also check that the serviceUrl for the remote clusters is set correctly in the clusters metadata (`pulsar-admin clusters --help`) ---- 2020-05-20 15:26:45 UTC - Sahil Sawhney: @Sijie Guo I got it fixed by not changing the default RAM, CPU and heap for the Pulsar proxy. Are there known problems when we try to (vertically) scale the configuration for the Pulsar proxy? ---- 2020-05-20 15:52:55 UTC - Amit Jere: @Amit Jere has joined the channel ---- 2020-05-20 16:00:41 UTC - Julius S: We are having Pulsar & Kafka discussions in our business now, and this article is having some bite with our non/semi-technical leadership, which is making the conversation a bit harder. Would be good to counter/correct some of the above. see_no_evil : Konstantinos Papalias ---- 2020-05-20 16:12:33 UTC - Amit Gupta: @Amit Gupta has joined the channel ---- 2020-05-20 16:16:24 UTC - Sreeram N: @Sreeram N has joined the channel ---- 2020-05-20 16:27:25 UTC - Stan: @Stan has joined the channel ---- 2020-05-20 16:45:33 UTC - Sijie Guo: Sorry, what does “by not changing” mean here? ---- 2020-05-20 16:50:44 UTC - David Kjerrumgaard: @Julius S Hopefully your non-technical leadership takes that article with a large grain of salt and recognizes it for the marketing piece that it is. The intent of it was to continue the FUD that Confluent so heavily traffics in. ---- 2020-05-20 16:54:22 UTC - David Kjerrumgaard: "Bookkeeper’s retention capabilities are limited by the fine-grained metadata it stores in Zookeeper." I think they are referring to the fact that the managed ledger information (which associates the topic with the various ledgers in BK) is stored in ZooKeeper, so there is theoretically a limit to the storage based on the fact that these ledgers have to be retained inside ZK.
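For the earlier question about monitoring capacity, and for checking replication state like the `replicationBacklog` / `connected` fields quoted above, a minimal sketch that polls topic stats from the admin REST API might look like the following. It assumes the default admin port 8080 with no authentication, and the topic name is only illustrative:
```
import requests  # pip install requests

ADMIN = "http://localhost:8080"               # assumed default admin REST endpoint
TOPIC = "persistent/public/default/my-topic"  # tenant/namespace/topic, illustrative

stats = requests.get(f"{ADMIN}/admin/v2/{TOPIC}/stats").json()

# Overall ingress and storage figures, useful for rough capacity monitoring.
print("msgRateIn:      ", stats.get("msgRateIn"))
print("msgThroughputIn:", stats.get("msgThroughputIn"))
print("storageSize:    ", stats.get("storageSize"))

# Per-remote-cluster replication state (the fields quoted in the thread above).
for cluster, repl in stats.get("replication", {}).items():
    print(cluster, "connected:", repl.get("connected"),
          "replicationBacklog:", repl.get("replicationBacklog"))
```
Run periodically (or scraped via the brokers' Prometheus metrics endpoint instead), this gives a simple baseline view of whether a given setup keeps up with the offered load.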
---- 2020-05-20 16:58:04 UTC - David Kjerrumgaard: I think an easy fix for this issue would be to store the managed ledger info inside of BK like we do the cursors.... ---- 2020-05-20 17:51:39 UTC - Karthik Ramasamy: We adopted Pulsar since it is technically superior - <https://twitter.com/karthikz/status/1262838843071361024?s=20> ---- 2020-05-20 18:31:58 UTC - Carolyn King: @Carolyn King has joined the channel ---- 2020-05-20 19:10:47 UTC - rwaweber: Hey all! Websocket API question (and it could also be rooted in a lack of in-depth familiarity with websockets), but is it possible to authorize a client for websocket connections? I don’t see mention of it in the <https://pulsar.apache.org/docs/en/client-libraries-websocket/|docs> page or in the code examples, but it could also be that I’m overlooking a section ---- 2020-05-20 19:12:07 UTC - Matteo Merli: No, the websocket proxy will use its own credentials (as in super-user) when talking to brokers. Websocket will validate authorization prior to that. ---- 2020-05-20 19:25:49 UTC - rwaweber: > Websocket will validate authorization prior to that Meaning that the client connecting to the websocket proxy is authorized before reaching the proxy? Or does the websocket proxy perform its own authorization? ---- 2020-05-20 19:30:48 UTC - Matteo Merli: yes, the websocket itself does the authorization ---- 2020-05-20 19:31:28 UTC - Matteo Merli: `ws-client -> [ ws-server pulsar-client ] -> pulsar-broker` ---- 2020-05-20 19:52:27 UTC - Patrik Kleindl: @Karthik Ramasamy In all fairness, this is a great success, but hardly a good example considering Splunk bought several key people for Pulsar. I think both sides can take away areas for improvement from this; that's why I am asking for opinions. ---- 2020-05-20 20:18:34 UTC - Sijie Guo: @Patrik Kleindl: thank you for sharing the blog post. We have noted it; it has quite a lot of misleading technical statements. Here are some clarifications, and we are also working on a blog post to counter those misleading parts, which will come out soon. > Message consumption model Pulsar’s model is not entirely a push model. I would instead call it a “streaming-pull” model. The consumer asks the broker to send a certain number of messages via a flow request, and the broker streams the messages to the consumer. The consumer then asks the broker for the next batch of messages via another flow request. > Storage Architecture Pulsar/BookKeeper is a distributed log system with infinite retention (via the built-in tiered storage support). The “index” part they mention in the article is about cursors. Cursors are smarter “offsets” with the ability to track message acknowledgments. > Components “Pulsar makes use of Apache Zookeeper for consensus, Apache BookKeeper for message storage which in turn uses the RocksDB database.” BookKeeper is the storage engine of Pulsar. It is distributed and installed as part of Pulsar’s deployment procedure. RocksDB is used for ledger indexes, to provide fast lookup of entries. It is embedded in the bookie processes. Technically there is no extra process for that part. > Operational Simplicity The decoupled architecture improves scalability, availability and reliability. There are more steps in the initial setup. However, after the first setup, operating and scaling the cluster is actually much easier. Especially when leveraging Kubernetes, you can reduce all the operational pain points after setup.
In contrast, Kafka is easy to set up but becomes hard to operate if it needs to scale up. People have developed a lot of tools to band-aid the system. > Ecosystem and community The Pulsar ecosystem has grown a lot in less than a year. We have integrations with many popular systems through connectors and integrations. The Pulsar community has also grown fast in the past year. It has been adopted by internet giants like Tencent, Splunk and many others. The number of contributors has grown from ~70 to ~260. That is almost a 4x increase in one year. Huge momentum is happening in the Pulsar community. It is also growing beyond Yahoo, where the project was created. If an organization is choosing a technology, they might also consider how the technology and community will evolve in the coming 1~2 years. > Throughput, Latency and scale The throughput and latency argument tends to become very biased. The main point of that section is that the multi-layer architecture introduces additional network hops. It is a very misleading statement for non-technical people: any distributed system with replication requires at least 2 network round-trips, and so does Kafka. > Ordering guarantee Pulsar is a *log* storage system, so it provides very strong partition-based and key-based ordering guarantees. Different from Kafka, Pulsar allows out-of-order consumption over a strongly ordered log through a shared subscription, to scale out consumption for those applications that don’t require an ordering guarantee. That is flexibility provided by Pulsar which Kafka can’t support. > Compaction > It also does not work in the same seamless way but instead creates a “snapshot” of a topic compacted at some prior point in time (with the original topic remaining). It was actually a technical decision when we introduced this feature. We wanted to provide the flexibility to let applications choose between consuming the compacted “state” or the raw data. > *Mission Critical:* +1 : Konstantinos Papalias ---- 2020-05-20 20:18:35 UTC - Sijie Guo: Pulsar has been used in a lot of mission-critical use cases like payment processing, billing and transactions in different industries (e.g. financial, e-commerce, retail, etc.). One of the largest use cases is Tencent’s billing platform. Tencent is using Pulsar for processing tens of billions of transactions every day. > Message Routing > But like RabbitMQ, Pulsar runs its Pulsar Functions inside the broker. Pulsar Functions has a very flexible deployment model. It can run alongside the brokers, in a dedicated function worker cluster, or in Kubernetes. ---- 2020-05-20 20:38:48 UTC - Patrik Kleindl: @Sijie Guo Thank you for the comprehensive reply. I think the RocksDB comment was more about an additional piece to maintain, and if I can make an educated guess, this will show up when the table service is used more heavily by Pulsar Functions. The scalability argument is true but only applies to systems with extreme growth; for most small to medium installations this is probably irrelevant. ---- 2020-05-20 20:43:52 UTC - rwaweber: Interesting, what does the websocket server use to authenticate its clients then? Would it be something like a token as a URL parameter? ---- 2020-05-20 20:44:15 UTC - Julius S: @David Kjerrumgaard yes we are fighting the FUD of course, but with this kind of audience it just creates a small uphill battle unfortunately. @Sijie Guo thanks for the clear replies, and great that there is a plan for a blog post to straighten the record.
Please feel free to share work in progress if you would like input. ---- 2020-05-20 20:51:26 UTC - Sahil Sawhney: @Sijie Guo by not changing, I mean I used the default values that the `values.yaml` file had. Though my original aim was to use an upgraded config for CPU and RAM. But that was not working and resulted in ```javax.ws.rs.ProcessingException: handshake timed out``` ---- 2020-05-20 20:51:51 UTC - Gilles Barbier: thx @David Kjerrumgaard ---- 2020-05-20 21:36:48 UTC - Sijie Guo: @Patrik Kleindl: regarding RocksDB, it should be exactly the same, as Kafka uses RocksDB for its state stores. I was surprised that they would use a library that they are also using as an argument :slightly_smiling_face: People tend to think scalability only applies to systems with extreme growth. However, that is not a fair statement. The scalability issue is usually tied to an availability issue. The problem arises for mission-critical services. I worked with a customer who was using Kafka in a critical path. It is an e-commerce company. They are not at large scale, but Kafka unfortunately is in a critical path. So during a Black Friday sale-ish event, they were not able to add machines to their existing Kafka cluster even though they had new machines allocated, because their existing Kafka cluster had already reached 90% of total capacity. If they added new machines, they would have had to perform a partition rebalance, which would cause service unavailability and bring down the whole service. So scalability isn’t just about scaling out to hundreds or thousands of machines; scalability is also about scaling up to keep your service available for unexpected traffic. That is what was lacking in Kafka. ok_hand : Konstantinos Papalias, Shivji Kumar Jha grey_question : Julius S ---- 2020-05-20 21:56:05 UTC - Raman Gupta: The Slack Community Size one is funny because it's pretty much impossible to get an answer to a question on Kafka/Confluent's Slack, despite having many more users. Nobody from Confluent ever answers anything, nor do Kafka users. By comparison, smaller though it is, Pulsar's Slack is way more active and responsive. +1 : David Kjerrumgaard, Sijie Guo, Penghui Li, Shivji Kumar Jha ---- 2020-05-20 21:56:50 UTC - Raman Gupta: I shouldn't say "never", but definitely rarely, at least in my experience... ---- 2020-05-20 21:58:46 UTC - Raman Gupta: This is also a bit amusing: > Pull based architectures are often preferable for high throughput workloads as they allow consumers to manage their own flow control, essentially fetching only what they need. Push based architectures require flow control and backpressure to be integrated into the broker. IOW, our clients are way more complex, with much more complex APIs and hundreds of tuning knobs that take weeks or months to understand, and it's essentially impossible to configure timeout values since timeouts are actually nondeterministic; however, that's a good thing! ---- 2020-05-20 22:06:28 UTC - Raman Gupta: In all fairness there are some advantages to their approach, but I still prefer Pulsar's approach overall. ---- 2020-05-20 23:49:42 UTC - David Kjerrumgaard: <https://thenewstack.io/lenses-io-helps-eliminate-kafka-fridays-for-vortexa/|https://thenewstack.io/lenses-io-helps-eliminate-kafka-fridays-for-vortexa/> ---- 2020-05-20 23:50:04 UTC - David Kjerrumgaard: Lots of horror stories with Kafka as well 100 : Julius S ---- 2020-05-20 23:57:12 UTC - Matteo Merli: It uses the same provider mechanism that the broker uses: token, TLS certs, etc...
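Since the answer above is that the WebSocket proxy authenticates clients with the same providers as the broker, a minimal sketch of a token-authenticated WebSocket producer follows. It assumes token auth is enabled, the WebSocket service is on port 8080, and the topic and token value are placeholders; the token could alternatively be sent as an `Authorization: Bearer` header rather than a query parameter:
```
import base64
import json

import websocket  # pip install websocket-client

TOKEN = "<jwt-token>"  # placeholder; obtain a real token out of band
URL = ("ws://localhost:8080/ws/v2/producer/persistent/public/default/my-topic"
       "?token=" + TOKEN)

ws = websocket.create_connection(URL)

# Producer messages are JSON with a base64-encoded payload.
ws.send(json.dumps({
    "payload": base64.b64encode(b"hello over websocket").decode("ascii"),
    "properties": {"source": "ws-demo"},
    "context": "1",
}))
print(ws.recv())  # expect an ack like {"result": "ok", ..., "context": "1"}
ws.close()
```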
---- 2020-05-21 00:48:16 UTC - Ken Huang: I checked the broker logs and it has a wrong service URL. I modified the URL before, but the broker still uses the old URL. Checking with pulsar-admin, it shows the correct service URL. ---- 2020-05-21 02:43:39 UTC - Matteo Merli: Yes, it has created a Pulsar client instance internally with the wrong URL and that instance is cached ---- 2020-05-21 02:52:41 UTC - Ken Huang: Can I clear the cache? How do I do that? ---- 2020-05-21 03:41:36 UTC - Luke Stephenson: <https://apache-pulsar.slack.com/archives/C5Z4T36F7/p1589874310138300?thread_ts=1589856904.131300&cid=C5Z4T36F7> From this it sounds like Pulsar currently doesn't gracefully handle producers attempting to publish more than the cluster can handle. Even with the brokers having 16GB of memory, I'm still seeing them OOM if the brokers fall behind. Is there anything on the roadmap to address this? If not, happy to raise something. In my instance the cluster can be fairly happy during publishing, but if it starts to fall behind, one broker will crash with OOM, and once that happens it cascades to the other brokers. ---- 2020-05-21 03:54:27 UTC - Matteo Merli: The only way would be to do a rolling restart of brokers ---- 2020-05-21 04:04:22 UTC - Matteo Merli: There is already a throttling mechanism that happens in terms of max number of messages pending to be published from a single client connection. After that, we pause reading from that connection. An improvement there would be to also limit pending bytes and have a global max (although that’s more complex to implement in an efficient way). After that, there’s also the possibility to rate limit topics, if needed. ---- 2020-05-21 04:08:30 UTC - Luke Stephenson: > There is already a throttling mechanism that happens in terms of max number of messages pending to be published from a single client connection. After that, we pause reading from that connection. Is that configurable? ---- 2020-05-21 04:15:15 UTC - Luke Stephenson: I've only got 2 instances publishing, so surprised I'm hitting any limits. ---- 2020-05-21 04:16:00 UTC - Matteo Merli: are these big messages? ---- 2020-05-21 04:16:17 UTC - Luke Stephenson: 5KB each message ---- 2020-05-21 04:16:27 UTC - Luke Stephenson: is that considered big? ---- 2020-05-21 04:17:33 UTC - Luke Stephenson: And the brokers are showing 13GB memory usage when the OOM occurs: ```01:06:36.302 [pulsar-io-22-1] ERROR org.apache.pulsar.PulsarBrokerStarter - -- Shutting down - Received OOM exception: failed to allocate 16777216 byte(s) of direct memory (used: 13203668992, max: 13207863296)``` ---- 2020-05-21 04:19:25 UTC - Matteo Merli: No, 5KB is definitely not big ---- 2020-05-21 04:19:46 UTC - Matteo Merli: the max pending requests per connection is configurable: ```maxPendingPublishdRequestsPerConnection = 1000``` ---- 2020-05-21 04:21:43 UTC - Matteo Merli: Really, with default settings it shouldn't get into an OOM situation ---- 2020-05-21 04:22:26 UTC - Matteo Merli: > I'm still seeing them OOM if the brokers fall behind. What do you mean by fall behind? ---- 2020-05-21 04:30:03 UTC - Luke Stephenson: I see logs like this before the memory pressure: `04:13:34.282 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Timed-out 835 operations to channel` ---- 2020-05-21 04:30:11 UTC - Luke Stephenson: Assuming the bookies can't keep up with the brokers ---- 2020-05-21 04:31:18 UTC - Matteo Merli: How many topics and producers overall are you using?
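The broker-side throttling described above has a client-side counterpart: the producer's pending-message queue. A minimal sketch with the `pulsar-client` Python package follows; the broker URL, topic, and numeric values are illustrative, not recommendations:
```
import pulsar  # pip install pulsar-client

client = pulsar.Client("pulsar://localhost:6650")  # assumed broker URL
producer = client.create_producer(
    "persistent://public/default/my-topic",
    max_pending_messages=1000,   # cap on unacknowledged in-flight messages
    block_if_queue_full=True,    # block send() instead of raising when the queue is full
    batching_enabled=True,
    batching_max_publish_delay_ms=10,
    send_timeout_millis=30000,
)

payload = b"x" * 5 * 1024  # ~5 KB messages, as in the discussion above
for _ in range(100_000):
    producer.send_async(payload, callback=lambda res, msg_id: None)

producer.flush()
client.close()
```
With the queue capped and `block_if_queue_full=True`, a producer that outruns the cluster slows down instead of buffering unbounded amounts of data, which complements the broker's per-connection pause described above.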
---- 2020-05-21 04:34:26 UTC - Matteo Merli: any steps that I can use to reproduce? ---- 2020-05-21 04:41:38 UTC - Patrik Kleindl: People treating a community Slack channel like a support service will be unhappy anywhere sooner or later. I have positive experience with Kafka in this regard, but many people underestimate the volume of requests and the fact that this is not a paid service. And with Kafka there are a lot of people on Confluent's payroll doing community work. Scaling up will be the challenge for Pulsar here, in my opinion. And I too find the community here helpful and responsive. ---- 2020-05-21 05:03:14 UTC - Enrico Olivelli: Hi @Alan Broddle, in which script are you passing that property? A Pulsar bash script? ---- 2020-05-21 05:15:56 UTC - Enrico Olivelli: If it is a Pulsar broker script, we could make it configurable using the Pulsar configuration file; that would be far better ---- 2020-05-21 05:24:31 UTC - Luke Stephenson: 1 persistent topic with 8 partitions. 2 producers running in parallel publishing 10 million messages in total to that topic. Each message is just 5KB of random data. It was stable handling 23k messages per second inbound, but when that increased to 40k per second (we turned on batching in this instance, but more producer replicas has the same effect) the cluster became unstable. The cluster is set up in EKS. Bookie setup:
```
bookkeeper:
  replicaCount: 9
  resources:
    requests:
      memory: 2000Mi
      cpu: 3.0
  configData:
    BOOKIE_MEM: >
      " -XX:+UseContainerSupport -XX:InitialRAMPercentage=40.0 -XX:MinRAMPercentage=20.0 -XX:MaxRAMPercentage=80.0 "
    PULSAR_MEM: >
      " -XX:+UseContainerSupport -XX:InitialRAMPercentage=40.0 -XX:MinRAMPercentage=20.0 -XX:MaxRAMPercentage=80.0 "
  volumes:
    ledgers:
      name: ledgers
      size: 3000Gi
      local_storage: true
```
broker setup:
```
broker:
  resources:
    requests:
      memory: 1024Mi
      cpu: 2.0
  configData:
    PULSAR_MEM: >
      " -XX:+UseContainerSupport -XX:InitialRAMPercentage=40.0 -XX:MinRAMPercentage=20.0 -XX:MaxRAMPercentage=80.0 "
```
(3 broker replicas, using the default from the helm chart) ---- 2020-05-21 05:24:52 UTC - Luke Stephenson: There are no consumers at this point ---- 2020-05-21 05:39:12 UTC - Matteo Merli: Gotcha. I’ll try to repro tomorrow ---- 2020-05-21 05:54:05 UTC - Matteo Merli: One sec: what kind of ensemble size and write quorum are you using? ---- 2020-05-21 05:55:17 UTC - Matteo Merli: Is that 3-2-2? That would explain the mem growing. If that’s the case, switch to 2-2-2 or 3-3-3 and it will fix it ---- 2020-05-21 06:06:08 UTC - Luke Stephenson: ok. I need to read more to learn about what the ensemble size / write quorum is ---- 2020-05-21 06:06:44 UTC - Matteo Merli: Sorry, I meant: is that 3-3-2? In other words, if the write quorum is > the ack quorum, there is a behaviour in the BK client that will accumulate memory while retrying writes on other bookies ---- 2020-05-21 06:07:09 UTC - Matteo Merli: Did you change config related to that? ---- 2020-05-21 06:09:00 UTC - Matteo Merli: <https://github.com/apache/pulsar/blob/cc15ad58583a640ad311fe5a91a9968ed46bf335/conf/broker.conf#L655|https://github.com/apache/pulsar/blob/cc15ad58583a640ad311fe5a91a9968ed46bf335/conf/broker.conf#L655> ---- 2020-05-21 06:09:17 UTC - Luke Stephenson: nope. just using defaults from <https://github.com/apache/pulsar-helm-chart> unless changed as shown above. But haven't changed the quorum config. ---- 2020-05-21 06:10:59 UTC - Luke Stephenson: I've scaled the number of bookies up from the default. We had the same broker instability before at lower throughput with fewer bookies though.
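The ensemble / write quorum / ack quorum values discussed above default to the broker.conf settings linked by Matteo, but can also be read or overridden per namespace. A sketch against the namespace persistence-policies admin endpoint follows; it assumes the admin REST API on port 8080 without auth, and the namespace name is illustrative (the `pulsar-admin namespaces set-persistence` CLI does the equivalent):
```
import requests  # pip install requests

ADMIN_URL = "http://localhost:8080"  # assumed admin endpoint
NAMESPACE = "public/default"         # illustrative namespace

# Read the current persistence policies; if none were set at the namespace level,
# the broker.conf defaults (managedLedgerDefaultEnsembleSize / WriteQuorum / AckQuorum) apply.
resp = requests.get(f"{ADMIN_URL}/admin/v2/namespaces/{NAMESPACE}/persistence")
print(resp.status_code, resp.text)

# Override to 3-3-3 so write quorum == ack quorum, avoiding the
# write-quorum > ack-quorum retry/memory behaviour mentioned above.
requests.post(
    f"{ADMIN_URL}/admin/v2/namespaces/{NAMESPACE}/persistence",
    json={
        "bookkeeperEnsemble": 3,
        "bookkeeperWriteQuorum": 3,
        "bookkeeperAckQuorum": 3,
        "managedLedgerMaxMarkDeleteRate": 0.0,
    },
)
```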
---- 2020-05-21 06:19:44 UTC - Patrik Kleindl: Hi, does anyone have examples of Python unit or integration tests for Pulsar Functions? I am working my way through the Java examples, but there seem to be none in Python so far. ---- 2020-05-21 06:20:01 UTC - Ken Huang: Hi, can I set bookkeeperClientRegionawarePolicyEnabled=true without a configurationStore in multiple clusters? Can it do synchronous geo-replication? ---- 2020-05-21 06:20:22 UTC - Matteo Merli: Can you share the broker logs, in particular the beginning of it, where it’s printing all the config vars? ---- 2020-05-21 06:28:51 UTC - Sijie Guo: The configuration store is used for replication between multiple Pulsar clusters. The BookKeeper region-aware policy is more about geo-replication within one Pulsar cluster. So you don’t need the configuration store to do that. ---- 2020-05-21 06:29:13 UTC - Sijie Guo: They are concepts at two different layers. ---- 2020-05-21 06:39:42 UTC - kunni: @kunni has joined the channel ---- 2020-05-21 06:47:59 UTC - Ken Huang: So multiple Pulsar clusters can do synchronous geo-replication? I mean the client receives an ack only after at least 2 clusters receive the message ----
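On the Python test question above, a minimal unit-test sketch is shown here. It assumes the Python Functions SDK (`pulsar.Function` with a `process(input, context)` method); the `ExclamationFunction` itself is made up for illustration, and a mock stands in for the runtime-supplied context:
```
import unittest
from unittest.mock import MagicMock

from pulsar import Function  # Python Functions SDK base class


class ExclamationFunction(Function):
    """Toy function for this sketch: appends '!' and records a metric."""

    def process(self, input, context):
        context.record_metric("processed", 1)
        return input + "!"


class ExclamationFunctionTest(unittest.TestCase):
    def test_process_appends_exclamation(self):
        # The Functions runtime normally supplies the context; for a unit test a
        # mock is enough, since this function only calls record_metric() on it.
        context = MagicMock()
        result = ExclamationFunction().process("hello", context)
        self.assertEqual(result, "hello!")
        context.record_metric.assert_called_once_with("processed", 1)


if __name__ == "__main__":
    unittest.main()
```
For integration-style tests, the same function can be deployed to a standalone Pulsar (e.g. via `pulsar-admin functions create --py ...`) and exercised end to end with a producer and consumer on the input and output topics.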
