2020-06-01 10:59:41 UTC - Oleg Toubenshlak: @Oleg Toubenshlak has joined the channel
----
2020-06-01 11:00:56 UTC - Shalom Tuby: @Shalom Tuby has joined the channel
----
2020-06-01 12:11:30 UTC - Tony Free: @Tony Free has joined the channel
----
2020-06-01 13:29:11 UTC - Miguel Martins: Pulsar is probably not fit for an event store. I believe it doesn't have a way to perform inserts with an optimistic concurrency check and, as you mention, no way to retrieve all events by aggregate id without creating a topic for each one. IMO, it would be better to use a dedicated event store and use it as a source for Pulsar, e.g. Postgres as the event store, publishing to Pulsar using Debezium.
----
2020-06-01 13:29:41 UTC - Ankush: @Ankush has joined the channel
----
2020-06-01 13:55:37 UTC - Oleg Toubenshlak: Hi everyone,
I have a question regarding custom Pulsar connector (sink) deployment with Kubernetes (and the latest Helm version). I need to access some files from an external volume in it. Is there a way to get the Pulsar broker to mount the external volume for the automatically generated connector, which runs as a separate pod created by the broker? The connector is deployed via pulsar-admin as “functions worker with brokers”. Thanks!
----
2020-06-01 14:34:54 UTC - Hugo Smitter: @Hugo Smitter has joined the channel
----
2020-06-01 15:00:52 UTC - Oleg Toubenshlak: Hi again, another question regarding Pulsar connectors. Is there a way to run a connector worker with a Java version higher than Java 8? For example via the pulsarDockerImageName configuration with a Java 13 Docker image?
----
2020-06-01 15:55:38 UTC - Gary Fredericks: Is there any way to know whether a particular version of the Pulsar client is compatible with a particular version of the server?
----
2020-06-01 15:55:54 UTC - Gary Fredericks: I'm wondering about using client 2.5.2 with server 2.4.1 in particular
----
2020-06-01 16:01:36 UTC - Ebere Abanonu: You look at the protocol version: 2.5.2 is protocol version 15, while 2.4.1 is protocol version 14.
----
2020-06-01 16:02:11 UTC - Gary Fredericks: Okay, thanks
----
2020-06-01 16:02:28 UTC - Ebere Abanonu: The server is compatible with any client as long as the protocol version is indicated when connecting.
+1 : Frank Kelly
----
2020-06-01 16:04:05 UTC - Tester T: Hi, folks! We plan to use Pulsar as the event journal in an event-sourced system, so we are planning the following topic design:
1. A topic per DDD aggregate. These won't have direct subscriptions, only occasional use of the reader API. Messages must be stored indefinitely (via tiered storage).
2. A couple of aggregated streams for integration with other services, event processing, and job queuing. Planned to be implemented via a Pulsar function that listens on a topic pattern or a full namespace. Messages could be deleted after some retention period.
E.g. 1. event-journaled topics: ‘orders_shop234’, ‘orders_shop124’. 2. integration topics: ‘paid-orders’, ‘placed-orders’, which will be hydrated by Pulsar functions from the event-journaled topics.
So, the question is: what are the possible limitations of this design? Would there be any scaling issues, e.g. for 50k aggregates (event-journaled topics)? Thanks for answers!
----
2020-06-01 16:04:40 UTC - Gary Fredericks: so I shouldn't _have_ to pay attention to this, ideally?
----
2020-06-01 16:06:35 UTC - Ebere Abanonu: That will be the job of the client library. Just pay attention to the features you will be getting.
----
2020-06-01 16:06:51 UTC - Gary Fredericks: cool, thanks
----
2020-06-01 16:08:42 UTC - Ebere Abanonu: How do you intend to replay the events? That will determine your limitations.
----
2020-06-01 16:11:46 UTC - Tester T: Why not use the reader API for this purpose? Event rewind is required only in the event-journaled topics.
----
2020-06-01 16:14:29 UTC - Ebere Abanonu: The reader API is the best fit for that.
----
2020-06-01 16:14:33 UTC - David Kjerrumgaard: I am not aware of any limitations beyond what the hardware can handle :smiley:
+1 : Kirill Kosenko
----
2020-06-01 16:15:07 UTC - Ebere Abanonu: I'm actually working on something on event sourcing with Pulsar
----
2020-06-01 16:19:43 UTC - Tester T: I am a little bit worried about function subscriptions that will potentially listen to ~50-70k topics.
> I'm actually working on something on event sourcing with Pulsar
Yeah, I saw your contribution to the Pulsar <http://akka.net|akka.net> persistence plugin, keep up the good work!
----
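(For reference, a minimal sketch of the replay approach discussed above: the Reader API rewinds one event-journaled topic from the beginning without creating a subscription. The service URL and the per-aggregate topic name are placeholders.)
```
import org.apache.pulsar.client.api.*;

public class ReplayAggregate {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // Read one aggregate's journal topic from the start, without creating a subscription
        Reader<byte[]> reader = client.newReader()
                .topic("persistent://public/default/orders_shop234") // hypothetical per-aggregate topic
                .startMessageId(MessageId.earliest)
                .create();

        while (reader.hasMessageAvailable()) {
            Message<byte[]> event = reader.readNext();
            // apply(event): rebuild the aggregate state from its events
        }

        reader.close();
        client.close();
    }
}
```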
2020-06-01 16:23:39 UTC - Ebere Abanonu: I think the best way to know is to take it for that sort of rough ride. But I don't think that should be an issue, should it @Sijie Guo?
----
2020-06-01 16:23:49 UTC - Ebere Abanonu: Thanks
----
2020-06-01 16:24:44 UTC - Addison Higham: Pulsar doesn't have a limit on the number of subscriptions, but there is some cost in the client SDK for keeping track of all of them (which requires a fair bit of memory). More limiting is that it just takes a while for Pulsar to iterate through and create all the subscriptions, which can cause some timeouts. In a Flink job we had, it involved changing a couple of timeouts; not sure if that is exposed in functions though. I would suggest testing that first.
----
2020-06-01 16:35:15 UTC - Tester T: I am not going to iterate through topics and subscribe individually. Instead there will be a couple (from 1 to 20) of function subscriptions that will listen to the full namespace and process events. I'll definitely give it a try! Thanks for the answers!
----
2020-06-01 17:54:41 UTC - Kirill Merkushev: Disagree here, we have used Pulsar as an event source for quite a while now. We have 20M events and use infinite retention, and in case we need to restream something we just read from the beginning - it now takes something like 30 min to fully recreate a database from the topic (once per quarter it's fine for DB migrations, especially with no downtime). We use 32 partitions on the topic and the user ID as the key. True, Pulsar lacks some optimistic checks, but we rely on a write-through DB here, with a status update later on the consumer thread. Maintaining Debezium with Postgres and Pulsar could be quite heavy. I would advise checking whether you need to store any personal or deletable data in Pulsar, as it can be quite hard to get rid of selected aggregates (we use an event gateway with a custom offloader plugin which stores personal data in the DB and keeps only a reference in Pulsar, to make this data removable). Also check the latest experimental per-key subscription, which should solve your per-aggregate consume scenario.
----
2020-06-01 17:55:01 UTC - Kirill Merkushev: We use this event gateway: <https://github.com/bsideup/liiklus>
----
2020-06-01 17:57:45 UTC - Kirill Merkushev: Pulsar's regexp subscription internally creates a consumer per topic, so you would have thousands of consumers sharing the connection pool internally; I don't think that's scalable.
----
2020-06-01 20:05:15 UTC - Kirill Kosenko: Thank you guys for your replies
----
2020-06-01 20:27:03 UTC - lujop: Thank you very much for your response, penghui. After processing your responses I have more doubts, but they are mainly derived or new ones, and I will start a different thread for each.
+1 : Penghui Li
----
2020-06-01 20:38:16 UTC - lujop: I made a little test with delayed delivery and it seems that when it is applied, message order is ignored, isn't it? For example, for a subscriber with key_shared, if I send a message with key=1 and a 30-second delay and then another with the same key and a 5-second delay, the second one is consumed first. Is this how it works?
----
2020-06-01 20:43:17 UTC - Vladimir Shchur: Topic pattern does exactly that - it creates a separate consumer for each topic involved. So I would not call it a good idea.
----
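(As a reference for the delayed-delivery question above, a minimal sketch of producing the two messages with PIP-26's deliverAfter; the topic name and service URL are placeholders. The delay is attached per message, which matches the observed behavior that the 5-second message surfaces before the 30-second one.)
```
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.*;

public class DelayedSend {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("delayed-demo") // hypothetical topic
                .create();

        // m1: visible to the consumer no earlier than 30 seconds from now
        producer.newMessage()
                .key("1")
                .value("m1")
                .deliverAfter(30, TimeUnit.SECONDS)
                .send();

        // m2: same key, but only a 5-second delay, so it is delivered before m1
        producer.newMessage()
                .key("1")
                .value("m2")
                .deliverAfter(5, TimeUnit.SECONDS)
                .send();

        producer.close();
        client.close();
    }
}
```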
2020-06-01 20:54:16 UTC - lujop: I'm evaluating Pulsar for a classic message-queue use case for integrations, without realtime needs and not a huge number of messages. Some of the features are a very good match for me, like multiple subscriptions, cheap topic creation, the topic-per-entity pattern, Pulsar SQL, and the flexibility to move to more advanced streaming use cases if needed later. One of my concerns is that the documentation about production requirements says a single-machine instance is only for development purposes, but if there are no realtime use cases and not a lot of load is expected, is it realistic to have only one instance, as long as messages can be regenerated some way in case of disaster? And about disaster recovery, if it were needed, are backups of Pulsar possible, or is the way to rely on cluster replication?
----
2020-06-01 20:59:04 UTC - Greg Methvin: Isn't this what you would expect? You asked for the second message to be delivered in 5 seconds and the first one to be delivered in 30 seconds.
----
2020-06-01 21:04:30 UTC - lujop: No, I expected message order to take precedence over the delay: the first one delivered after 30 seconds, and the second one immediately after the first. But I understand that when you use delays, order is not important?
----
2020-06-01 21:12:12 UTC - Addison Higham: I can't speak to whether there is something about Pulsar standalone that would make it a no-go in production (besides the obvious SPOF and no ability to scale it out). What you describe seems somewhat reasonable to me, but there may be details of how standalone is run/configured that cause more problems. One other thing I wanted to mention though: under standalone, you can put all the data on a single disk/volume. That would make it much, much easier to snapshot the disk and have it be consistent. The replication factor of BookKeeper/ZooKeeper is what handles most of that in a clustered scenario (as multiple disks/services are much more difficult to do traditional backups on), but with a single disk in standalone, it should be relatively safe to just do DR with a disk backup.
----
2020-06-01 21:35:50 UTC - Greg Methvin: I'm guessing the documentation is probably lacking here, but as I understand it the shared and key_shared subscription types don't guarantee message ordering. What behavior were you expecting in your example?
----
2020-06-01 21:36:42 UTC - lujop: I have some doubts about how <https://github.com/apache/pulsar/wiki/PIP-26:-Delayed-Message-Delivery|Delayed Delivery> and the new <https://github.com/apache/pulsar/wiki/PIP-58-%3A-Support-Consumers--Set-Custom-Retry-Delay|Consumers Set Custom Retry Delay feature> behave with a big number of retries and exponential backoff. For example, for a queue used for an external integration, where the first retry starts after 1 minute but the last one happens after 2 days due to exponential backoff, can that have a huge impact on the append-only Pulsar structures? Because although there is only one old message that is not processed, the log cannot be discarded for newer entries until the older one is processed, can it? Can this be a problem, or is it internally optimized using in-memory caching?
Also, to my understanding, custom retry delays will use another topic, so if I expect strict order m1, m2, m3 and need to retry m1 and don't want m2 and m3 to be processed, would I need to manage that myself with some precondition checks and manually force m2 and m3 retries as well?
----
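(A minimal sketch of the PIP-58 retry flow linked above, assuming a client/broker version that ships the feature; the topic, subscription name, and retry counts are hypothetical. With enableRetry, reconsumeLater republishes the failed message to a separate retry topic, so later messages on the original topic are not blocked - which is why the strict-ordering m1/m2/m3 case still needs the manual precondition checks described above.)
```
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.*;

public class RetryConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // enableRetry(true) also subscribes the consumer to its retry topic
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("external-integration") // hypothetical topic
                .subscriptionName("integration-sub")
                .subscriptionType(SubscriptionType.Shared)
                .enableRetry(true)
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(5) // after 5 retries the message goes to the dead-letter topic
                        .build())
                .subscribe();

        Message<String> msg = consumer.receive();
        try {
            // process(msg) ...
            consumer.acknowledge(msg);
        } catch (Exception e) {
            // Ask for redelivery in 60 seconds via the retry topic instead of blocking the original topic
            consumer.reconsumeLater(msg, 60, TimeUnit.SECONDS);
        }

        consumer.close();
        client.close();
    }
}
```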
2020-06-01 21:44:36 UTC - Alexander Ursu: Was wondering if anyone has used Pulsar SQL (Presto) successfully and securely in Kubernetes. I'm not quite fond of it and unsure how to configure authentication for it. Would like to hear any success stories.
----
2020-06-01 21:46:27 UTC - lujop: For key_shared I initially expected messages with the same key to be processed in order. That is:
T0 -> m1 queued with key=K1 and delay 30s
T0+1s -> m2 queued with key=K1 and delay 5s
T0+30s -> m1 is processed only after its delay passes. m2 hasn't been processed yet because, although its delay has passed, order is preserved
T0+30s -> just after m1 is processed, m2 is processed as well
----
2020-06-01 22:05:31 UTC - Sijie Guo: <https://github.com/streamnative/charts/blob/master/charts/pulsar/templates/presto/presto-coordinator-configmap.yaml#L170> You can check this configmap to see how to configure authentication for Presto.
----
2020-06-01 22:07:03 UTC - Raphael Enns: I was looking at <https://pulsar.apache.org/docs/en/deploy-bare-metal/>. We don't need any data redundancy; the data we're sending doesn't need to last long. We're also not pushing through a large amount or frequency of data. What would you recommend for a simple, stable production setup? Would 1 ZooKeeper process, 1 BookKeeper process and 1 Pulsar broker process all running on the same machine work?
----
2020-06-02 04:43:18 UTC - Alexander Ursu: Ah, thank you. I was also wondering more along the lines of how external clients connect to the Presto cluster securely, specifically traffic external to the k8s cluster. Can that be done as part of this Helm chart too?
----
2020-06-02 08:02:18 UTC - xue: There are three brokers in a Pulsar cluster. Using the synchronous sending interface of the Pulsar producer, we only get about 100 TPS. Code:
Producer<String> stringProducer = client.newProducer(Schema.STRING)
        .topic("my-topic")
        .create();
stringProducer.send("My message");
----
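(On the throughput question above: synchronous send() waits one broker round trip per message, so a single producer is latency-bound. A minimal sketch, with a placeholder service URL and illustrative settings, of using sendAsync() plus batching to raise the publish rate.)
```
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.*;

public class AsyncProducerDemo {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("my-topic")
                .enableBatching(true)                              // group small messages into one broker request
                .batchingMaxPublishDelay(5, TimeUnit.MILLISECONDS) // flush a batch at least every 5 ms
                .blockIfQueueFull(true)                            // apply backpressure instead of failing fast
                .create();

        // sendAsync() pipelines messages instead of waiting for each acknowledgement
        for (int i = 0; i < 10_000; i++) {
            producer.sendAsync("My message " + i)
                    .exceptionally(ex -> { ex.printStackTrace(); return null; });
        }

        producer.flush(); // wait for everything queued so far to be sent
        producer.close();
        client.close();
    }
}
```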
