Slack digest for #general - 2020-05-07

Apache Pulsar Slack Thu, 07 May 2020 02:11:52 -0700

2020-05-06 10:27:41 UTC - Franck Schmidlin: Follow-up question: what differences
are there between io #sinks and #functions?

I want to post topic messages to an http endpoint, à la pulsar beam.

Functions feel more versatile than sinks, but would I be missing a trick? Is
there any difference in processing, isolation, etc?
----
2020-05-06 10:36:36 UTC - alex kurtser: Hello, I would like to better understand
the managedLedgerDefaultEnsembleSize parameter.

If I want to scale out my BookKeeper instances from 3 to 6, do I need to set
managedLedgerDefaultEnsembleSize to 6 as well in order to use all 6
bookies?
----
2020-05-06 12:08:10 UTC - Pierre Zemb: Hi all :wave:
I have a question: why are so many parameters, like geo-replication,
tiered storage, retention and others, set at the namespace level and not the
topic level?
----
2020-05-06 12:16:01 UTC - Alexandre DUVAL:
<https://github.com/apache/pulsar/wiki/PIP-51%3A-Tenant-policy-support>

I think it was implemented that way because of earlier needs. More generally,
I think it's still a "todo".
----
2020-05-06 12:17:12 UTC - Alexandre DUVAL: I'm open to contribute with you if
you go for it :wink:.
----
2020-05-06 12:18:05 UTC - Pierre Zemb: thanks @Alexandre DUVAL! Found PIP 39:
<https://github.com/apache/pulsar/wiki/PIP-39:-Namespace-Change-Events>
----
2020-05-06 12:19:24 UTC - Pierre Zemb: I might work on that part indeed, I will
keep you posted :slightly_smiling_face:
----
2020-05-06 12:19:44 UTC - Alexandre DUVAL: PIP39 is really interesting.
----
2020-05-06 12:21:37 UTC - Alexandre DUVAL: About this work, I think
<https://github.com/apache/pulsar/pull/6428> will be interesting (currently
only working for namespaces).
----
2020-05-06 12:26:53 UTC - Pierre Zemb: thanks a lot @Alexandre DUVAL for the
links, will dive into those
----
2020-05-06 12:39:53 UTC - Damien Roualen: Hello,
I have a question regarding Presto.
Is it better to keep the Presto bundled with Pulsar (for instance 2.5.0, with
a custom version of Presto, e.g. 0.206, added to the pom file)?
Or to deploy Presto from the official website (<https://prestodb.io/>) and add
the Pulsar connector plugin?
Context: we have an existing Pulsar cluster, and we would like to deploy Presto
and connect to the cluster.
----
2020-05-06 12:45:16 UTC - rani: @Sijie Guo, any clues here^?
----
2020-05-06 13:38:04 UTC - Ming: A sink moves outbound data from Pulsar to an
external system. In terms of data flow, a Pulsar Function in most cases keeps
the data within Pulsar (i.e. sends it to another topic). If you have external
data I/O, `sink` or `source` connectors are the right approach. In terms of
underlying implementation, connectors and Pulsar Functions are very similar,
but they serve different purposes. If you mean posting data to HTTP endpoints,
a sink connector is more applicable. However, Beam is neither; it was developed
as a standalone component to be more versatile and pluggable.
+1 : Franck Schmidlin
----
2020-05-06 13:44:37 UTC - Ming: @Kirill Merkushev you can use the admin API,
admin CLI or REST API to rewind the cursor once the function subscription is
created. An example is
<https://pulsar.apache.org/admin-rest-api/?version=2.5.1#operation/resetCursor>
+1 : Kirill Merkushev
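
A minimal sketch of building the reset-cursor request URL described in the
linked REST reference. The broker address, topic, function name and timestamp
are hypothetical, and the subscription name assumes the function uses its
default subscription, named `tenant/namespace/function-name`; check the actual
name with the admin tools before relying on it. The sketch only builds the URL;
sending it would be an HTTP POST.

```python
from urllib.parse import quote


def reset_cursor_url(admin_base, tenant, namespace, topic, subscription, timestamp_ms):
    """Build the admin REST URL that resets a subscription to a timestamp."""
    # The subscription name is percent-encoded because it may contain "/".
    sub = quote(subscription, safe="")
    return (
        f"{admin_base}/admin/v2/persistent/{tenant}/{namespace}/{topic}"
        f"/subscription/{sub}/resetcursor/{timestamp_ms}"
    )


# Hypothetical values: a local broker and a function called "my-function".
# 1588723200000 is 2020-05-06 00:00:00 UTC in epoch milliseconds.
url = reset_cursor_url(
    "http://localhost:8080", "public", "default", "input-topic",
    "public/default/my-function", 1588723200000,
)
print(url)
```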
----
2020-05-06 14:09:14 UTC - hugues DESLANDES: Hi,
We are testing the pulsar-flink connector (without the schema registry,
<https://github.com/streamnative/pulsar-flink>). From Flink we would like to
sink some empty messages into Pulsar (to use compaction on a Pulsar topic).
As far as I understand the connector, there is no way to do this: we provide
a message and a way to extract the key from the message, so how could we make
the message empty?
Any tip or workaround would be helpful. Thanks
----
2020-05-06 14:37:12 UTC - Penghui Li: I have created two issues to track the
documentation for Proxy metrics and Presto worker metrics.
<https://github.com/apache/pulsar/issues/6896>
<https://github.com/apache/pulsar/issues/6897>
I marked them help-wanted. If you are interested in fixing them, you are welcome
to.
----
2020-05-06 14:40:56 UTC - Penghui Li: No, the new bookies will be selected when
the managed ledger rolls over. For more details you can read
<https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works>
+1 : alex kurtser
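
A toy simulation of why the ensemble size does not have to match the number of
bookies: each new ledger gets its own ensemble, so repeated rollovers spread
data over all bookies. The random placement rule and the bookie names are
assumptions for illustration only; real BookKeeper placement policies are more
involved.

```python
import random


def bookies_used(bookies, ensemble_size, ledger_count, seed=0):
    """Return the set of bookies that stored at least one ledger."""
    rng = random.Random(seed)
    used = set()
    for _ in range(ledger_count):
        # Toy placement rule: each new ledger gets a random ensemble.
        used.update(rng.sample(bookies, ensemble_size))
    return used


bookies = [f"bookie-{i}" for i in range(1, 7)]
used = bookies_used(bookies, ensemble_size=3, ledger_count=50)
print(sorted(used))
```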
----
2020-05-06 14:44:40 UTC - Kirill Merkushev: can I pre-create the subscription
the same way and then create the function?
----
2020-05-06 15:26:30 UTC - Allen ONeill: Does anyone know of a hosted/managed
version of Pulsar, like the ones I can get for e.g. Cassandra/Kafka?
----
2020-05-06 15:31:28 UTC - Chris Bartholomew: We do at <https://kafkaesque.io/>.
If you have any questions about it, let me know.
----
2020-05-06 15:36:14 UTC - Chris Hansen: Really? They seem to work for me but I
only tested `@JsonCreator` and `@JsonProperty`. Without those, I was getting an
exception.
----
2020-05-06 15:48:48 UTC - Ming: @Kirill Merkushev You not only have to
pre-create the subscription, you also have to create the input topic. Although I
have not tried it, it should work since the default subscription type is shared.
----
2020-05-06 15:57:41 UTC - Ricardo Ferreira: @Ricardo Ferreira has joined the
channel
----
2020-05-06 16:02:28 UTC - Alex: @Alex has joined the channel
----
2020-05-06 16:56:13 UTC - Manjunath Ghargi: Hi All,
I'm looking for a Performance Test tool through which we can benchmark all the
performance metrics similar to Jmeter or Gatling which are some standard tools
for performance benchmarking. Can someone kindly share more details if any of
these external tools supports Pulsar and if we can make use of them for
performance testing? Or any other info related to performance testing of the
Pulsar server scaling up to ~30k to 50k TPS?
+1 : Franck Schmidlin
----
2020-05-06 17:18:55 UTC - Addison Higham:
<http://openmessaging.cloud/docs/benchmarks/> At one point you had to build it
yourself as the published Docker images were broken, but perhaps they work again
now
tada : Shivji Kumar Jha
+1 : Franck Schmidlin
----
2020-05-06 17:21:02 UTC - Sijie Guo: @rani for Python functions, one
protobuf-related change was not cherry-picked into the 2.5.1 release.
<https://github.com/apache/pulsar/issues/6858>
----
2020-05-06 17:27:10 UTC - Sijie Guo: @Kirill Merkushev

• If you already have a function running, you can use reset-cursor (i.e.
admin-cli or RESTful API) to reset the cursor for the subscription created by
the function.
• You can pre-create a subscription at the position you would like to start
from before submitting a function.
----
2020-05-06 17:34:58 UTC - Sijie Guo: So this is related to shading problems.
The Jackson related libraries are shaded. You can use `pulsar-client-original`
to get around this issue.

Can you create an issue for us to improve this behavior?
----
2020-05-06 17:36:11 UTC - Shivji Kumar Jha: Hi @Manjunath Ghargi We used
<https://locust.io/>. A pretty good tool for a dev. We wrote
<https://docs.locust.io/en/stable/writing-a-locustfile.html#declaring-tasks|locust tasks>
that use the Python Pulsar client to send/receive Pulsar messages.
These tasks were then baked into a Docker container, and we could just launch
more and more instances of this container to increase throughput on Pulsar.

By default, the perf results are ephemeral so you could write to your favourite
graphing tool (statsd for us) and then follow in there...

Not out of the box, but very flexible!
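
A plain-Python sketch of the core of such a load task, without Locust: call a
send function repeatedly and report the rate. The `run_load` helper and the stub
transport are hypothetical; with the Python client installed, `send` could be
the `send` method of a producer created from `pulsar.Client`.

```python
import time


def run_load(send, message_count, payload=b"x" * 100):
    """Call send(payload) message_count times; return (sent, msgs_per_sec)."""
    start = time.perf_counter()
    for _ in range(message_count):
        send(payload)
    elapsed = time.perf_counter() - start
    rate = message_count / elapsed if elapsed > 0 else float("inf")
    return message_count, rate


# Stub transport so the sketch runs without a broker. With the Python
# client installed this could be a producer's send method, for example
# pulsar.Client("pulsar://localhost:6650").create_producer("my-topic").send
received = []
sent, rate = run_load(received.append, 10_000)
print(f"sent={sent} rate={rate:.0f} msgs/sec")
```

Running several copies of such a script in containers, as described above, is
what scales the offered load.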
----
2020-05-06 17:36:18 UTC - Sijie Guo: The proxy exposes metrics. But I don’t
think Presto exposes Prometheus metrics.
----
2020-05-06 17:37:24 UTC - Sijie Guo: Along with PIP-39, we will introduce
topic-level policy. /cc @Penghui Li
----
2020-05-06 17:39:30 UTC - Sijie Guo: @Damien Roualen :

I would recommend getting started with the one bundled with Pulsar, because you
don’t need to worry about compatibility issues between different Presto
versions.

But deploying the official Presto distribution has its advantage: you can
always pull in the latest changes from Presto.
----
2020-05-06 17:40:44 UTC - Sijie Guo: Hi @hugues DESLANDES - I don’t think the
current connector implementation supports this. Can you create an issue for it?
----
2020-05-06 17:43:56 UTC - Manjunath Ghargi: @Shivji Kumar Jha: Can you please
share sample code for the locust task that you have written, or any open Git
repo that I can refer to?
----
2020-05-06 17:45:33 UTC - Manjunath Ghargi: Thanks I'll look into this
framework.
----
2020-05-06 17:50:28 UTC - Sijie Guo: @Allen ONeill - Please checkout
<https://streamnative.io/support/managed-pulsar-service/> built by the original
developers of Pulsar/BookKeeper.
+1 : Shivji Kumar Jha
----
2020-05-06 17:51:27 UTC - Shivji Kumar Jha: @Manjunath Ghargi Here is a quick
<https://gist.github.com/shiv4289/fba1f68542b2fd4505e72d91de91b9f2|gist> that
you could refer to.
----
2020-05-06 17:56:23 UTC - Manjunath Ghargi: Thanks Shiv.
----
2020-05-06 17:57:38 UTC - Kirill Merkushev: is it safe to create subscription
via consumer (to change the subscription type to exclusive/failover)? As this
option is missing in the api @Sijie Guo
----
2020-05-06 17:59:51 UTC - Sijie Guo: It is safe to create the subscription.
Functions don’t support exclusive, so don’t use the exclusive type.

You can’t change the subscription type after a subscription is created.
----
2020-05-06 18:24:08 UTC - Prasanth Lemati: @Prasanth Lemati has joined the
channel
----
2020-05-06 18:52:15 UTC - Addison Higham: hrm, a team at my company is curious
about creating lots of namespaces (like a couple hundred). I would imagine that
would create additional load on the metadata store with load balancing and
policy management, but should that be much of a concern?
+1 : Franck Schmidlin, 高天赐
----
2020-05-06 18:53:33 UTC - Addison Higham: (still figuring out if the use case
makes sense, but curious conceptually if that would place undue stress anywhere)
----
2020-05-06 19:09:31 UTC - Gary Fredericks: I'm trying to figure out how to
reconcile these two things

A) tiered storage lets you store an arbitrarily long history for a topic, and
pulsar lets you read that history
B) the backlog quota feature prevents a subscriber from consuming messages from
too far behind the newest message in a topic
----
2020-05-06 19:09:52 UTC - Gary Fredericks: (this is me trying to work out what
to do about <https://github.com/streamnative/pulsar/issues/931>)
----
2020-05-06 19:12:54 UTC - Gary Fredericks: my suspicion is that the backlog
quota shouldn't apply in certain circumstances, like retention situations where
the data isn't going to be deleted anyhow
----
2020-05-06 19:13:18 UTC - Gary Fredericks: but I don't understand well enough
how backlogs work to be sure of that
----
2020-05-06 19:22:18 UTC - Alexandre DUVAL: What could be the reason for
```[pulsar-io-23-1] WARN org.apache.pulsar.broker.service.ServerCnx -
[/192.168.10.13:43134] java.lang.NoSuchMethodError:
java.nio.ByteBuffer.rewind()Ljava/nio/ByteBuffer; with role proxy-to-broker```
----
2020-05-06 19:31:43 UTC - Addison Higham: @Gary Fredericks are you using
`consumer_backlog_eviction`? I have been curious about that as well. Since
backlogs are per namespace though, you should be able to remove the backlog
quota and see if that fixes the issue.
----
2020-05-06 19:35:49 UTC - Gary Fredericks: @Addison Higham I am, that was the
key thing I didn't know when I filed that issue

the problem is I don't yet know the implications of changing that policy; is
running with an unlimited backlog quota a safe thing to do, in namespaces with
infinite retention?
----
2020-05-06 19:38:33 UTC - Addison Higham: the only thing I can think of is if
a subscription retaining a message prevents either offloading from happening
OR the broker message cache from being cleared out. Otherwise, I don't see why
it would be problematic
----
2020-05-06 19:38:45 UTC - Addison Higham: and I don't know the answer to that
question (but am curious to know as well :slightly_smiling_face: )
----
2020-05-06 19:39:37 UTC - Addison Higham: I have sort of assumed that reader
subscriptions, since they are somewhat different as non-durable cursors, may
not even have backlog quota logic applied, but that might not be true
----
2020-05-06 19:40:13 UTC - Gary Fredericks: well they do, is what I found, but I
was wondering if maybe they shouldn't
----
2020-05-06 19:41:53 UTC - Pierre Zemb: for all :fr: readers, @Steven Le Roux,
@Quentin ADAM and myself recorded a podcast about Pulsar and KoP, enjoy:
<https://bigdatahebdo.com/podcast/episode-99-apache-pulsar-et-kafka-on-pulsar/>
+1 : Florentin Dubois, Gilles Barbier, Alexandre DUVAL, Sijie Guo, Pierre Zemb
fr : Pierre Zemb
clap : Karthik Ramasamy
----
2020-05-06 19:45:23 UTC - Chris Hansen: sure thing
----
2020-05-06 20:25:44 UTC - Franck Schmidlin: I'm looking at the AWS deployment
instructions and the default cluster sizing seems quite large/expensive to my
untrained eye.

<https://pulsar.apache.org/docs/v2.0.1-incubating/deployment/aws-cluster/>

Is there any minimal but functional size for a cluster? I want a realistic
infrastructure for my poc but i won't be hammering it.
In fact, even in production I don't have the kind of volumes that seem to be
the standard use case for pulsar.
----
2020-05-06 20:28:09 UTC - Alexandre DUVAL: @Sijie Guo any idea? (latest master
proxy/broker, 2.5.1 bookkeeper/client)
----
2020-05-06 20:29:01 UTC - Alexandre DUVAL: i don't see major changes on bookies
between both versions so i didn't update the bookies, but maybe i must
----
2020-05-06 20:30:30 UTC - Alexandre DUVAL: the global behavior is that
everything connects well but no message is forwarded
----
2020-05-06 20:39:06 UTC - Sijie Guo: NoSuchMethodError means there is a
dependency conflict that prevents netty from being loaded properly
----
2020-05-06 20:44:32 UTC - Alexandre DUVAL: on the broker itself?
----
2020-05-06 21:18:16 UTC - Alexandre DUVAL: is that related to the bump of netty
``` <netty.version>4.1.48.Final</netty.version>

<netty-tc-native.version>2.0.30.Final</netty-tc-native.version>```
and it conflicted with the pulsar usages on proxy <-> broker connections?
----
2020-05-06 21:53:11 UTC - Greg Methvin: I’m wondering about the same thing
actually. We basically want to have a namespace per customer, of which we might
have 1000 or so.
----
2020-05-06 21:53:47 UTC - Greg Methvin: it’s not totally necessary but it seems
useful.
----
2020-05-06 21:58:01 UTC - Addison Higham: @Franck Schmidlin pulsar scales down
pretty well. I run across 8 regions which are very imbalanced, in my smallest
regions I run brokers with as little as 1 GB of memory. Bookies tend to be a
bit more memory hungry (I have had issues with ensuring it doesn't OOM on heap)
but I can still run it at about 4 GB of memory. Zookeeper can be quite small as
well, 1 GB of heap is fine for it.
----
2020-05-06 21:59:53 UTC - Addison Higham: that size cluster can still be
capable of pushing reasonable throughputs, 10k msgs/sec or more. The real
important bit is just fast disk for bookie journals. I use provisioned IOPS
volumes
----
2020-05-06 22:38:18 UTC - Ming: @Gary Fredericks We were just discussing this
topic of message retention with someone. I think there are several different
concepts here. A) Tiered storage merely extends the disk space. It does not
govern the message retention policy; the retention policy governs how long
messages can be kept. B) The backlog quota puts a limit on how many unacked
messages a subscription can hold. It prevents a topic from growing without
bound if there are too many unacked messages. So the consumer can either ack
those messages, or TTL will force expired messages to be auto-acked. Backlog
is per subscription, so there can be multiple backlogs for a topic because a
topic can have multiple subscriptions.
----
2020-05-06 22:38:42 UTC - Kirill Merkushev: also is there a way to tune the
producer the same way, e.g. change the hashing to murmur32?
----
2020-05-06 22:40:08 UTC - Gary Fredericks: @Ming does a backlog imply extra
resource usage beyond what's already used by the retention?

or more to my use case, is there _any_ benefit to a backlog quota if your
retention is infinite?
----
2020-05-06 23:08:05 UTC - Ming: Backlog and retention policy are two different
concepts. In a vanilla Pulsar configuration, only unacked messages on a
subscription are kept for consumption. This means acked messages and
messages on topics with no subscription (i.e. reader only) can be deleted. This
is why the retention policy was introduced: to allow messages to be retained in
persistent storage. Pulsar tries to keep unacked messages forever. The backlog
quota and TTL are really there to prevent the message queue (I purposefully use
"queue" instead of "topic" since these terms are interchangeable in the queuing
world) from growing indefinitely. Pulsar will delete acked messages and messages
with no subscription as soon as it can (when a trigger such as a time interval
is satisfied), so the retention policy counters this behaviour to keep the
messages. Messages can still be deleted from tiered storage if they no longer
satisfy the retention policy. As you probably already know, messages are not
deleted individually; instead it is the ledger, a collection of messages, that
is deleted.
----
2020-05-06 23:10:33 UTC - Gary Fredericks: Does this mean that if the retention
policy prevents deletion, the backlog quota has no additional effect?
----
2020-05-06 23:10:58 UTC - Kirill Merkushev: aand one more question regarding
functions - is context shared between functions?
----
2020-05-06 23:21:06 UTC - Ming: They work independently. It depends on the
backlog quota policy: with `producer_exception`, once the backlog quota is
reached the producer can no longer send messages and instead receives an
exception from the broker.
----
2020-05-06 23:23:00 UTC - Kirill Merkushev: and btw how do I enable state for
the local runner?
----
2020-05-06 23:23:46 UTC - Ming: While the data retention still could have
plenty of room to persist messages. These are two independent problems Pulsar
tries to tackle. But they interplay too.
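
A toy model of that independence, loosely following the `producer_exception`
policy. It is a sketch under stated assumptions, not broker code: the quota is
counted in messages rather than bytes, retention is assumed infinite so the
stored count never shrinks, and all class names are hypothetical.

```python
class ToySubscription:
    """Toy model: backlog is the number of unacknowledged messages."""

    def __init__(self):
        self.backlog = 0


class ToyTopic:
    def __init__(self, backlog_limit):
        self.backlog_limit = backlog_limit
        self.subscriptions = {}
        self.stored = 0  # messages kept on disk, governed by retention

    def subscribe(self, name):
        self.subscriptions[name] = ToySubscription()

    def publish(self):
        # Modeled on the producer_exception policy: reject the publish
        # once any subscription's backlog has reached the limit.
        if any(s.backlog >= self.backlog_limit for s in self.subscriptions.values()):
            raise RuntimeError("backlog quota exceeded")
        self.stored += 1
        for s in self.subscriptions.values():
            s.backlog += 1

    def ack(self, name, count=1):
        # Acking shrinks the backlog but, with infinite retention assumed,
        # leaves the stored count untouched.
        sub = self.subscriptions[name]
        sub.backlog = max(0, sub.backlog - count)


topic = ToyTopic(backlog_limit=3)
topic.subscribe("slow-sub")
for _ in range(3):
    topic.publish()
try:
    topic.publish()
    rejected = False
except RuntimeError:
    rejected = True
topic.ack("slow-sub", 2)
topic.publish()  # accepted again after acks shrink the backlog
print(rejected, topic.stored)
```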
----
2020-05-06 23:25:16 UTC - Kirill Merkushev: (as I get exception)
```java.lang.RuntimeException: Failed to increment key
'85658b96-f126-413a-8f2a-1304604a6902' by amount '1'
at
org.apache.pulsar.functions.instance.ContextImpl.incrCounter(ContextImpl.java:277)
~[org.apache.pulsar-pulsar-functions-instance-2.5.0.jar:?]
at
ru.lanwen.pulsar.functions.SimpleCtxFunction.process(SimpleCtxFunction.java:13)
~[?:?]```
----
2020-05-06 23:34:12 UTC - Liam Clarke: Hi all,

I'm looking at Pulsar as a Kafka replacement, and I had a question about
delivery guarantees. From reading the docs, it seems that Pulsar's architecture
guarantees "at least once" by default - if a producer sends a record to a
broker, and the broker commits it to BookKeeper and then fails before sending
the ack to the producer, then the producer will try again. However, if I enable
deduplication, it looks like it guarantees 'exactly once' -
<https://pulsar.apache.org/docs/en/cookbooks-deduplication/>

Am I understanding this correctly?

Also, Pulsar IO and Pulsar Functions - am I correct in that they run on the
brokers, as opposed to Kafka Connect / Kafka Streams which run standalone?
----
2020-05-06 23:37:04 UTC - Alexandre DUVAL: oh maybe it's because i compiled
with java11 T_T
----
2020-05-06 23:37:50 UTC - Alexandre DUVAL: and running with java8..
----
2020-05-06 23:48:46 UTC - Ming: It depends on your requirements. Since you are
looking into an AWS cluster, I guess standalone won't be enough for your PoC. So
you might need at least 3 bookies and 3 ZooKeeper pods. But can you get away
with one broker? If it's a PoC, you could use spot instances, which cost 80 to
90% less. You will do fine with m4 or m5. No need to pay a premium for
compute-optimized or storage-optimized VMs. We actually use m4 in our
production cluster, and that is sufficient.
----
2020-05-06 23:50:47 UTC - Chris Hansen:
<https://github.com/apache/pulsar/issues/6902>
----
2020-05-07 00:49:19 UTC - Joshua Dunham: Hi Everyone, Getting an error : Error
creating ledger for allocating /stream/storage/streamsXXX... is this a disk
storage issue or ZooKeeper issue?
----
2020-05-07 03:25:29 UTC - Raphael Enns: @Raphael Enns has joined the channel
----
2020-05-07 04:47:28 UTC - Sijie Guo: Correct. Broker de-duplication can
achieve exactly-once producing.
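
A toy model of the idea, assuming the broker remembers the highest sequence ID
it has persisted for each producer and drops retries at or below it. The class
and method names are hypothetical and this is not the broker's code.

```python
class ToyBroker:
    """Toy model of broker-side deduplication by producer sequence ID."""

    def __init__(self):
        self.log = []
        self.last_sequence_id = {}  # producer name -> highest persisted ID

    def publish(self, producer_name, sequence_id, payload):
        # A sequence ID at or below the last persisted one is a retry of
        # a message that was already stored, so it is not stored again.
        if sequence_id <= self.last_sequence_id.get(producer_name, -1):
            return "duplicate"
        self.log.append(payload)
        self.last_sequence_id[producer_name] = sequence_id
        return "persisted"


broker = ToyBroker()
first = broker.publish("producer-a", 0, b"order-1")
# The ack was lost, so the producer resends with the same sequence ID.
retry = broker.publish("producer-a", 0, b"order-1")
second = broker.publish("producer-a", 1, b"order-2")
print(first, retry, second, len(broker.log))
```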
----
2020-05-07 04:48:01 UTC - Sijie Guo: Pulsar Functions and Connectors can run
standalone, along with brokers, separately in a function worker cluster, or
over Kubernetes.
----
2020-05-07 04:48:38 UTC - Sijie Guo: ZooKeeper issue.
----
2020-05-07 04:49:20 UTC - Sijie Guo: If you are running standalone, I would
recommend disabling the state store first. You can run standalone with
`bin/pulsar standalone -nss`.
----
2020-05-07 07:26:36 UTC - Damien Roualen: I started with the one bundled with
Pulsar, but its version was old (0.206), and I could not get it to work with
Java 11.

Exactly, I can use the latest version directly that way.
----
