2020-05-06 10:27:41 UTC - Franck Schmidlin: Follow-up question: what differences are there between io #sinks and #functions?
I want to post topic messages to an http endpoint, à la pulsar beam. Functions feel more versatile than sinks, but would I be missing a trick? Is there any difference in processing, isolation, etc? ---- 2020-05-06 10:36:36 UTC - alex kurtser: Hello, I would like to better understand the managedLedgerDefaultEnsembleSize parameter. If I want to scale out my bookkeeper instances from 3 to 6, do I need to set managedLedgerDefaultEnsembleSize to 6 as well in order to use all 6 bookkeepers? ---- 2020-05-06 12:08:10 UTC - Pierre Zemb: Hi all :wave: I have a question: why are so many parameters, like geo-replication, tiered storage, retention and others, set at the namespace level and not the topic level? ---- 2020-05-06 12:16:01 UTC - Alexandre DUVAL: <https://github.com/apache/pulsar/wiki/PIP-51%3A-Tenant-policy-support> I think it was implemented that way because of previous needs. Now, more globally, I think it's "todo". ---- 2020-05-06 12:17:12 UTC - Alexandre DUVAL: I'm open to contributing with you if you go for it :wink:. ---- 2020-05-06 12:18:05 UTC - Pierre Zemb: thanks @Alexandre DUVAL! Found PIP 39: <https://github.com/apache/pulsar/wiki/PIP-39:-Namespace-Change-Events> ---- 2020-05-06 12:19:24 UTC - Pierre Zemb: I might work on that part indeed, I will keep you posted :slightly_smiling_face: ---- 2020-05-06 12:19:44 UTC - Alexandre DUVAL: PIP39 is really interesting. ---- 2020-05-06 12:21:37 UTC - Alexandre DUVAL: About this work, I think <https://github.com/apache/pulsar/pull/6428> will be interesting (currently only working for namespaces). ---- 2020-05-06 12:26:53 UTC - Pierre Zemb: thanks a lot @Alexandre DUVAL for the links, will dive into those ---- 2020-05-06 12:39:53 UTC - Damien Roualen: Hello, I have a question regarding Presto. Is it better to keep Pulsar with Presto included (for instance 2.5.0 with a custom version of Presto, e.g. 0.206, added to the pom file)? 
Or is it better to deploy Presto from the official website (<https://prestodb.io/>) and add the Pulsar connector plugin? Context: we have an existing Pulsar cluster, and we would like to deploy Presto and connect it to the cluster. ---- 2020-05-06 12:45:16 UTC - rani: @Sijie Guo, any clues here^? ---- 2020-05-06 13:38:04 UTC - Ming: Sink refers to outbound data from Pulsar to an external system. In terms of data flow, a Pulsar Function in most cases keeps the data within Pulsar (i.e. sending to another topic). If you have external data I/O, `sink` or `source` connectors are the right approach. As for the underlying implementation, connectors and Pulsar functions are very similar. They serve different purposes. If you mean posting data to http endpoints, a sink connector is more applicable. However, Beam is neither; it was developed as a standalone component to be more versatile and pluggable. +1 : Franck Schmidlin ---- 2020-05-06 13:44:37 UTC - Ming: @Kirill Merkushev you can use the admin API, admin CLI or REST API to rewind the cursor once the function subscription is created. An example could be <https://pulsar.apache.org/admin-rest-api/?version=2.5.1#operation/resetCursor> +1 : Kirill Merkushev ---- 2020-05-06 14:09:14 UTC - hugues DESLANDES: Hi, we are testing the pulsar-flink connector (but not using the schema registry, <https://github.com/streamnative/pulsar-flink>). From Flink we would like to sink some empty messages into Pulsar (to use compaction on a Pulsar topic). According to my understanding of the connector, there is no way to do this: we provide a message and a way to find the key from the message, so how could we make the message empty? Any tip or workaround would be helpful. Thanks ---- 2020-05-06 14:37:12 UTC - Penghui Li: I have created two issues to track the documentation for Proxy metrics and Presto worker metrics. <https://github.com/apache/pulsar/issues/6896> <https://github.com/apache/pulsar/issues/6897> And I marked them help-wanted. 
If you are interested in fixing them, you are welcome to. ---- 2020-05-06 14:40:56 UTC - Penghui Li: No, the new bookies will be selected when the managed ledger rolls over. For more details you can read <https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works> +1 : alex kurtser ---- 2020-05-06 14:44:40 UTC - Kirill Merkushev: can I pre-create a subscription the same way and then create the function? ---- 2020-05-06 15:26:30 UTC - Allen ONeill: Does anyone know of a hosted/managed version of Pulsar, like the ones I can get for e.g. Cassandra/Kafka? ---- 2020-05-06 15:31:28 UTC - Chris Bartholomew: We do at <https://kafkaesque.io/>. If you have any questions about it, let me know. ---- 2020-05-06 15:36:14 UTC - Chris Hansen: Really? They seem to work for me but I only tested `@JsonCreator` and `@JsonProperty`. Without those, I was getting an exception. ---- 2020-05-06 15:48:48 UTC - Ming: @Kirill Merkushev You not only have to pre-create the subscription, you also have to create the input topic. Although I have not tried it, it should work since the default subscription type is shared. ---- 2020-05-06 15:57:41 UTC - Ricardo Ferreira: @Ricardo Ferreira has joined the channel ---- 2020-05-06 16:02:28 UTC - Alex: @Alex has joined the channel ---- 2020-05-06 16:56:13 UTC - Manjunath Ghargi: Hi All, I'm looking for a performance test tool with which we can benchmark all the performance metrics, similar to JMeter or Gatling, which are standard tools for performance benchmarking. Can someone kindly share more details on whether any of these external tools support Pulsar and whether we can use them for performance testing? Or any other info related to performance testing of a Pulsar server scaling up to ~30k to 50k TPS? 
+1 : Franck Schmidlin ---- 2020-05-06 17:18:55 UTC - Addison Higham: <http://openmessaging.cloud/docs/benchmarks/> at one point, you had to build it yourself as the published docker images were broken, but perhaps they work again now tada : Shivji Kumar Jha +1 : Franck Schmidlin ---- 2020-05-06 17:21:02 UTC - Sijie Guo: @rani for python functions, one protobuf-related change was missed when cherry-picking for the 2.5.1 release. <https://github.com/apache/pulsar/issues/6858> ---- 2020-05-06 17:27:10 UTC - Sijie Guo: @Kirill Merkushev • If you already have a function running, you can use reset-cursor (i.e. admin-cli or RESTful API) to reset the cursor for the subscription created by the function. • You can pre-create a subscription at the position you would like to start from before submitting a function. ---- 2020-05-06 17:34:58 UTC - Sijie Guo: So this is related to shading problems. The Jackson-related libraries are shaded. You can use `pulsar-client-original` to get around this issue. Can you create an issue for us to improve this behavior? ---- 2020-05-06 17:36:11 UTC - Shivji Kumar Jha: Hi @Manjunath Ghargi We used <https://locust.io/> . A pretty good tool for a dev. We wrote <https://docs.locust.io/en/stable/writing-a-locustfile.html#declaring-tasks|locust tasks> which use the python pulsar client to send/receive Pulsar messages. This task was then baked into a docker container and we could just launch more and more instances of this container to increase throughput on pulsar. By default, the perf results are ephemeral so you could write to your favourite graphing tool (statsd for us) and then follow in there... Not out of the box, but very flexible! ---- 2020-05-06 17:36:18 UTC - Sijie Guo: The proxy exposes metrics. But I don’t think Presto exposes Prometheus metrics. ---- 2020-05-06 17:37:24 UTC - Sijie Guo: Along with PIP-39, we will introduce topic-level policy. 
/cc @Penghui Li ---- 2020-05-06 17:39:30 UTC - Sijie Guo: @Damien Roualen : I would recommend getting started with the one bundled with Pulsar, because you don’t need to worry about compatibility issues with different Presto versions. But deploying the official Presto release has its advantage - you can always pull in the latest changes from Presto. ---- 2020-05-06 17:40:44 UTC - Sijie Guo: Hi @hugues DESLANDES - I don’t think the current connector implementation supports that. Can you create an issue for it? ---- 2020-05-06 17:43:56 UTC - Manjunath Ghargi: @Shivji Kumar Jha: Can you please share sample code for the locust task that you have written, or any open Git repo that I can refer to? ---- 2020-05-06 17:45:33 UTC - Manjunath Ghargi: Thanks I'll look into this framework. ---- 2020-05-06 17:50:28 UTC - Sijie Guo: @Allen ONeill - Please check out <https://streamnative.io/support/managed-pulsar-service/> built by the original developers of Pulsar/BookKeeper. +1 : Shivji Kumar Jha ---- 2020-05-06 17:51:27 UTC - Shivji Kumar Jha: @Manjunath Ghargi Here is a quick <https://gist.github.com/shiv4289/fba1f68542b2fd4505e72d91de91b9f2|gist> that you could refer to. ---- 2020-05-06 17:56:23 UTC - Manjunath Ghargi: Thanks Shiv. ---- 2020-05-06 17:57:38 UTC - Kirill Merkushev: is it safe to create the subscription via a consumer (to change the subscription type to exclusive/failover)? As this option is missing in the api @Sijie Guo ---- 2020-05-06 17:59:51 UTC - Sijie Guo: It is safe to create the subscription. Functions don’t support exclusive, so don’t use the exclusive type. You can’t change the subscription type after a subscription is created. ---- 2020-05-06 18:24:08 UTC - Prasanth Lemati: @Prasanth Lemati has joined the channel ---- 2020-05-06 18:52:15 UTC - Addison Higham: hrm, a team at my company is curious about creating lots of namespaces (like a couple hundred). 
I would imagine that would create additional load on the metadata store with load balancing and policy management, but should that be much of a concern? +1 : Franck Schmidlin, 高天赐 ---- 2020-05-06 18:53:33 UTC - Addison Higham: (still figuring out if the use case makes sense, but curious conceptually if that would place undue stress anywhere) ---- 2020-05-06 19:09:31 UTC - Gary Fredericks: I'm trying to figure out how to reconcile these two things: A) tiered storage lets you store an arbitrarily long history for a topic, and pulsar lets you read that history; B) the backlog quota feature prevents a subscriber from consuming messages from too far behind the newest message in a topic ---- 2020-05-06 19:09:52 UTC - Gary Fredericks: (this is me trying to work out what to do about <https://github.com/streamnative/pulsar/issues/931>) ---- 2020-05-06 19:12:54 UTC - Gary Fredericks: my suspicion is that the backlog quota shouldn't apply in certain circumstances, like retention situations where the data isn't going to be deleted anyhow ---- 2020-05-06 19:13:18 UTC - Gary Fredericks: but I don't understand well enough how backlogs work to be sure of that ---- 2020-05-06 19:22:18 UTC - Alexandre DUVAL: What could be the reason for ```[pulsar-io-23-1] WARN org.apache.pulsar.broker.service.ServerCnx - [/192.168.10.13:43134] java.lang.NoSuchMethodError: java.nio.ByteBuffer.rewind()Ljava/nio/ByteBuffer; with role proxy-to-broker``` ---- 2020-05-06 19:31:43 UTC - Addison Higham: @Gary Fredericks are you using `consumer_backlog_eviction`? I have been curious about that as well. Since backlog quotas are set per namespace though, you should be able to remove the backlog quota and see if that fixes the issue. 
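Editor's sketch: Addison's suggestion to inspect and remove the namespace's backlog quota could be tried from the admin CLI. The Python below only builds the `pulsar-admin namespaces` command lines (`get-backlog-quotas`, `set-backlog-quota`, `remove-backlog-quota`) and does not run them. The `bin/pulsar-admin` path, the `my-tenant/my-ns` namespace and the helper names are assumptions for illustration, not taken from the conversation.

```python
from typing import List

# Assumed location of the admin CLI relative to a Pulsar install directory.
PULSAR_ADMIN = "bin/pulsar-admin"

# Retention policies accepted by the backlog quota setting.
BACKLOG_POLICIES = (
    "producer_request_hold",
    "producer_exception",
    "consumer_backlog_eviction",
)


def get_backlog_quotas_cmd(namespace: str) -> List[str]:
    """Command that prints the backlog quotas configured on a namespace."""
    return [PULSAR_ADMIN, "namespaces", "get-backlog-quotas", namespace]


def set_backlog_quota_cmd(namespace: str, limit: str, policy: str) -> List[str]:
    """Command that sets a size limit (e.g. '10G') and a policy on a namespace."""
    if policy not in BACKLOG_POLICIES:
        raise ValueError(f"unknown backlog quota policy: {policy}")
    return [
        PULSAR_ADMIN, "namespaces", "set-backlog-quota",
        "--limit", limit, "--policy", policy, namespace,
    ]


def remove_backlog_quota_cmd(namespace: str) -> List[str]:
    """Command that removes the backlog quota from a namespace."""
    return [PULSAR_ADMIN, "namespaces", "remove-backlog-quota", namespace]


if __name__ == "__main__":
    ns = "my-tenant/my-ns"  # hypothetical namespace
    for cmd in (get_backlog_quotas_cmd(ns), remove_backlog_quota_cmd(ns)):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` on a machine where the CLI is configured. Whether removing the quota is safe for a given namespace is exactly the open question in this thread.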
---- 2020-05-06 19:35:49 UTC - Gary Fredericks: @Addison Higham I am; that was the key thing I didn't know when I filed that issue. The problem is I don't yet know the implications of changing that policy; is running with an unlimited backlog quota a safe thing to do, in namespaces with infinite retention? ---- 2020-05-06 19:38:33 UTC - Addison Higham: the only thing I can think of is whether a subscription retaining a message prevents either offloading from happening or the broker message cache from being cleared out. Otherwise, I don't see why it would be problematic ---- 2020-05-06 19:38:45 UTC - Addison Higham: and I don't know the answer to that question (but am curious to know as well :slightly_smiling_face: ) ---- 2020-05-06 19:39:37 UTC - Addison Higham: I have sort of assumed that reader subscriptions, since they are somewhat different as a non-durable cursor, may not even have backlog quota logic applied, but that might not be true ---- 2020-05-06 19:40:13 UTC - Gary Fredericks: well they do, is what I found, but I was wondering if maybe they shouldn't ---- 2020-05-06 19:41:53 UTC - Pierre Zemb: for all :fr: readers, @Steven Le Roux, @Quentin ADAM and myself recorded a podcast about Pulsar and KoP, enjoy: <https://bigdatahebdo.com/podcast/episode-99-apache-pulsar-et-kafka-on-pulsar/> +1 : Florentin Dubois, Gilles Barbier, Alexandre DUVAL, Sijie Guo, Pierre Zemb fr : Pierre Zemb clap : Karthik Ramasamy ---- 2020-05-06 19:45:23 UTC - Chris Hansen: sure thing ---- 2020-05-06 20:25:44 UTC - Franck Schmidlin: I'm looking at the AWS deployment instructions and the default cluster sizing seems quite large/expensive to my untrained eye. <https://pulsar.apache.org/docs/v2.0.1-incubating/deployment/aws-cluster/|https://pulsar.apache.org/docs/v2.0.1-incubating/deployment/aws-cluster/> Is there any minimal but functional size for a cluster? I want a realistic infrastructure for my PoC but I won't be hammering it. 
In fact, even in production I don't have the kind of volumes that seem to be the standard use case for pulsar. ---- 2020-05-06 20:28:09 UTC - Alexandre DUVAL: @Sijie Guo any idea? (latest master proxy/broker, 2.5.1 bookkeeper/client) ---- 2020-05-06 20:29:01 UTC - Alexandre DUVAL: I don't see major changes on bookies between both versions so I didn't update the bookies, but maybe I should ---- 2020-05-06 20:30:30 UTC - Alexandre DUVAL: the global behavior is that everything connects well but no message is forwarded ---- 2020-05-06 20:39:06 UTC - Sijie Guo: NoSuchMethodError means there is a dependency conflict that prevents netty from being loaded properly ---- 2020-05-06 20:44:32 UTC - Alexandre DUVAL: on the broker itself? ---- 2020-05-06 21:18:16 UTC - Alexandre DUVAL: is that related to the bump of netty ``` <netty.version>4.1.48.Final</netty.version> <netty-tc-native.version>2.0.30.Final</netty-tc-native.version>``` and did it conflict with the pulsar usages on proxy <-> broker connections? ---- 2020-05-06 21:53:11 UTC - Greg Methvin: I’m wondering about the same thing actually. We basically want to have a namespace per customer, of which we might have 1000 or so. ---- 2020-05-06 21:53:47 UTC - Greg Methvin: it’s not totally necessary but it seems useful. ---- 2020-05-06 21:58:01 UTC - Addison Higham: @Franck Schmidlin pulsar scales down pretty well. I run across 8 regions which are very imbalanced; in my smallest regions I run brokers with as little as 1 GB of memory. Bookies tend to be a bit more memory hungry (I have had issues with ensuring they don't OOM on heap) but I can still run them at about 4 GB of memory. Zookeeper can be quite small as well, 1 GB of heap is fine for it. ---- 2020-05-06 21:59:53 UTC - Addison Higham: a cluster of that size can still push reasonable throughput, 10k msgs/sec or more. The really important bit is just fast disk for bookie journals. 
I use provisioned IOPS volumes ---- 2020-05-06 22:38:18 UTC - Ming: @Gary Fredericks We were just discussing this topic about message retention with someone. I think there are a lot of different concepts here. In A), Tiered Storage merely extends the disk space. It does not govern the message retention policy. The message retention policy governs how long messages can be kept. B) Backlog quota puts a limit on the unacked messages (the backlog) on a subscription. It prevents a topic from growing without bound if there are too many unacked messages. So the consumer can either ack those messages or TTL will force expired messages to be auto-acked. Backlog is per subscription. So there could be multiple backlogs for a topic because a topic can have multiple subscriptions. ---- 2020-05-06 22:38:42 UTC - Kirill Merkushev: also is there a way to tune the producer the same way - such as changing the hashing to murmur32? ---- 2020-05-06 22:40:08 UTC - Gary Fredericks: @Ming does a backlog imply extra resource usage beyond what's already used by the retention? or more to my use case, is there _any_ benefit to a backlog quota if your retention is infinite? ---- 2020-05-06 23:08:05 UTC - Ming: Backlog and retention policy are two different concepts. In a vanilla Pulsar configuration, only unacked messages on a subscription will be kept for consumption. This means acked messages and messages on topics with no subscription (i.e. reader only) can be deleted. This is why the retention policy was introduced, to allow messages to be retained in persistent storage. Pulsar tries to keep unacked messages forever. The backlog quota and TTL are really there to prevent a message queue (I purposefully use queue instead of topic since these terms are interchangeable in the queuing world) from growing indefinitely. Pulsar will delete acked messages and messages with no subscription as soon as it can (when the trigger is satisfied, such as a time interval). So the retention policy counters this behaviour to keep the messages. 
Actually, messages can still be deleted in tiered storage if they no longer satisfy the retention policy. You probably already know that messages are not deleted individually; instead it is the ledger, a collection of messages, that is deleted. ---- 2020-05-06 23:10:33 UTC - Gary Fredericks: Does this mean that if the retention policy prevents deletion, the backlog quota has no additional effect? ---- 2020-05-06 23:10:58 UTC - Kirill Merkushev: and one more question regarding functions - is context shared between functions? ---- 2020-05-06 23:21:06 UTC - Ming: They work independently. It depends on the backlog quota policy. In the case of `producer_exception`, when the backlog quota is reached the producer can no longer send messages; instead it receives an exception from the broker. ---- 2020-05-06 23:23:00 UTC - Kirill Merkushev: and btw how do I enable state for the local runner? ---- 2020-05-06 23:23:46 UTC - Ming: Meanwhile, the data retention could still have plenty of room to persist messages. These are two independent problems Pulsar tries to tackle. But they interplay too. ---- 2020-05-06 23:25:16 UTC - Kirill Merkushev: (as I get an exception) ```java.lang.RuntimeException: Failed to increment key '85658b96-f126-413a-8f2a-1304604a6902' by amount '1' at org.apache.pulsar.functions.instance.ContextImpl.incrCounter(ContextImpl.java:277) ~[org.apache.pulsar-pulsar-functions-instance-2.5.0.jar:?] at ru.lanwen.pulsar.functions.SimpleCtxFunction.process(SimpleCtxFunction.java:13) ~[?:?]``` ---- 2020-05-06 23:34:12 UTC - Liam Clarke: Hi all, I'm looking at Pulsar as a Kafka replacement, and I had a question about delivery guarantees. From reading the docs, it seems that Pulsar's architecture guarantees "at least once" by default - if a producer sends a record to a broker, and the broker commits it to BookKeeper and then fails before sending the ack to the producer, then the producer will try again. 
However, if I enable deduplication, it looks like it guarantees 'exactly once' - <https://pulsar.apache.org/docs/en/cookbooks-deduplication/> Am I understanding this correctly? Also, Pulsar IO and Pulsar Functions - am I correct in that they run on the brokers, as opposed to Kafka Connect / Kafka Streams which run standalone? ---- 2020-05-06 23:37:04 UTC - Alexandre DUVAL: oh maybe it's because I compiled with java11 T_T ---- 2020-05-06 23:37:50 UTC - Alexandre DUVAL: and running with java8.. ---- 2020-05-06 23:48:46 UTC - Ming: It depends on your requirements. Since you are looking into an AWS cluster, I guess standalone won't be enough for your PoC. So you might need at least 3 bookies and 3 zookeeper pods. But can you get away with one broker? If it's a PoC, you could use spot instances, which cost 80 to 90% less. You will do fine with m4 or m5. No need to pay a premium for compute- or storage-optimized VMs. We actually have been using m4 in a production cluster and that is sufficient. ---- 2020-05-06 23:50:47 UTC - Chris Hansen: <https://github.com/apache/pulsar/issues/6902> ---- 2020-05-07 00:49:19 UTC - Joshua Dunham: Hi Everyone, Getting an error : Error creating ledger for allocating /stream/storage/streamsXXX... is this a disk storage issue or ZooKeeper issue? ---- 2020-05-07 03:25:29 UTC - Raphael Enns: @Raphael Enns has joined the channel ---- 2020-05-07 04:47:28 UTC - Sijie Guo: Correct. Broker de-duplication can achieve exactly-once producing. ---- 2020-05-07 04:48:01 UTC - Sijie Guo: Pulsar Functions and Connectors can run standalone, along with brokers, separately in a function worker cluster, or over Kubernetes. ---- 2020-05-07 04:48:38 UTC - Sijie Guo: ZooKeeper issue. ---- 2020-05-07 04:49:20 UTC - Sijie Guo: If you are running standalone, I would recommend disabling the state store first. You can run standalone with `bin/pulsar standalone -nss`. 
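Editor's sketch: the retry scenario Liam describes (the broker persists a message, the ack is lost, the producer resends) can be illustrated with a toy model of de-duplication. It assumes the broker remembers the highest sequence ID it has persisted per producer name and drops anything at or below it. This is a simplified illustration of the idea, not Pulsar's implementation; `DedupBroker` and its methods are hypothetical names.

```python
from typing import Dict, List, Tuple


class DedupBroker:
    """Toy broker that drops re-sent messages using per-producer sequence IDs."""

    def __init__(self) -> None:
        # Messages actually persisted, in order: (producer, sequence id, payload).
        self.log: List[Tuple[str, int, str]] = []
        # Highest sequence ID persisted so far for each producer name.
        self.last_seq: Dict[str, int] = {}

    def publish(self, producer: str, seq_id: int, payload: str) -> bool:
        """Persist the message unless it was already seen; return True if stored."""
        if seq_id <= self.last_seq.get(producer, -1):
            # Duplicate from a retry: acknowledge it again, but do not store it twice.
            return False
        self.log.append((producer, seq_id, payload))
        self.last_seq[producer] = seq_id
        return True


if __name__ == "__main__":
    broker = DedupBroker()
    broker.publish("producer-a", 0, "order-created")  # persisted, but the ack is lost
    broker.publish("producer-a", 0, "order-created")  # producer retries the same message
    broker.publish("producer-a", 1, "order-paid")     # next message
    print(len(broker.log), "messages persisted")
```

Without the sequence-ID check the retry would be stored twice, which is the at-least-once behaviour described in the question.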
---- 2020-05-07 07:26:36 UTC - Damien Roualen: I started with the one bundled with Pulsar, but the version was old (0.206), and I could not get it to work with Java 11. Exactly, that way I can use the latest version directly. ----
