2020-02-27 09:55:26 UTC - Rolf Arne Corneliussen: Update: the above setup should work. I downloaded the latest StreamNative Weekly Build (have used the 2.5.0 release up to now), and then compaction worked as expected. Will raise a GitHub bug. ---- 2020-02-27 10:56:38 UTC - Thiago: @Thiago has joined the channel ---- 2020-02-27 12:27:19 UTC - riconsch: @riconsch has joined the channel ---- 2020-02-27 12:34:48 UTC - riconsch: Hello everyone :slightly_smiling_face: I have the following question/topic. I want the following job to be done by Pulsar instead of Kafka. My data process looks like this (at the moment with one Kafka broker due to Confluent licensing restrictions): JSON data => serialized with Avro => taking schema from schema registry => producing to Kafka topic => stored via HDFS 3 Sink => Hive to query data (metastore updated via schema information)
Is this workflow something Pulsar can execute the way Kafka does in my process? We are evaluating a switch from Kafka to Pulsar. Any thoughts / help would be greatly appreciated, as I am a Pulsar novice. Best, riconsch ---- 2020-02-27 15:13:34 UTC - Sijie Guo: :+1: ---- 2020-02-27 15:15:46 UTC - Sijie Guo: @riconsch yes that’s doable. Pulsar provides the builtin schema management and connectors to achieve the same data pipeline as you described muscle : Konstantinos Papalias ---- 2020-02-27 15:36:44 UTC - Pedro Cardoso: Hello, does Pulsar support incremental cooperative rebalancing akin to Kafka? See <https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/> ---- 2020-02-27 16:05:52 UTC - eilonk: Hi :slightly_smiling_face: I'm using Helm to deploy Pulsar on a Kubernetes cluster. I can't find any documentation regarding Helm and Pulsar with TLS enabled. Many values, mostly in configmaps, require file paths (e.g. tlsTrustCertsFilePath=/opt/pulsar/ca.cert.pem). On the other hand, the chart requires a tls-secret, which I did configure, but it's not possible to reference a secret in a configmap. How can I work around these values? Has anyone encountered this problem? Thanks! ---- 2020-02-27 16:21:39 UTC - John Duffie: we figured out the problem. If the SpecificRecord’s class is passed to the producer, then the schema is generated on the fly from the fields in the class. What we needed was for the schema string to be the content of the SpecificRecord’s pregenerated schema (e.g.
SpecificRecordBase.getSchema()). So when you construct the producer, instead of doing this: ```newProducer(Schema.AVRO(MyAvroClassThatExtendsSpecificRecord.class))``` you need to do this: ```newProducer(Schema.AVRO(schemaDefBuilder.build()))``` where schemaDefBuilder is of the form: ```SchemaDefinitionBuilderImpl<MyAvroClassThatExtendsSpecificRecord> schemaDefBuilder = new SchemaDefinitionBuilderImpl<>(); schemaDefBuilder.withJsonDef(MyAvroClassThatExtendsSpecificRecord.getClassSchema().toString());``` +1 : Konstantinos Papalias, Raman Gupta ---- 2020-02-27 16:43:08 UTC - matt_innerspace.io: if it's a function, subscribing to thousands of topics via regex, and it's configured for parallelism, would that help? ---- 2020-02-27 16:49:05 UTC - Chris DiGiovanni: I have a Pulsar environment where someone created a namespace whose name contains a "/". I cannot figure out a way to delete this namespace now. I've tried the pulsar-admin utility, which complains, and I've hit the Pulsar admin API directly without any luck. I've also tried URL-encoding the namespace name, and that also does not work. Any help would be appreciated. ---- 2020-02-27 16:54:28 UTC - Chris Bartholomew: Hi @eilonk, we have a Pulsar Helm chart that supports TLS configuration. You can find it here: <https://helm.kafkaesque.io>. If you have trouble with it, let me know. ---- 2020-02-27 17:26:18 UTC - Ryan Slominski: Can you have multiple schemas associated with a single topic? For example, an addUser schema and a deleteUser schema on a "user" topic.
---- 2020-02-27 17:28:28 UTC - John Duffie: we have the liberty of defining a new schema for our particular flow - so we have an Avro schema that supports a union of embedded event objects ---- 2020-02-27 17:28:57 UTC - John Duffie: that’s how we got around the problem you mention of one topic with multiple schemas ---- 2020-02-27 17:29:55 UTC - Meyappan Ramasamy: @Meyappan Ramasamy has joined the channel ---- 2020-02-27 17:30:02 UTC - Joshua Dunham: Hi @John Duffie, I was *just* going to ask about nested schemas (complex types) in Pulsar! ---- 2020-02-27 17:30:26 UTC - Joshua Dunham: How do you manage them outside of Pulsar? ---- 2020-02-27 17:31:37 UTC - Joshua Dunham: I was looking into a service that can sync to something like the HWX Schema Registry for a UI (short of support in pulsar-manager). ---- 2020-02-27 17:34:14 UTC - Meyappan Ramasamy: hi team, this is Meyappan here. I am running Pulsar in a Docker container on Windows. I find Pulsar is supported only on macOS and Linux, and I was unable to do a pip install pulsar-client==2.5.0 to get the Pulsar Python client installed. I want to know how I can access the Pulsar CLI commands like pulsar-admin and pulsar-client when running Pulsar as a Docker image in a container on Windows. Or can I access the Pulsar CLI commands only when Pulsar is installed on macOS or Linux? ---- 2020-02-27 17:35:59 UTC - Tanner Nilsson: I would think you could docker exec into your Docker container and run `bin/pulsar-admin` or `bin/pulsar-client` from there? (haven't been working with it on Windows) ---- 2020-02-27 17:49:07 UTC - John Duffie: Could you elaborate on your management question? Is the goal to ensure backups of the schema, or use of the schema in non-Pulsar flows like http/grpc? Regarding the latter, years back I wrote my own registry and support library that was decoupled from the broker technology - transports included http, kafka, jmx :slightly_smiling_face: .
The approach used a message consisting of a common record with fields containing the schema version and a byte[] of the inner, serialized Avro. That allowed a common solution to work over different technologies. Now I want to find an off-the-shelf solution for a very specific use case. ---- 2020-02-27 18:16:01 UTC - Joshua Dunham: Definitely the latter, but my goal is really to make something that is very human focused. ---- 2020-02-27 18:16:24 UTC - Joshua Dunham: Unfortunately the inertia is towards siloed apps and siloed schemas. ---- 2020-02-27 18:16:46 UTC - Joshua Dunham: I want a schema manager that supports multiple transports but also acts as a hub to get things cleaned up. ---- 2020-02-27 18:34:29 UTC - John Duffie: I so wish there were an off-the-shelf solution. I do not like the tight coupling between the broker technology and the schema registry that is becoming widespread so that things are “easier” ---- 2020-02-27 18:36:15 UTC - John Duffie: I don’t know of a FOSS solution. I’m looking at the Hortonworks one you mention. My current project is not based on HDP. Is that a requirement? ---- 2020-02-27 18:41:49 UTC - Ryan Slominski: Is having both a changes (add/remove/update) entity topic and a snapshot (entire table in a message) topic a good strategy? Some consumers want just what has changed; others want the entire list of all entities. Single producer. Some consumers might benefit from both to speed up replay - I might try a "change ID" for consumers who want both, to correlate change messages with snapshot messages. Seems like Event Sourcing and CQRS are a minefield. ---- 2020-02-27 18:43:10 UTC - John Duffie: the update case is the one that has me perplexed ---- 2020-02-27 18:43:52 UTC - Ryan Slominski: update as in someone changes one of the fields - e.g. the lastname in a user entity record might change ---- 2020-02-27 18:45:15 UTC - John Duffie: luckily, my use case is simpler. 99.99% of traffic is time-series IoT data.
updates only occur for some metadata ---- 2020-02-27 18:45:53 UTC - John Duffie: yes ---- 2020-02-27 18:47:18 UTC - Ryan Slominski: Yeah, I'm attempting to store metadata in Pulsar too. My primary data is simple event data (notifications). ---- 2020-02-27 18:47:40 UTC - John Duffie: sending the entire object instead of just the changed fields is not a deal breaker for me ---- 2020-02-27 18:48:45 UTC - Ryan Slominski: Update isn't a deal breaker - I could just support add and remove, and when something is "updated", just remove the old record and add the new one ---- 2020-02-27 18:51:01 UTC - Ryan Slominski: The question is - is having effectively a materialized view stored in Pulsar in a separate topic a good strategy? If stored outside in an RDBMS or something it might be easier to query, but that requires two different interfaces for clients. ---- 2020-02-27 19:02:06 UTC - Joshua Dunham: Not at all ---- 2020-02-27 19:02:10 UTC - Joshua Dunham: It's all API driven ---- 2020-02-27 19:02:29 UTC - Joshua Dunham: feasible to pull the code and add in a module that syncs to Pulsar at its API endpoint. ---- 2020-02-27 19:06:34 UTC - John Duffie: want a snapshot in time of the state of the table? ---- 2020-02-27 19:09:12 UTC - Ryan Slominski: yeah, instead of storing the table in an RDBMS, and instead of having to replay the change log topic to reconstruct state, I was thinking of having an additional topic, published by the same producer as the change topic, that carries a snapshot of the full table ---- 2020-02-27 19:10:41 UTC - John Duffie: If the goal is to have a way to track table changes, then it sounds ok. But if the goal is to avoid having 2 different clients - 1 to Pulsar and 1 to the RDBMS - then I’m more skeptical about the benefit ---- 2020-02-27 19:12:52 UTC - Ryan Slominski: Often the same client needs not only record changes, but must also obtain the full table of records when first connecting / re-connecting.
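The snapshot-plus-change-log replay discussed in this thread can be sketched in plain Java. This is a minimal illustration, not anything from Pulsar itself: the `Change` and `Snapshot` types and all values are hypothetical, the Pulsar producer/consumer wiring is elided, and the "change ID" is the correlation field Ryan proposes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical change event: a monotonically increasing change ID plus an
// upsert or removal of one record (value is ignored when remove is true).
record Change(long id, String key, String value, boolean remove) {}

// Hypothetical snapshot: the full table plus the last change ID it includes.
record Snapshot(long lastChangeId, Map<String, String> table) {}

public class TableReplay {
    // Rebuild current state: start from the snapshot, then apply only the
    // change-log entries newer than the snapshot (correlated by change ID).
    static Map<String, String> rebuild(Snapshot snap, Iterable<Change> changeLog) {
        Map<String, String> state = new HashMap<>(snap.table());
        for (Change c : changeLog) {
            if (c.id() <= snap.lastChangeId()) continue; // already in snapshot
            if (c.remove()) state.remove(c.key());
            else state.put(c.key(), c.value());
        }
        return state;
    }

    public static void main(String[] args) {
        Snapshot snap = new Snapshot(2, Map.of("alice", "Smith", "bob", "Jones"));
        List<Change> log = List.of(
            new Change(1, "alice", "Old", false),   // older than snapshot, skipped
            new Change(3, "alice", "Brown", false), // "update" = replace the value
            new Change(4, "bob", null, true));      // removal
        System.out.println(rebuild(snap, log)); // {alice=Brown}
    }
}
```

A freshly connecting client would read the latest snapshot message, then subscribe to the change topic and feed only newer changes into `rebuild`, which avoids replaying the full change log.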
---- 2020-02-27 19:13:53 UTC - John Duffie: ok, that is an interesting use case if there is some form of local cache that needs to be warmed up ---- 2020-02-27 19:16:26 UTC - Ryan Slominski: yeah, kind of a shared config file that has pub/sub notifications when changes occur. ---- 2020-02-27 19:26:28 UTC - Tobias Macey: @Tobias Macey has joined the channel ---- 2020-02-27 19:33:55 UTC - Tobias Macey: This may belong in <#CJLPGAZBM|dev-ruby> but I'm working on deploying a Pulsar cluster with the intent of using it as a buffer for log messages coming from FluentD agents, with those topics then subscribed to by a pair of FluentD aggregators. Unfortunately, the lack of Ruby support among Pulsar clients makes this a bit challenging. Would it be viable to use one of the Kafka API compatibility layers to allow FluentD to treat Pulsar as a Kafka broker, or are those abstractions only available to Java clients that can swap out the class object? ---- 2020-02-27 19:35:45 UTC - Tobias Macey: Alternatively, I'm looking at using the Netty IO client for publishing from FluentD, or something like <https://github.com/kafkaesque-io/pulsar-beam/> ---- 2020-02-27 19:36:03 UTC - Tobias Macey: Has anyone else gone down this path and have any advice? ---- 2020-02-27 21:06:55 UTC - Ryan Slominski: Looks like a bug in the documentation - clicking the next button at the bottom of this page <https://pulsar.apache.org/docs/en/functions-deploy/> results in a 404 ---- 2020-02-27 21:10:32 UTC - Ryan Slominski: Do Pulsar Schemas work with Pulsar Functions? Also, what happens if I have two input topics into a Pulsar Function with different schemas? ---- 2020-02-27 21:26:37 UTC - Joe Francis: This is a solution to a problem. Unless the same problem exists in Pulsar, I am not clear on why Pulsar requires something like this.
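On Ryan's Functions question: a typed Pulsar Function declares a single input type, so (as I understand it) every input topic attached to the function must carry messages that deserialize to that type; topics with genuinely different schemas would need a common type such as byte[] or a union schema. A minimal sketch, with the `org.apache.pulsar.functions.api` interfaces stubbed inline so it compiles standalone (the real `Context` carries logging, state, and publishing helpers):

```java
// Stubs standing in for org.apache.pulsar.functions.api -- the real library
// provides these; they are reproduced here only so the sketch is self-contained.
interface Context {}
interface Function<I, O> { O process(I input, Context context) throws Exception; }

// A typed function over String input: with this signature, every input topic
// the function subscribes to must be consumable with a String schema.
public class ExclaimFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        return input + "!";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new ExclaimFunction().process("hello", null)); // hello!
    }
}
```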
---- 2020-02-27 21:29:03 UTC - Pedro Cardoso: I am simply looking to understand whether a user can write custom logic for partition assignment within Pulsar and perform rebalancing logic in a gradual way (ideally without partition-consumption downtime) ---- 2020-02-27 21:29:15 UTC - Joe Francis: Consumption-side scaling is automatic in Pulsar. Use a shared consumer, and scale the number of consumers up and down as needed. ---- 2020-02-27 21:31:22 UTC - Pedro Cardoso: Is there documentation on how this consumption-side scaling is performed? Can I introduce custom logic into it? ---- 2020-02-27 21:32:07 UTC - Pedro Cardoso: Do cluster rebalances imply downtime, or can consumers continue to consume? ---- 2020-02-27 21:32:09 UTC - Joe Francis: None of that repartitioning/regrouping is required. You will kind of need to get cured of it, if you move from Kafka to Pulsar. ---- 2020-02-27 21:33:03 UTC - Joe Francis: It’s a radically new, simpler way of doing things. None of those concepts from the Kafka world apply ---- 2020-02-27 21:34:36 UTC - Pedro Cardoso: Clearly there is some information I am missing; I will need to investigate some more. Thank you for your time. ---- 2020-02-27 21:36:18 UTC - Joe Francis: <https://pulsar.apache.org/docs/v1.19.0-incubating/getting-started/ConceptsAndArchitecture/> ---- 2020-02-27 21:36:59 UTC - Joe Francis: <https://streaml.io/blog/pulsar-segment-based-architecture> ---- 2020-02-27 21:37:18 UTC - Ryan Slominski: Created a pull request for the docs issue... <https://github.com/apache/pulsar/pull/6434> ---- 2020-02-27 21:43:13 UTC - Joe Francis: Simply stated, if the load on your topic increases, you spin up a few more instances of your consumer to scale up the consumption rate, and shut them down when it’s done. It sounds like snake oil, but that is how it works. ---- 2020-02-27 21:46:36 UTC - Pedro Cardoso: My particular use-case is for stateful consumers; simply spinning up more instances is not enough.
Think stream processing on x nodes; you need to scale up to 2x nodes, state must be copied over, and ideally you would want the new nodes to be physically co-located. ---- 2020-02-27 21:47:36 UTC - Pedro Cardoso: Perhaps the stream is partitioned in some way, and you want rebalancing for a specific partition ---- 2020-02-28 00:31:25 UTC - Eugen: When consuming like `Message<byte[]> msg = consumer.receive()`, is there a way to get the partition of the `msg`? ---- 2020-02-28 03:30:09 UTC - Sijie Guo: Yes. You can get the topic name and use the TopicPartition class to get the partition id ---- 2020-02-28 03:46:15 UTC - Eugen: Thanks! I was under the impression that `message.getTopicName()` returns the name as given by the producer, rather than the internal name, which looks like `topic:<persistent://public/default/topic-0-partition-4>` ---- 2020-02-28 05:20:22 UTC - Sijie Guo: no, it doesn’t help based on the current implementation ---- 2020-02-28 07:20:47 UTC - Sijie Guo: the Kafka API compatibility layer is a Java API wrapper; it is not a protocol compatibility layer. We have developed a Kafka protocol handler for Pulsar. It will be open sourced in a week or so. ---- 2020-02-28 07:21:31 UTC - Sijie Guo: @Tobias Gustafsson you can use the websocket interface to publish to Pulsar. ---- 2020-02-28 07:40:41 UTC - Sijie Guo: > but it’s not possible to reference a secret in a configmap @eilonk you can mount the secret to the file path that you specify in the configmaps. ---- 2020-02-28 07:44:48 UTC - Sijie Guo: @Pedro Cardoso - for autoscaled stream processing, you can use key-range consumers/readers. You don’t need to rebalance partitions. You can divide the key space into multiple key ranges, then assign key ranges to your stateful processors. You open a consumer/reader to read the events for a key range and persist the state by key range. When you scale up your stream processing, you copy the key-range state to the new processing node and then resume reading events from that key range.
You can also do a key-range split if a range becomes a bottleneck. In this way, you control the scale-up of your state, but you don’t need to do any partition rebalancing. ---- 2020-02-28 07:45:14 UTC - Sijie Guo: you can also scale stateful processing beyond the number of partitions. ---- 2020-02-28 08:13:59 UTC - Manuel Mueller: I would second Tanner. 1. run `docker ps` while Pulsar is up 2. find either the ID or the name 3. use `docker exec -it [yourID/name] bash` to get into the container and be able to run pulsar-admin commands ---- 2020-02-28 09:04:13 UTC - Rolf Arne Corneliussen: Have you considered topic compaction, to reduce the time used to replay the change log? ----
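The key-range scheme Sijie Guo describes a few messages up can be sketched in plain Java. The consumer/reader wiring and state copying are elided; the 65536-slot hash space mirrors the size Pulsar's Key_Shared subscription uses, but `String.hashCode` here is only a stand-in for the real hash function, and the range layout and node names are purely illustrative.

```java
import java.util.List;
import java.util.TreeMap;

// A contiguous key-hash range [start, end], owned by one stateful processor.
record KeyRange(int start, int end, String processor) {}

public class KeyRangeRouter {
    static final int HASH_SPACE = 65536; // 2^16 slots, as in Pulsar's Key_Shared

    private final TreeMap<Integer, KeyRange> byStart = new TreeMap<>();

    KeyRangeRouter(List<KeyRange> ranges) {
        for (KeyRange r : ranges) byStart.put(r.start(), r);
    }

    // Route a message key to the processor owning its hash slot. To scale out,
    // split a range in half, copy that half's state to the new node, and have
    // the new node resume reading events for its half of the range.
    String route(String key) {
        int slot = (key.hashCode() & Integer.MAX_VALUE) % HASH_SPACE;
        return byStart.floorEntry(slot).getValue().processor();
    }

    public static void main(String[] args) {
        // Two processors split the hash space evenly; no partition rebalance
        // is involved, only the ownership of key ranges changes on scale-up.
        KeyRangeRouter router = new KeyRangeRouter(List.of(
            new KeyRange(0, 32767, "node-a"),
            new KeyRange(32768, 65535, "node-b")));
        System.out.println(router.route("device-42"));
    }
}
```

Because state is persisted per key range rather than per partition, splitting a hot range only requires moving that range's state, which is the gradual, downtime-free scale-up Pedro was asking about.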
