2020-02-27 09:55:26 UTC - Rolf Arne Corneliussen: Update: the above setup should work. I downloaded the latest StreamNative Weekly Build (have used the 2.5.0 release up to now), and then compaction worked as expected. Will raise a GitHub bug. ---- 2020-02-27 10:56:38 UTC - Thiago: @Thiago has joined the channel ---- 2020-02-27 12:27:19 UTC - riconsch: @riconsch has joined the channel ---- 2020-02-27 12:34:48 UTC - riconsch: Hello everyone :slightly_smiling_face: I have the following question/topic. I want the following job to be done by Pulsar instead of Kafka. My data process looks like this (at the moment with one Kafka broker due to Confluent licensing restrictions): JSON data => serialized with Avro => taking schema from schema registry => producing to Kafka topic => stored via HDFS 3 Sink => Hive to query data (metastore updated via schema information)
Is this workflow something Pulsar can execute the way Kafka does in my process? We are evaluating a switch from Kafka to Pulsar. Any thoughts / help would be greatly appreciated, as I am a Pulsar novice. Best, riconsch ---- 2020-02-27 15:13:34 UTC - Sijie Guo: :+1: ---- 2020-02-27 15:15:46 UTC - Sijie Guo: @riconsch yes that’s doable. Pulsar provides the builtin schema management and connectors to achieve the same data pipeline as you described muscle : Konstantinos Papalias ---- 2020-02-27 15:36:44 UTC - Pedro Cardoso: Hello, does Pulsar support incremental cooperative rebalancing akin to Kafka? See <https://www.confluent.io/blog/incremental-cooperative-rebalancing-in-kafka/> ---- 2020-02-27 16:05:52 UTC - eilonk: Hi :slightly_smiling_face: I'm using Helm to deploy Pulsar on a Kubernetes cluster. I can't find any documentation regarding Helm and Pulsar with TLS enabled. Many values, mostly in configmaps, require file paths (e.g. tlsTrustCertsFilePath=/opt/pulsar/ca.cert.pem). On the other hand, the chart requires a tls-secret, which I did configure, but it's not possible to reference a secret in a configmap. How can I work around these values? Has anyone encountered this problem? Thanks! ---- 2020-02-27 16:21:39 UTC - John Duffie: we figured out the problem. If the SpecificRecord’s class is passed to the producer, then the schema is generated on the fly from the fields in the class. What we needed was for the schema string to be the content of the SpecificRecord’s pregenerated schema (e.g.
SpecificRecordBase.getSchema()). So when you construct the producer, instead of doing this: ```newProducer(Schema.AVRO(MyAvroClassThatExtendsSpecificRecord.class))``` you need to do this: ```newProducer(Schema.AVRO(schemaDefBuilder.build()))``` where schemaDefBuilder is of the form: ```SchemaDefinitionBuilderImpl<MyAvroClassThatExtendsSpecificRecord> schemaDefBuilder = new SchemaDefinitionBuilderImpl<>(); schemaDefBuilder.withJsonDef(MyAvroClassThatExtendsSpecificRecord.getClassSchema().toString());``` +1 : Konstantinos Papalias, Raman Gupta ---- 2020-02-27 16:43:08 UTC - matt_innerspace.io: if it's a function, subscribing to thousands of topics via regex, and it's configured for parallelism, would that help? ---- 2020-02-27 16:49:05 UTC - Chris DiGiovanni: I have a Pulsar environment where someone created a namespace whose name contains a "/". I cannot figure out a way to delete this namespace now. I've tried the pulsar-admin utility, which complains, and I've hit the Pulsar admin API directly without any luck. I've also tried URL-encoding the namespace name, and that also does not work. Any help would be appreciated. ---- 2020-02-27 16:54:28 UTC - Chris Bartholomew: Hi @eilonk, we have a Pulsar Helm chart that supports TLS configuration. You can find it here: <https://helm.kafkaesque.io>. If you have trouble with it, let me know. ---- 2020-02-27 17:26:18 UTC - Ryan Slominski: Can you have multiple schemas associated with a single topic? For example, an addUser schema and a deleteUser schema on a "user" topic.
---- 2020-02-27 17:28:28 UTC - John Duffie: we have the liberty of defining a new schema for our particular flow - so we have an Avro schema that supports a union of embedded event objects ---- 2020-02-27 17:28:57 UTC - John Duffie: that’s how we got around the problem you mention of one topic with multiple schemas ---- 2020-02-27 17:29:55 UTC - Meyappan Ramasamy: @Meyappan Ramasamy has joined the channel ---- 2020-02-27 17:30:02 UTC - Joshua Dunham: Hi @John Duffie, I was *just* going to ask about nested schemas (complex types) in Pulsar! ---- 2020-02-27 17:30:26 UTC - Joshua Dunham: How do you manage them outside of Pulsar? ---- 2020-02-27 17:31:37 UTC - Joshua Dunham: I was looking into a service that can sync to something like the HWX Schema Registry for a UI (short of support in pulsar-manager). ---- 2020-02-27 17:34:14 UTC - Meyappan Ramasamy: hi team, this is Meyappan here. I am running Pulsar in a Docker container on Windows. I find Pulsar is supported only on macOS and Linux, and I was unable to do a pip install pulsar-client==2.5.0 to get the Pulsar Python client installed. I want to know how I can access the Pulsar CLI commands like pulsar-admin and pulsar-client when running Pulsar as a Docker image in a container on Windows. Or can I access the Pulsar CLI commands only when Pulsar is installed on macOS or Linux? ---- 2020-02-27 17:35:59 UTC - Tanner Nilsson: I would think you could docker exec into your Docker container and run `bin/pulsar-admin` or `bin/pulsar-client` from there? (haven't been working with it on Windows) ---- 2020-02-27 17:49:07 UTC - John Duffie: Could you elaborate on your management question? Is the goal to ensure backups of the schema, or use of the schema in non-Pulsar flows like http/grpc? Regarding the latter, years back I wrote my own registry and support library that was decoupled from the broker technology - transports included http, kafka, jmx :slightly_smiling_face: .
The approach used a message consisting of a common record with fields containing the schema version and a byte[] of the inner, serialized Avro. That allowed a common solution to work over different technologies. Now I want to find an off-the-shelf solution for a very specific use case. ---- 2020-02-27 18:16:01 UTC - Joshua Dunham: Definitely the latter, but my goal is really to make something that is very human focused. ---- 2020-02-27 18:16:24 UTC - Joshua Dunham: Unfortunately the inertia is towards siloed apps and siloed schemas. ---- 2020-02-27 18:16:46 UTC - Joshua Dunham: I want a schema manager that supports multiple transports but also acts as a hub to get things cleaned up. ---- 2020-02-27 18:34:29 UTC - John Duffie: I so wish there were an off-the-shelf solution. I do not like the tight coupling between the broker technology and the schema registry that is becoming widespread so that things are “easier” ---- 2020-02-27 18:36:15 UTC - John Duffie: I don’t know of a FOSS solution. I’m looking at the Hortonworks one you mention. My current project is not based on HDP. Is that a requirement? ---- 2020-02-27 18:41:49 UTC - Ryan Slominski: Is having both a changes (add/remove/update) entity topic and a snapshot (entire table in a message) topic a good strategy? Some consumers want just what has changed; others want the entire list of all entities. Single producer. Some consumers might benefit from both to speed up replay - I might try a "change ID" for consumers who want both, to correlate change messages with snapshot messages. Seems like Event Sourcing and CQRS are a minefield. ---- 2020-02-27 18:43:10 UTC - John Duffie: the update case is the one that has me perplexed ---- 2020-02-27 18:43:52 UTC - Ryan Slominski: update as in someone changes one of the fields - e.g. the lastname in a user entity record might change ---- 2020-02-27 18:45:15 UTC - John Duffie: luckily, my use case is simpler. 99.99% of traffic is time-series IoT data.
updates only occur for some metadata ---- 2020-02-27 18:45:53 UTC - John Duffie: yes ---- 2020-02-27 18:47:18 UTC - Ryan Slominski: Yeah, I'm attempting to store metadata in Pulsar too. My primary data is simple event data (notifications). ---- 2020-02-27 18:47:40 UTC - John Duffie: sending the entire object instead of just the changed fields is not a deal breaker for me ---- 2020-02-27 18:48:45 UTC - Ryan Slominski: Update isn't a deal breaker - I could just support add and remove, and when something is "updated", just remove the old record and add the new one ---- 2020-02-27 18:51:01 UTC - Ryan Slominski: The question is - is having effectively a materialized view stored in Pulsar in a separate topic a good strategy? If stored outside in an RDBMS or something it might be easier to query, but that requires two different interfaces for clients. ---- 2020-02-27 19:02:06 UTC - Joshua Dunham: Not at all ---- 2020-02-27 19:02:10 UTC - Joshua Dunham: It's all API driven ---- 2020-02-27 19:02:29 UTC - Joshua Dunham: feasible to pull the code and add in a module that syncs to Pulsar at its API endpoint. ---- 2020-02-27 19:06:34 UTC - John Duffie: want a snapshot in time of the state of the table? ---- 2020-02-27 19:09:12 UTC - Ryan Slominski: yeah, instead of storing the table in an RDBMS, and instead of having to replay the change log topic to reconstruct state, I was thinking of having an additional topic, published by the same producer as the change topic, that carries a snapshot of the full table ---- 2020-02-27 19:10:41 UTC - John Duffie: If the goal is to have a way to track table changes, then it sounds ok. But if the goal is to avoid having 2 different clients - 1 to Pulsar and 1 to the RDBMS - then I’m more skeptical about the benefit ---- 2020-02-27 19:12:52 UTC - Ryan Slominski: Often the same client needs not only record changes, but must also obtain the full table of records when first connecting / re-connecting.
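The snapshot-plus-change-log replay discussed in this thread can be sketched in plain Java. This is a minimal illustration, not anything from Pulsar itself: the `Change` and `Snapshot` types and all values are hypothetical, the Pulsar producer/consumer wiring is elided, and the "change ID" is the correlation field Ryan proposes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical change event: a monotonically increasing change ID plus an
// upsert or removal of one record (value is ignored when remove is true).
record Change(long id, String key, String value, boolean remove) {}

// Hypothetical snapshot: the full table plus the last change ID it includes.
record Snapshot(long lastChangeId, Map<String, String> table) {}

public class TableReplay {
    // Rebuild current state: start from the snapshot, then apply only the
    // change-log entries newer than the snapshot (correlated by change ID).
    static Map<String, String> rebuild(Snapshot snap, Iterable<Change> changeLog) {
        Map<String, String> state = new HashMap<>(snap.table());
        for (Change c : changeLog) {
            if (c.id() <= snap.lastChangeId()) continue; // already in snapshot
            if (c.remove()) state.remove(c.key());
            else state.put(c.key(), c.value());
        }
        return state;
    }

    public static void main(String[] args) {
        Snapshot snap = new Snapshot(2, Map.of("alice", "Smith", "bob", "Jones"));
        List<Change> log = List.of(
            new Change(1, "alice", "Old", false),   // older than snapshot, skipped
            new Change(3, "alice", "Brown", false), // "update" = replace the value
            new Change(4, "bob", null, true));      // removal
        System.out.println(rebuild(snap, log)); // {alice=Brown}
    }
}
```

A freshly connecting client would read the latest snapshot message, then subscribe to the change topic and feed only newer changes into `rebuild`, which avoids replaying the full change log.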
---- 2020-02-27 19:13:53 UTC - John Duffie: ok, that is an interesting use case if there is some form of local cache that needs to be warmed up ---- 2020-02-27 19:16:26 UTC - Ryan Slominski: yeah, kind of a shared config file that has pub/sub notifications when changes occur. ---- 2020-02-27 19:26:28 UTC - Tobias Macey: @Tobias Macey has joined the channel ---- 2020-02-27 19:33:55 UTC - Tobias Macey: This may belong in <#CJLPGAZBM|dev-ruby> but I'm working on deploying a Pulsar cluster with the intent of using it as a buffer for log messages coming from FluentD agents, with those topics then subscribed to by a pair of FluentD aggregators. Unfortunately, the lack of Ruby support among Pulsar clients makes this a bit challenging. Would it be viable to use one of the Kafka API compatibility layers to allow FluentD to treat Pulsar as a Kafka broker, or are those abstractions only available to Java clients that can swap out the class object? ---- 2020-02-27 19:35:45 UTC - Tobias Macey: Alternatively, I'm looking at using the Netty IO client for publishing from FluentD, or something like <https://github.com/kafkaesque-io/pulsar-beam/> ---- 2020-02-27 19:36:03 UTC - Tobias Macey: Has anyone else gone down this path and have any advice? ---- 2020-02-27 21:06:55 UTC - Ryan Slominski: Looks like a bug in the documentation - clicking the next button at the bottom of this page <https://pulsar.apache.org/docs/en/functions-deploy/> results in a 404 ---- 2020-02-27 21:10:32 UTC - Ryan Slominski: Do Pulsar Schemas work with Pulsar Functions? Also, what happens if I have two input topics into a Pulsar Function with different schemas? ---- 2020-02-27 21:26:37 UTC - Joe Francis: This is a solution to a problem. Unless the same problem exists in Pulsar, I am not clear on why Pulsar requires something like this.
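On Ryan's Functions question: a typed Pulsar Function declares a single input type, so (as I understand it) every input topic attached to the function must carry messages that deserialize to that type; topics with genuinely different schemas would need a common type such as byte[] or a union schema. A minimal sketch, with the `org.apache.pulsar.functions.api` interfaces stubbed inline so it compiles standalone (the real `Context` carries logging, state, and publishing helpers):

```java
// Stubs standing in for org.apache.pulsar.functions.api -- the real library
// provides these; they are reproduced here only so the sketch is self-contained.
interface Context {}
interface Function<I, O> { O process(I input, Context context) throws Exception; }

// A typed function over String input: with this signature, every input topic
// the function subscribes to must be consumable with a String schema.
public class ExclaimFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        return input + "!";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new ExclaimFunction().process("hello", null)); // hello!
    }
}
```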
---- 2020-02-27 21:29:03 UTC - Pedro Cardoso: I am simply looking to understand whether a user can write custom logic for partition assignment within Pulsar and perform rebalancing logic in a gradual way (ideally without partition-consumption downtime) ---- 2020-02-27 21:29:15 UTC - Joe Francis: Consumption-side scaling is automatic in Pulsar. Use a shared consumer, and scale the number of consumers up and down as needed. ---- 2020-02-27 21:31:22 UTC - Pedro Cardoso: Is there documentation on how this consumption-side scaling is performed? Can I introduce custom logic into it? ---- 2020-02-27 21:32:07 UTC - Pedro Cardoso: Do cluster rebalances imply downtime, or can consumers continue to consume? ---- 2020-02-27 21:32:09 UTC - Joe Francis: None of that repartitioning/regrouping is required. You will kind of need to get cured of it, if you move from Kafka to Pulsar. ---- 2020-02-27 21:33:03 UTC - Joe Francis: It’s a radically new, simpler way of doing things. None of those concepts from the Kafka world apply ---- 2020-02-27 21:34:36 UTC - Pedro Cardoso: Clearly there is some information I am missing; I will need to investigate some more. Thank you for your time. ---- 2020-02-27 21:36:18 UTC - Joe Francis: <https://pulsar.apache.org/docs/v1.19.0-incubating/getting-started/ConceptsAndArchitecture/> ---- 2020-02-27 21:36:59 UTC - Joe Francis: <https://streaml.io/blog/pulsar-segment-based-architecture> ---- 2020-02-27 21:37:18 UTC - Ryan Slominski: Created a pull request for the docs issue... <https://github.com/apache/pulsar/pull/6434> ---- 2020-02-27 21:43:13 UTC - Joe Francis: Simply stated, if the load on your topic increases, you spin up a few more instances of your consumer to scale up the consumption rate, and shut them down when it’s done. It sounds like snake oil, but that is how it works. ---- 2020-02-27 21:46:36 UTC - Pedro Cardoso: My particular use-case is for stateful consumers; simply spinning up more instances is not enough.
Think stream processing on x nodes; you need to scale up to 2x nodes, state must be copied over, and ideally you would want the new nodes to be physically co-located. ---- 2020-02-27 21:47:36 UTC - Pedro Cardoso: Perhaps the stream is partitioned in some way, and you want rebalancing for a specific partition ---- 2020-02-28 00:31:25 UTC - Eugen: When consuming like `Message<byte[]> msg = consumer.receive()`, is there a way to get the partition of the `msg`? ---- 2020-02-28 03:30:09 UTC - Sijie Guo: Yes. You can get the topic name and use the TopicPartition class to get the partition id ---- 2020-02-28 03:46:15 UTC - Eugen: Thanks! I was under the impression that `message.getTopicName()` returns the name as given by the producer, rather than the internal name, which looks like `topic:<persistent://public/default/topic-0-partition-4>` ---- 2020-02-28 05:20:22 UTC - Sijie Guo: no, it doesn’t help based on the current implementation ---- 2020-02-28 07:20:47 UTC - Sijie Guo: the Kafka API compatibility layer is a Java API wrapper; it is not a protocol compatibility layer. We have developed a Kafka protocol handler for Pulsar. It will be open sourced in a week or so. ---- 2020-02-28 07:21:31 UTC - Sijie Guo: @Tobias Gustafsson you can use the websocket interface to publish to Pulsar. ---- 2020-02-28 07:40:41 UTC - Sijie Guo: > but it’s not possible to reference a secret in a configmap @eilonk you can mount the secret to the file path that you specify in the configmaps. ---- 2020-02-28 07:44:48 UTC - Sijie Guo: @Pedro Cardoso - for autoscaled stream processing, you can use key-range consumers/readers. You don’t need to rebalance partitions. You can divide the key space into multiple key ranges, then assign key ranges to your stateful processors. You open a consumer/reader to read the events for a key range and persist the state by key range. When you scale up your stream processing, you copy the key-range state to the new processing node and then resume reading events from that key range.
You can also do a key-range split if a range becomes a bottleneck. In this way, you control the scale-up of your state, but you don’t need to do any partition rebalancing. ---- 2020-02-28 07:45:14 UTC - Sijie Guo: you can also scale stateful processing beyond the number of partitions. ---- 2020-02-28 08:13:59 UTC - Manuel Mueller: I would second Tanner. 1. run `docker ps` while Pulsar is up 2. find either the ID or the name 3. use `docker exec -it [yourID/name] bash` to get into the container and be able to run pulsar-admin commands ---- 2020-02-28 09:04:13 UTC - Rolf Arne Corneliussen: Have you considered topic compaction, to reduce the time used to replay the change log? ----
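The key-range scheme Sijie Guo describes a few messages up can be sketched in plain Java. The consumer/reader wiring and state copying are elided; the 65536-slot hash space mirrors the size Pulsar's Key_Shared subscription uses, but `String.hashCode` here is only a stand-in for the real hash function, and the range layout and node names are purely illustrative.

```java
import java.util.List;
import java.util.TreeMap;

// A contiguous key-hash range [start, end], owned by one stateful processor.
record KeyRange(int start, int end, String processor) {}

public class KeyRangeRouter {
    static final int HASH_SPACE = 65536; // 2^16 slots, as in Pulsar's Key_Shared

    private final TreeMap<Integer, KeyRange> byStart = new TreeMap<>();

    KeyRangeRouter(List<KeyRange> ranges) {
        for (KeyRange r : ranges) byStart.put(r.start(), r);
    }

    // Route a message key to the processor owning its hash slot. To scale out,
    // split a range in half, copy that half's state to the new node, and have
    // the new node resume reading events for its half of the range.
    String route(String key) {
        int slot = (key.hashCode() & Integer.MAX_VALUE) % HASH_SPACE;
        return byStart.floorEntry(slot).getValue().processor();
    }

    public static void main(String[] args) {
        // Two processors split the hash space evenly; no partition rebalance
        // is involved, only the ownership of key ranges changes on scale-up.
        KeyRangeRouter router = new KeyRangeRouter(List.of(
            new KeyRange(0, 32767, "node-a"),
            new KeyRange(32768, 65535, "node-b")));
        System.out.println(router.route("device-42"));
    }
}
```

Because state is persisted per key range rather than per partition, splitting a hot range only requires moving that range's state, which is the gradual, downtime-free scale-up Pedro was asking about.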
