2020-04-08 09:27:19 UTC - xue: I think there is a problem with how function applications are created. Suppose the jar of a function application contains configuration files, such as XML or properties files, on its classpath: they can't be found at all. Only files inside the Java instance.jar are found, so you have to put these configuration files into the Java instance.jar, which is not reasonable ---- 2020-04-08 09:27:49 UTC - Sijie Guo: I see. So it seems that it points to the wrong ledgers root path. I will have to take a closer look at it. ---- 2020-04-08 09:29:55 UTC - Sijie Guo: They are ignored. +1 : Poul Henriksen ---- 2020-04-08 09:34:09 UTC - Sijie Guo: I think it is related to how you run your applications. Are you running the producers/consumers in multiple AZs? If your producers and consumers are mostly in one AZ, then geo-replication makes more sense.
If your producers and consumers are already spread across multiple AZs, then a single cluster spanning multiple AZs makes more sense. You can also leverage rack-aware placement in BookKeeper to configure data placement across multiple AZs, and a broker isolation strategy to configure failover of topics between brokers in different AZs. Be aware, though, of the network cost between AZs. +1 : Hiroyuki Yamada man-bowing : Hiroyuki Yamada ---- 2020-04-08 09:34:19 UTC - Sijie Guo: Hope that is useful to you. ---- 2020-04-08 09:36:53 UTC - Sijie Guo: @Ken Huang the offloader only offloads closed ledgers. It doesn’t offload the latest active ledger. ---- 2020-04-08 09:37:57 UTC - Sijie Guo: For experimental purposes, try unloading the topic first, then produce messages to create a new ledger. Unload and produce multiple times to create multiple ledgers. Then you can trigger the offload command to verify that the old ledgers are offloaded. ---- 2020-04-08 11:55:11 UTC - Ganga Lakshmanasamy: Hi, I am trying to connect to Pulsar from a WildFly server. Both are running in Docker and I am able to ping the Pulsar container from WildFly. But when trying to connect from the app deployed inside WildFly, the error below is thrown: org.apache.pulsar.shade.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: pulsar/172.27.0.6:6650 ---- 2020-04-08 12:16:59 UTC - Hiroyuki Yamada: Thank you very much @Sijie Guo for the quick reply. It helps a lot but I still have some unknowns. First, are they very different from an availability or performance perspective? Second, are use cases that fit geo-replication supposed to manage topics in a more isolated way? e.g. producers in AZ-a only create messages for topic-a in cluster-a in AZ-a, producers in AZ-b only create messages for topic-b in cluster-b in AZ-b, and so on? My application isn’t like that: every producer (possibly spread across multiple AZs) creates messages for a single topic. 
Also, it would be great if you could point me to some references about `broker isolation strategy`. I can’t find anything with that keyword. ---- 2020-04-08 13:25:59 UTC - Damien: Hi, I am new to Pulsar and I came across this tweet: <https://twitter.com/sijieg/status/1202727993145098247> and wondered if the architecture presented in the slide already exists, and if I could already directly query the “stream reader”, “sub-stream reader”, or “segment reader” that @Sijie Guo is talking about for the Flink integration? ---- 2020-04-08 13:38:39 UTC - Jon P: @Jon P has joined the channel ---- 2020-04-08 13:52:44 UTC - Rattanjot Singh: Need to build a custom Docker image for Pulsar. `<https://github.com/apache/pulsar/blob/master/docker/pulsar/Dockerfile>` What should we pass as the argument PULSAR_TARBALL? ---- 2020-04-08 13:58:35 UTC - Addison Higham: @Rattanjot Singh maven will build the docker image for you. Do `mvn package -Pdocker -DskipTests` ---- 2020-04-08 14:04:59 UTC - Michael Gokey: @Michael Gokey has joined the channel ---- 2020-04-08 15:03:46 UTC - Jon P: Hi. I’m trying to implement OpenTracing/Jaeger tracing across Pulsar topics. I came across this page (<https://github.com/apache/pulsar/wiki/PIP-23:-Message-Tracing-By-Interceptors>), which starts well but unfortunately gets a little light on detail when it comes to (de-)serializing trace context from the Message and the implementation of the concrete producer/consumer interceptors. I had a look at the code for the PR and started an implementation based on ProducerInterceptor<T>, but that class has now been deprecated in favour of an un-typed version ‘ProducerInterceptor’ with no indication why in the comments. I’m grabbing the repo to check the commit history, but can anyone explain the move to the un-typed ProducerInterceptor, which means I can’t get a type-safe reference to my custom message class? Also, any suggestions where I might find good code examples of interceptors/tracing? Many thanks. 
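Jon P’s question above (and the answer later in the thread) boils down to carrying the trace context in message properties. The following is a minimal, self-contained sketch of that inject/extract pattern with no Pulsar or OpenTracing dependency; the property keys (`x-trace-id`, `x-span-id`) and class name are illustrative assumptions, not a Pulsar convention:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: propagating a trace context through message properties,
// analogous to OpenTracing's TEXT_MAP carrier. Keys are hypothetical.
public class TracePropagation {

    // Producer side: copy the current span context into the
    // properties map that travels with the message.
    public static Map<String, String> inject(String traceId, String spanId) {
        Map<String, String> properties = new HashMap<>();
        properties.put("x-trace-id", traceId);
        properties.put("x-span-id", spanId);
        return properties;
    }

    // Consumer side: rebuild the parent context from the received
    // message properties so a child span can be started from it.
    public static String[] extract(Map<String, String> properties) {
        return new String[] {
            properties.get("x-trace-id"),
            properties.get("x-span-id")
        };
    }
}
```

With a real client, the producer side would put these entries into the message properties at send time and the consumer side would read them back; the map here stands in for those properties.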
---- 2020-04-08 16:10:01 UTC - Sijie Guo: Have you tried telnet to that port? ---- 2020-04-08 16:20:39 UTC - Sijie Guo: > are they very different from availability or performance perspective I think it is more about availability. Performance-wise, based on my experience, cross-zone latency should be fine. > use cases that fit geo-replication are supposed to manage topics in more isolated way Yes, in some sense. So in your use case, you are more looking for a single cluster spanning multiple AZs. > it would be great if you can point me to some references about `broker isolation strategy` It is called namespace isolation strategy; sorry for my typo. Currently we have very little documentation about it. My team is working on filling up the documentation <http://pulsar.apache.org/docs/en/pulsar-admin/#ns-isolation-policy> man-bowing : Hiroyuki Yamada +1 : Hiroyuki Yamada ---- 2020-04-08 16:40:45 UTC - Sijie Guo: @Damien Yes. It already exists in the current Pulsar architecture. • “Stream Reader” is the Pulsar reader, which reads the stream of events of a single topic (partition). Events are read in order. • “Sub-Stream Reader” is the key_range reader, which was introduced recently and will be released as part of the 2.6.0 release. A key-range reader reads a subset of events in a stream (a topic partition); you can have multiple readers reading one stream. • “Segment Reader” is a bypass-broker reader that reads data directly from BookKeeper or tiered storage. The segment reader is currently part of the pulsar/bookkeeper storage API. Pulsar already uses it in implementing the Presto connector, known as Pulsar SQL. If you are looking for the individual reader APIs, they are already there, but they are not well connected to present a whole picture. If you are looking for a more concrete example of how these APIs are used, the pulsar-flink integration (<https://github.com/streamnative/pulsar-flink>) we are doing is a good example. 
This is the connector that we are contributing to upstream Flink as part of FLIP-72. Eventually we will do the same for other integrations, such as pulsar-spark and pulsar-presto. ---- 2020-04-08 16:48:13 UTC - Damien: thanks very much for your quick answer @Sijie Guo! OK, I understand now. I first thought that pulsar-flink was going to be “plugged” into Pulsar through these readers only for the needs of the incoming “pulsar state backend”. By the way, besides FLIP-72 (which does not seem to mention the state backend integration), where can I follow more of the development, specifically about the state backend integration? Any GitHub branch? Another Jira ticket? ---- 2020-04-08 16:52:16 UTC - Damien: about “bypass-broker reader that reads data directly from bookkeeper or tiered storage”, is it transparent from a client point of view when querying the *Segment Reader*? This is the feature I’m mainly interested in. ---- 2020-04-08 17:00:59 UTC - Sijie Guo: @Jon P > gets a little light on detail when it comes to (de-)serializing trace context from the Message & implementation of the concrete Producer/Consumer Interceptor You can (de-)serialize the trace context into message properties. Currently the `Message` interface doesn’t provide any methods to mutate the message, so you might have to check whether the `Message` is a `MessageImpl` and cast it. The MessageImpl has a MessageBuilder underneath, which you can use for adding properties. ```MessageImpl msg = (MessageImpl) message;
msg.getMessageBuilder()
   .addProperties(KeyValue.newBuilder().setKey(tag).setValue(tagValue));```
This is not an elegant solution, but it is something you can achieve with the current API. We can look into how to expose this builder to the interceptors so people can easily access it. 
> class has now been deprecated in favour of an un-typed version ‘ProducerInterceptor’ with no indication why in the comments The `api.ProducerInterceptor<T>` was deprecated in favor of `api.interceptor.ProducerInterceptor` because of multiple-schema support in the producer: 1) it is hard to support multiple schemas in a single producer if the interceptor is strongly typed to one type; 2) in most interceptors, the `value` of a message ideally shouldn’t be changed after interception; most of the time people use interceptors to enrich a message. > Also, any suggestions where I might find good code examples of interceptors/tracing? @Penghui Li - we can think about providing an open-tracing interceptor example for your reference. --- The interceptor was added a few releases ago. We are still filling the gaps between code and documentation. Sorry for the lack of documentation and code examples. You are also welcome to contribute to either the code examples or the documentation. ---- 2020-04-08 17:17:58 UTC - Sijie Guo: Currently FLIP-72 is focused more on getting the existing pulsar-flink integration, including source, sink, and catalog, into upstream Flink, so people can consume an “official” pulsar-flink connector from the Flink distribution. It doesn’t include the state integration. --- Until FLIP-72 is done, I think most discussions and integrations will be added to <https://github.com/streamnative/pulsar-flink>. We are working with the Flink community to first get FLIP-72 done. Once FLIP-72 is done, we can move the flink-pulsar discussions to the Flink development list and get help from the Flink community. There used to be a GitHub issue tracking the roadmap of the features we are adding to this integration. Currently we are still designing the whole state integration, because we want to make sure that whatever state integration we do in pulsar-flink can be commonly used by both Pulsar Functions and pulsar-flink (or any integrations with other data processing engines). 
You can star and follow the repo :wink: Some references: <https://cwiki.apache.org/confluence/display/FLINK/FLIP-72%3A+Introduce+Pulsar+Connector> <https://issues.apache.org/jira/browse/FLINK-14146> <https://github.com/apache/flink/pull/10875> <https://github.com/apache/flink/pull/10455> If you are interested in this, I would suggest commenting on the JIRA issue, so the Flink community can see the need for this integration. ---- 2020-04-08 17:19:45 UTC - Damien: awesome, thank you again! I will try to contribute as much as possible :wink: ---- 2020-04-08 17:22:28 UTC - Shivji Kumar Jha: Hi @Sijie Guo Do you suggest that we move the flush to catch instead of finally? Will that be okay? Reference: <https://github.com/apache/pulsar/pull/6695> ---- 2020-04-08 17:29:39 UTC - Shivji Kumar Jha: +1 on this. I was trying the same some time back but dropped it to pick up later. It would be a useful addition for us. ```we can think about providing an open-tracing interceptor example for your reference.``` ---- 2020-04-08 17:36:41 UTC - p: @p has joined the channel ---- 2020-04-08 17:53:01 UTC - Sijie Guo: Can flush throw an exception? If flush also throws an exception, then we need to know where the exception is thrown and whether we need to call flush. Alternatively, we can consider adding a flag indicating whether the flush succeeded. +1 : Shivji Kumar Jha ---- 2020-04-08 18:05:46 UTC - Frank Kelly: Newbie question here - we're considering using Pulsar to stream binary data (audio) in a continuous stream back to clients. Is there a recommended implementation/design to realize that? I've seen mention of WebSocket but was wondering if there are alternatives. Thanks in advance. 
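The flush discussion above (around PR 6695) suggests tracking whether the write path threw before deciding to flush. A minimal sketch of that flag idea, using a hypothetical `Sink` interface rather than the actual Pulsar connector API:

```java
// Sketch of the flush-flag idea: record whether the write succeeded,
// only flush when it did, and fold a flush failure into the result so
// callers can distinguish "nothing happened" from "written and flushed".
// The Sink interface here is illustrative, not a Pulsar type.
public class FlushFlag {

    interface Sink {
        void write(String record) throws Exception;
        void flush() throws Exception;
    }

    // Returns true only when both the write and the flush succeeded.
    public static boolean writeAndFlush(Sink sink, String record) {
        boolean succeeded = false;
        try {
            sink.write(record);
            succeeded = true;
        } catch (Exception e) {
            // write failed; the flag stays false and we skip the flush
        } finally {
            if (succeeded) {
                try {
                    sink.flush();
                } catch (Exception e) {
                    succeeded = false; // the flush itself failed
                }
            }
        }
        return succeeded;
    }
}
```

Keeping the flush in `finally` but guarding it with the flag avoids flushing after a failed write, while still reporting a flush failure separately.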
---- 2020-04-08 18:19:55 UTC - Matteo Merli: You could consider using non-persistent topics ---- 2020-04-08 18:21:32 UTC - Frank Kelly: My limited understanding of Pulsar is that, like most pub/sub systems, by default it transmits discrete "chunks" of data rather than a stream - but I could be wrong ---- 2020-04-08 18:23:35 UTC - p: Has anyone here had issues with Pulsar corrupting data? Also, a colleague of mine tried copying data from one Pulsar instance to a new one, but hasn't been successful in starting the new instance (using standalone). ---- 2020-04-08 18:23:40 UTC - Matteo Merli: That's correct. Regarding persistent vs non-persistent: you could see the difference like TCP vs UDP ---- 2020-04-08 18:24:08 UTC - Frank Kelly: OK thanks, I will take a look ---- 2020-04-08 18:24:41 UTC - Matteo Merli: persistent: data is replicated and stored on disk; non-persistent: best-effort in-memory delivery ---- 2020-04-08 18:25:58 UTC - p: You aren't going to get much out of Pulsar unless you are dealing with audio segments that you want to store in order. Pulsar is going to deliver 1000s of events to your consumer at a time, and you will go through them one by one and stream the corresponding data to your clients. ---- 2020-04-08 18:26:08 UTC - p: the streaming part is on you; Pulsar isn't going to help you +1 : Frank Kelly ---- 2020-04-08 18:29:30 UTC - p: Also, I have a standalone instance running, and it's taking 10% CPU even though I'm not doing anything with it. Is this normal? ---- 2020-04-08 18:31:21 UTC - Matteo Merli: @p if you read the description of non-persistent topics, you can see that the above statement is not necessarily true. Also, ordering is not the only reason one would use Pulsar ---- 2020-04-08 18:32:44 UTC - p: I know it's not the only reason. The topic is on streaming, though; I think the topic type is actually off-topic in this thread ---- 2020-04-08 18:33:26 UTC - Matteo Merli: no, it's not ---- 2020-04-08 18:33:35 UTC - p: how so? 
---- 2020-04-08 18:33:44 UTC - Matteo Merli: non-persistent topic behavior is fundamentally different ---- 2020-04-08 18:34:56 UTC - Matteo Merli: and no: non-persistent topics are really not about "data streaming" ---- 2020-04-08 18:35:20 UTC - p: Pulsar isn't really about data streaming, in general ---- 2020-04-08 18:35:34 UTC - Matteo Merli: uhm.. ---- 2020-04-08 18:35:43 UTC - Matteo Merli: really? ---- 2020-04-08 18:36:28 UTC - p: yes ---- 2020-04-08 18:36:45 UTC - p: I have a streaming API on top of Pulsar as my data layer ---- 2020-04-08 18:37:05 UTC - p: Pulsar didn't help very much in the streaming part ---- 2020-04-08 18:37:30 UTC - p: also, Pulsar sends huge chunks of messages to its clients at a time ---- 2020-04-08 18:37:56 UTC - Matteo Merli: how big is controllable ---- 2020-04-08 18:38:06 UTC - p: the default is 1000 messages at a time ---- 2020-04-08 18:39:08 UTC - p: unless you are breaking up your data into multiple events, I don't see how one would even get the illusion of streaming from Pulsar ---- 2020-04-08 18:41:47 UTC - p: Pulsar isn't going to stream you the bytes of an event; you get the whole event and work with it one whole event at a time. ---- 2020-04-08 18:42:04 UTC - Matteo Merli: are you still referring to data streaming or audio streaming? ---- 2020-04-08 18:43:08 UTC - p: is there a difference? ---- 2020-04-08 18:43:33 UTC - Matteo Merli: Of course there is. ---- 2020-04-08 18:43:48 UTC - Matteo Merli: A message is the fundamental atomic unit. You can have messages as small as needed ---- 2020-04-08 18:47:14 UTC - p: sure, you can cut up your binary data into a bunch of messages, and maybe that'll help you; I did mention this. But Pulsar isn't going to help you with the streaming. You will have to figure out how to get your Pulsar events working with whatever streaming protocol you are using (HTTP, or something else). ---- 2020-04-08 19:27:27 UTC - Jon P: Thanks @Sijie Guo, that's useful background. 
I’d started down a different route - adding `Map<String, String> spanContext` to my data/schema (Message.getValue()) for injecting/extracting the context, as we hope to trace over multiple different messaging systems. If the builder provides a way to set properties on MessageImpl, that gives me another option - I’ll give it a try. I did manage to find a reference to the ticket in the commit history, so I read the detail on why the move to the untyped ProducerInterceptor - I take the point about supporting multiple schemas in one interceptor. If you were able to provide an OpenTracing (soon to be OpenTelemetry, but I’m sure the principles are the same) producer/consumer interceptor pair, that would be really useful, although I’m getting there slowly - mostly trying to get my head around how to use inject/extract to link up multiple spans across services/transports right now! One last minor question - do you plan to support the `Format.Builtin.TEXT_MAP_INJECT` & `Format.Builtin.TEXT_MAP_EXTRACT` carrier types in pulsar-client? Thanks again for the help! ---- 2020-04-08 19:59:29 UTC - Sijie Guo: > for injecting/extracting the context as we hope to trace over multiple different messaging systems. Yeah, good point. I think most messaging systems already support properties or headers, so it should be ‘safe’ to make this assumption and use properties to propagate the tracing context. > do you plan to support the `Format.Builtin.TEXT_MAP_INJECT` & `Format.Builtin.TEXT_MAP_EXTRACT` carrier types in pulsar-client? If we are mapping the properties as the carrier, it seems most natural that we will support TEXT_MAP_INJECT and TEXT_MAP_EXTRACT. We might be able to support other carrier types as well. ---- 2020-04-08 20:52:23 UTC - Markus Kobler: @Markus Kobler has joined the channel ---- 2020-04-08 20:57:13 UTC - Rahul: I'm wondering if there is any way to authorize a JWT token with multiple roles (an array of valid Pulsar roles). 
Is there a workaround if there is no inbuilt support for it? ---- 2020-04-08 21:02:52 UTC - Rahul: Is it possible to use a nested value for the `tokenAuthClaim` broker configuration? ---- 2020-04-08 21:04:12 UTC - Sijie Guo: Currently I don’t think it supports that yet. What is your use case? ---- 2020-04-08 21:13:41 UTC - Rahul: Currently the upstream system is issuing us a JWT token with an array of roles. Right now we have only one role in that array; later on it is possible we get multiple roles. This will help us keep the roles at a very granular level, while at the same time we can compose multiple roles to grant more permissions. @Sijie Guo ---- 2020-04-08 21:30:26 UTC - Matteo Merli: You can make that work by using a custom AuthorizationProvider. There you will be able to access the token data and decide whether it has access to the specific topic ---- 2020-04-08 22:27:54 UTC - Sijie Guo: @Rahul yeah, as Matteo mentioned, it can be done in a customized AuthorizationProvider. If you think it is a generic use case, I would also suggest creating a GitHub issue for it. ---- 2020-04-09 02:47:46 UTC - Hiroyuki Yamada: Thank you very much! It is very helpful. ---- 2020-04-09 03:48:22 UTC - Rahul: @Sijie Guo @Matteo Merli I have already considered the `AuthorizationProvider` interface, but it takes the role as a string parameter, so that's where I'm stuck. ---- 2020-04-09 04:43:27 UTC - Ganga Lakshmanasamy: Nope. I tried ping. Let me try telnet ---- 2020-04-09 05:00:09 UTC - Sijie Guo: @Rahul I think you also need to implement AuthenticationProvider: interpret the token and return a string that represents a list of roles. Then the AuthorizationProvider can interpret that string as the list of roles. ---- 2020-04-09 05:02:17 UTC - Rahul: @Sijie Guo got it. It's a hack but this is helpful for the time being. Thanks. ---- 2020-04-09 05:03:45 UTC - Sijie Guo: Yes. It is a hack. 
It would be great if you could create an issue for us, so that we can consider enhancing our interfaces. ---- 2020-04-09 05:04:26 UTC - Rahul: @Sijie Guo sure. Will do that. +1 : Sijie Guo ---- 2020-04-09 07:10:57 UTC - Frans Guelinckx: @Frans Guelinckx has joined the channel ---- 2020-04-09 07:13:53 UTC - Jon P: Apologies, I realised my last question about TEXT_MAP_INJECT was daft - it’s Jaeger which needs to support the additional carrier types, not Pulsar! ----
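The multi-role workaround from the JWT thread above (a custom AuthenticationProvider returning one string that encodes a list of roles, which a custom AuthorizationProvider then decodes) can be sketched as the round-trip below. The comma delimiter, class, and method names are assumptions for illustration, not Pulsar APIs; a real delimiter must be one that cannot appear inside a role name.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the role-list "hack": flatten the token's role array into a
// single delimited string on the authentication side, split it back on
// the authorization side, and authorize if any decoded role is allowed.
public class RoleListCodec {

    // AuthenticationProvider side: ["reader", "writer"] -> "reader,writer"
    public static String encode(List<String> roles) {
        return String.join(",", roles);
    }

    // AuthorizationProvider side: "reader,writer" -> ["reader", "writer"]
    public static List<String> decode(String role) {
        return Arrays.asList(role.split(","));
    }

    // Grant access when at least one of the decoded roles is permitted.
    public static boolean isAuthorized(String role, List<String> allowedRoles) {
        return decode(role).stream().anyMatch(allowedRoles::contains);
    }
}
```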

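For the audio-streaming thread above, p's suggestion of cutting binary data into multiple smaller messages can be sketched as a simple chunk/reassemble pair; the class name and chunk size are illustrative only, and in practice each chunk would be the payload of one Pulsar message consumed in order.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a binary payload (e.g. audio) into fixed-size pieces so
// each piece fits in one message, then concatenate them back in order
// on the consumer side.
public class AudioChunker {

    public static List<byte[]> chunk(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int len = Math.min(chunkSize, data.length - offset);
            byte[] piece = new byte[len];
            System.arraycopy(data, offset, piece, 0, len);
            chunks.add(piece);
        }
        return chunks;
    }

    // Consumer side: rebuild the original byte stream from the chunks.
    public static byte[] reassemble(List<byte[]> chunks) {
        int total = chunks.stream().mapToInt(c -> c.length).sum();
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, pos, c.length);
            pos += c.length;
        }
        return out;
    }
}
```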