2020-08-25 09:32:28 UTC - Evion Cane: Hi guys. I am trying to configure
authentication and authorization in Pulsar to connect PulsarAdmin (Java API) to
it.
I have modified the broker.conf file with the following properties:
```authenticationEnabled=true
authenticationProviders=org.apache.pulsar.broker.authentication.AuthenticationProviderToken
authorizationEnabled=true
authorizationProvider=org.apache.pulsar.broker.authorization.PulsarAuthorizationProvider
superUserRoles=admin
tokenPublicKey=my-public.key```
The problem is that when I test authentication and authorization, they do not
seem to work. I can access tenants from Pulsar without any authentication, as
follows:
```import java.util.List;
import org.apache.pulsar.client.admin.PulsarAdmin;

// No authentication configured on this client, yet the call succeeds
PulsarAdmin admin = PulsarAdmin.builder()
    .serviceHttpUrl(url)
    .build();
List<String> tenants = admin.tenants().getTenants();```
From what I understand from the documentation, you need to authenticate to get
the tenants, but I can still access the default tenants without authentication.
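(For comparison, a token-authenticated admin client would look roughly like
this; the token value is a placeholder for a JWT signed with the private key
that matches tokenPublicKey:)
```import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.impl.auth.AuthenticationToken;

// Placeholder token; in practice a JWT signed with the matching private key
PulsarAdmin authenticatedAdmin = PulsarAdmin.builder()
    .serviceHttpUrl(url)
    .authentication(new AuthenticationToken("eyJhbGciOiJSUzI1NiJ9..."))
    .build();```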
----
2020-08-25 11:38:43 UTC - Ravi Shah: Hi guys,
We are facing an issue similar to "Consumer stop receive messages from broker"
(<https://github.com/apache/pulsar/issues/3131>).
We are using Logstash and the input plugin
<https://github.com/se7enkings/logstash-input-pulsar>.
----
2020-08-25 11:39:05 UTC - Ravi Shah: Any help would be appreciated.
----
2020-08-25 14:43:40 UTC - Joshua Decosta: I don’t have that enabled but it
seems like it is
----
2020-08-25 14:43:50 UTC - Joshua Decosta: The functions are being created as
pods
----
2020-08-25 14:52:09 UTC - Joshua Decosta: Is this running in standalone or is
this deployed in a cloud cluster?
----
2020-08-25 14:53:05 UTC - Joshua Decosta: There seem to be a few missing
configurations regardless. You will need the brokerClient configs as well. It
also doesn’t look like you’re passing a token in your admin calls, which would
lead to an error as well
----
2020-08-25 15:00:24 UTC - Evion Cane: I am running it in standalone. I was
testing authentication, so I deliberately did not pass a token, expecting an
error, but none was thrown. If I am using the Java admin API, do I need to
configure the brokerClient (client.config) settings as well?
----
2020-08-25 15:06:23 UTC - Joshua Decosta: You need to configure all of those
aspects in the standalone conf
----
2020-08-25 15:06:49 UTC - Joshua Decosta: And yes, you need the brokerClient
configs as well, since those are what the broker uses to communicate and
authenticate
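(A hedged sketch of the client-side settings in standalone.conf, assuming
token auth; the token file path is a placeholder:)
```# Auth used by the broker's own internal client when it calls itself
brokerClientAuthenticationPlugin=org.apache.pulsar.client.impl.auth.AuthenticationToken
brokerClientAuthenticationParameters=file:///path/to/admin-token.txt```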
----
2020-08-25 15:26:43 UTC - Chris DiGiovanni: If you have set a retention of `-1`
and you are offloading to S3, is there anything that would prevent those
messages from getting deleted? I want to make sure that backlog quotas and/or
TTLs can't prevent Pulsar from deleting ledgers from a bookie even when
retention is infinite and you are offloading. Currently my bookie disk space
usage doesn't look right: Pulsar is reporting 54G of ledgers on disk, I have
a 3x replica set and a combined raw storage of 1.3TB, yet I'm sitting at
roughly 1TB of used disk space...
----
2020-08-25 15:27:25 UTC - Chris DiGiovanni: This is running 2.5.2 of Pulsar
----
2020-08-25 15:46:59 UTC - Addison Higham: Yeah, if you have logs from the
stateful sets. The role your functions run as is, by default, the role you
deployed your functions with. That role needs the "functions" permissions
to download the package
----
2020-08-25 15:48:23 UTC - Joshua Decosta: Do these aspects rely on the AuthZ
methods at all?
----
2020-08-25 15:49:30 UTC - Joshua Decosta: Is there no way to configure some
default functions permission to use with function creation?
----
2020-08-25 15:52:21 UTC - Addison Higham: You can check which ledgers have been
offloaded with `pulsar-admin topics stats-internal <topic>`; the output
includes details on which ledgers have been offloaded.
As for BookKeeper disk usage, see this doc
<https://bookkeeper.apache.org/docs/4.11.0/getting-started/concepts/>,
specifically the "disk compaction" section. Essentially, BookKeeper packs
multiple ledgers into entry log files. When you delete a ledger, you don't
immediately get the space back; an entry log file is only compacted once the
ratio of deleted entries in it rises above a certain threshold. You can tune
those settings as needed
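(A hedged sketch of the relevant BookKeeper settings; the values shown are
illustrative and defaults vary by version:)
```# How long the garbage collector waits between runs, in ms
gcWaitTime=900000
# Entry logs whose share of live data drops below these thresholds
# become candidates for minor/major compaction
minorCompactionThreshold=0.2
minorCompactionInterval=3600
majorCompactionThreshold=0.8
majorCompactionInterval=86400```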
----
2020-08-25 15:55:24 UTC - Addison Higham: It does rely on AuthZ, see
`org.apache.pulsar.broker.authorization.AuthorizationProvider#allowFunctionOpsAsync`.
And yes, you can implement another interface to get much more control over
function auth, see
`org.apache.pulsar.functions.auth.KubernetesFunctionAuthProvider`
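(A rough sketch of hooking that method, assuming the 2.6-era signature;
`my-functions-role` is a hypothetical role name:)
```import java.util.concurrent.CompletableFuture;

import org.apache.pulsar.broker.authentication.AuthenticationDataSource;
import org.apache.pulsar.broker.authorization.PulsarAuthorizationProvider;
import org.apache.pulsar.common.naming.NamespaceName;

public class CustomAuthorizationProvider extends PulsarAuthorizationProvider {
    @Override
    public CompletableFuture<Boolean> allowFunctionOpsAsync(
            NamespaceName namespaceName, String role, AuthenticationDataSource authData) {
        // Hypothetical policy: a dedicated functions role is always allowed;
        // everything else falls back to the default namespace-level check
        if ("my-functions-role".equals(role)) {
            return CompletableFuture.completedFuture(true);
        }
        return super.allowFunctionOpsAsync(namespaceName, role, authData);
    }
}```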
----
2020-08-25 15:56:54 UTC - Joshua Decosta: Ok, but when creating the function,
does it take my client auth data and use that for the function's permissions?
Considering I’m using tokens that will expire, that would be a problem for the
functions
----
2020-08-25 15:56:58 UTC - Addison Higham: The way the default implementation
works:
- it snags the token you used to deploy the function
- it stores that token in a Kubernetes secret
- the statefulset mounts that secret and uses it to interact with the cluster
If you aren't using token auth, you need to implement your own
`KubernetesFunctionAuthProvider`
+1 : Joshua Decosta
----
2020-08-25 15:59:25 UTC - Joshua Decosta: Is there a way to set the function
creds within config files?
----
2020-08-25 15:59:33 UTC - Joshua Decosta: Similar to the brokerClient, etc?
----
2020-08-25 15:59:57 UTC - Addison Higham:
<https://pulsar.apache.org/docs/en/next/functions-runtime/#kubernetes-functions-authentication>
<- this doc isn't in 2.6.1 yet (we should backport docs to release
branches) but here is the doc on master
+1 : Joshua Decosta
----
2020-08-25 16:00:14 UTC - Joshua Decosta: Thanks
----
2020-08-25 16:07:24 UTC - Addison Higham: That doc adds a lot more context, so
hopefully it helps.
I have thought about this problem a fair bit, and with the current extension
points I think you have 3 primary options:
1. Implement the `KubernetesFunctionAuthProvider`. It could easily talk to
another service to fetch a token, and it can update tokens, but it doesn't have
a built-in way to refresh them automatically. However, you do have access to
modify the stateful set in that API, so adding a sidecar pod that refreshes
tokens and stores them in a shared volume would be doable.
2. You could also use the
`org.apache.pulsar.functions.runtime.kubernetes.KubernetesManifestCustomizer`
to do much the same and modify the statefulset spec to add in a sidecar or init
container (see the sketch after this list).
3. Use the default implementation, then have some other task outside Pulsar
that refreshes tokens by manually updating the secrets. Because the secret is
mounted as a volume, it does pick up updates, so you could simply have a k8s
cron job that updates the secrets. The biggest downside here is that you still
have the restriction that at least the initial token is the one that was used
to upload the function
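(For option 2, a minimal sketch assuming the `customizeStatefulSet` hook and
the 2.6-era Kubernetes client models; the sidecar image is a placeholder:)
```import io.kubernetes.client.models.V1Container;
import io.kubernetes.client.models.V1StatefulSet;
import org.apache.pulsar.functions.proto.Function;
import org.apache.pulsar.functions.runtime.kubernetes.KubernetesManifestCustomizer;

public class TokenSidecarCustomizer implements KubernetesManifestCustomizer {
    @Override
    public V1StatefulSet customizeStatefulSet(
            Function.FunctionDetails funcDetails, V1StatefulSet statefulSet) {
        // Hypothetical sidecar that periodically refreshes the auth token
        // into a volume shared with the function container
        V1Container sidecar = new V1Container()
                .name("token-refresher")
                .image("example.com/token-refresher:latest"); // placeholder image
        statefulSet.getSpec().getTemplate().getSpec().addContainersItem(sidecar);
        return statefulSet;
    }
}```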
----
2020-08-25 16:18:25 UTC - Joshua Decosta: Is it not possible to configure the
clientAuth* keys in the function-worker to allow for a default token setup used
with the functions?
----
2020-08-25 16:43:55 UTC - Ming: Nor does it support topic regexes, because of
URL formatting restrictions.
----
2020-08-25 17:04:36 UTC - Evion Cane: Ok, thank you. I will try that.
----
2020-08-25 17:05:08 UTC - Joshua Decosta: I’ve managed to get this working in
standalone before, so feel free to reach back here
----
2020-08-25 18:20:05 UTC - Tim Corbett: In trying to minimize unnecessary
cross-rack/region traffic to minimize bandwidth costs, does the broker have any
mechanism to prefer reads from its local bookie? Does the broker even have a
notion of which rack/region it is in?
----
2020-08-25 19:54:36 UTC - Addison Higham: @Tim Corbett that is possible using
`bookkeeperClientReorderReadSequenceEnabled` and by enabling a
rackAware/regionAware placement policy: see
<https://github.com/apache/pulsar/pull/3171> and the original issue for more
details
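(A hedged sketch of the broker.conf side; the key names are real broker
settings, the values illustrative:)
```# Reorder read requests so closer bookies are tried first
bookkeeperClientReorderReadSequenceEnabled=true
# Enable rack-aware (or region-aware) ensemble placement
bookkeeperClientRackawarePolicyEnabled=true
bookkeeperClientRegionawarePolicyEnabled=false```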
+1 : Tim Corbett
----
2020-08-25 20:03:20 UTC - Aaron: What draft of JSON Schema does Pulsar use?
----
2020-08-25 20:57:25 UTC - Addison Higham: I think it might be possible if you
overwrite the start command, but I am not completely sure whether there is a
more fine-grained way of doing it. See
<https://github.com/apache/pulsar/blob/3c5a4232461720b657c6be16f3d789d410da243b/pulsar-functions/runtime/src/main/java/org/apache/pulsar/functions/runtime/kubernetes/KubernetesRuntime.java#L236>
for where that command is built
----
2020-08-25 20:58:12 UTC - Joshua Decosta: Is the function-worker.conf only
meant for standalone?
----
2020-08-25 21:04:56 UTC - Addison Higham: No, it gets loaded by the functions
worker regardless of how the functions worker is run.
However, it is important to remember that functions executed in their own pod
*do not* run the function worker. It is a bit poorly named, but that's an
outgrowth of functions starting out with modes that either ran in the same VM
or forked a VM.
With Kubernetes, you might think of it as a "function-coordinator", as all it
does is basically manage the creation and deletion of stateful sets.
----
2020-08-25 21:05:51 UTC - Pushkar Sawant: Is anyone running pulsar-manager with
JWT authentication? I am getting this error when I try to list tenants on a
2.6.0 cluster.
```<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 401 Authentication required</title>
</head>
<body><h2>HTTP ERROR 401</h2>
<p>Problem accessing /admin/v2/tenants/public. Reason:
<pre>    Authentication required</pre></p><hr><a
href="http://eclipse.org/jetty">Powered by Jetty://
9.4.20.v20190813</a><hr/>
</body>
</html>```
+1 : Ryan Nowacoski
----
2020-08-25 21:06:56 UTC - Joshua Decosta: If I don’t have “functionAsPods”
enabled, how should the pods operate?
----
2020-08-25 21:07:14 UTC - Joshua Decosta: Because I don’t have it enabled, but
I keep seeing function pods showing up
----
2020-08-25 21:07:30 UTC - Addison Higham: How are they named?
----
2020-08-25 21:07:55 UTC - Joshua Decosta: I’ll have to respond to that next
week because I destroyed my cluster
----
2020-08-25 21:08:12 UTC - Addison Higham: intentionally I hope :wink:
----
2020-08-25 21:08:26 UTC - Joshua Decosta: For sure
----
2020-08-25 21:08:49 UTC - Joshua Decosta: Currently trying to determine the
best way to run those token scripts during cluster configuration
----
2020-08-25 21:09:38 UTC - Addison Higham: oh, it looks like both `functions`
and `functionsAsPods` configure the functions worker to execute functions as
pods
----
2020-08-25 21:09:50 UTC - Addison Higham: (I haven't worked with the chart a
ton recently)
----
2020-08-25 21:10:31 UTC - Addison Higham: they should be created like
`pf-<functionname>-<parallel number>`
----
2020-08-25 21:10:48 UTC - Joshua Decosta: That looks like the naming convention
I did see
----
2020-08-25 21:11:09 UTC - Joshua Decosta: Is that for when they aren’t running
as pods?
----
2020-08-25 21:12:08 UTC - Addison Higham: they are running as pods, I was just
mistaken: `functionsAsPods: true` and `functions: true` now configure it the
exact same way, so both run functions as pods
----
2020-08-25 21:13:21 UTC - Joshua Decosta: Oh, is that on purpose?
----
2020-08-25 21:15:54 UTC - Addison Higham: yes
----
2020-08-25 21:16:25 UTC - Addison Higham: we basically assume that if you want
to run functions on k8s, you want to run them as pods; otherwise it runs the
functions as processes in the broker, which isn't ideal
----
2020-08-25 21:17:06 UTC - Joshua Decosta: That makes sense, although the custom
class is tossing a wrench into my function stuff
----
2020-08-25 21:17:24 UTC - Joshua Decosta: Every time I think I’ve finished my
classes, I then need to implement another class
----
2020-08-25 21:18:08 UTC - Tim Corbett: In trying to validate some performance
characteristics of the .NET Pulsar client, we're comparing against the
pulsar-perf tool. However, we're seeing results that don't match, and one
large difference is that the pulsar-perf tool does not seem to set/vary its
routing keys, so all messages produced go to only one consumer, even when
setting the consumers to key-shared mode. Is there a way to force it to set
random keys, or some other way to validate performance with a key-routed
workload?
----
2020-08-25 21:29:31 UTC - Addison Higham: I assume you are talking about the
performance producer? It does not set any keys, but it does use round-robin,
meaning it will spread load evenly across partitioned topics.
With a consumer against partitioned topics, that is actually multiple consumers.
I would assume what you need to do is enable round-robin in your sending mode
----
2020-08-25 21:33:04 UTC - Tim Corbett: Specifically talking about the included
pulsar-perf tool, and yes, in producer mode. Round-robin across partitions is
fine and all, but what we also need is random or incrementing routing keys so
KeyShared subscriptions can route to all available consumers.
----
2020-08-25 21:33:32 UTC - Tim Corbett: For any given partition, if we have say
200 consumers connected, currently all messages are going to only one of them.
----
2020-08-25 21:36:47 UTC - Addison Higham: pulsar-perf does not currently have a
way to set keys on messages, but I am sure it would be a useful feature!
If you wanted to open an issue for it, that would be wonderful. If you are
interested in contributing it, it should be fairly straightforward: we already
generate a payload, so it would make sense to add an argument that hashes the
payload to set a key
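(For illustration, setting a per-message key with the Java client looks roughly
like this; the service URL, topic name, and `payload` are placeholders:)
```import java.util.UUID;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

PulsarClient client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650") // placeholder service URL
    .build();
Producer<byte[]> producer = client.newProducer()
    .topic("perf-test-topic") // placeholder topic
    .create();
// A random key per message spreads load across KeyShared consumers
producer.newMessage()
    .key(UUID.randomUUID().toString())
    .value(payload)
    .sendAsync();```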
----
2020-08-25 21:37:37 UTC - Tim Corbett: Alright, just wanted to make sure it
wasn't something I was missing. Thanks!
----
2020-08-26 06:14:45 UTC - Guillaume Audic: Hi, we are facing an issue with
event processing.
We have a batch job which produces a lot of messages (thousands of messages) to
a topic, and a scaled microservice which consumes them.
To trigger another event, we have to know when all of the messages of the batch
job have been consumed.
I found this approach with Kafka:
<https://medium.com/wix-engineering/6-event-driven-architecture-patterns-part-2-455cc73b22e1>
in part 6.
I'm looking for something similar with Pulsar, but I don't know if it exists.
We are thinking of using a KV store to hold the state of the job (id of job,
number of messages produced, number of messages consumed, status).
When the producer starts producing events, it puts the number of produced
messages in the KV store.
The consumer increments the number of consumed messages in the KV store.
Once the number of produced messages equals the number of consumed messages,
we set the status in the KV store to complete.
If there is no native approach with Pulsar, can you give me some advice?
----
2020-08-26 07:04:28 UTC - Julius S: A (possible?) generic solution: instead of
storing external state, put the state incrementally into each message. If your
producer knows enough about the batch it is working on, then the producer could
place some metadata into each message about the job_id and the number of the
message within the job, e.g. (job_id xyz, msg 3/10), (job_id xyz, msg 4/10)...
Your consumer can publish to another topic to update its status on how the job
is going. Once the consumer hits the msg with 10/10 it can act accordingly.
Re: your question about a native solution, the Pulsar Functions framework and
the state store it offers via Pulsar’s own BookKeeper is definitely something
to look at if you really need to store state.
Other ideas: Pulsar also has message chunking, so depending on the size of the
batch you might actually be able to put the entire job into a single message.
Also, Pulsar’s individual message acks (instead of just a high-water mark like
Kafka) will help you toward a cleaner solution to a queuing task like this.
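(A minimal sketch of that metadata idea using message properties; the
`job_id`/`seq` names just follow the example above:)
```import org.apache.pulsar.client.api.Producer;

// Tag each message with the job id and its position in the batch,
// e.g. (job_id xyz, msg 3/10)
void sendBatch(Producer<byte[]> producer, String jobId, byte[][] payloads)
        throws Exception {
    int total = payloads.length;
    for (int i = 0; i < total; i++) {
        producer.newMessage()
                .property("job_id", jobId)
                .property("seq", (i + 1) + "/" + total)
                .value(payloads[i])
                .send();
    }
}```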
----
2020-08-26 07:35:32 UTC - Olivier Chicha: @Addison Higham No, sorry, I am kind
of overloaded. Now that we have identified the issue, we have implemented a
workaround and I had to move on to something else.
Regards,
Olivier
----