Slack digest for #general - 2019-06-28

Apache Pulsar Slack Fri, 28 Jun 2019 02:11:28 -0700

2019-06-27 09:48:51 UTC - Guillaume Rosauro: I am using Pulsar with a Debezium 
Postgres connector. Does anybody know how to set the debezium configuration 
like "tables.whitelist", "transforms.xx" and so on ?
----
2019-06-27 10:24:18 UTC - Alexandre DUVAL: Hi, on `bin/pulsar-admin 
broker-stats load-report`, why do topics have this encoding: 
```orga_858600a8-74f4-4d75-a8a3-f5b868be093c/app_3b65d74f-66f2-4218-bc91-6bce8d1e486a/0x80000000_0xc0000000```
----
2019-06-27 10:24:43 UTC - Alexandre DUVAL: what's the value of the topic name 
`0x80000000_0xc0000000`? and why?
----
2019-06-27 12:06:11 UTC - Mate Varga: Hello! I know this is a _very_ generic 
question, but maybe ... so would you recommend using Avro schemas/messages over 
JSON or protobuf for a system that has not been using any of those so far (so 
there's no legacy, no investment in either of those)? We're just deploying 
Pulsar and starting to use it for the first few use cases, and we're trying to 
make a decision about what to use. (We have our own opinions on this but it'd 
be good to hear what the community/authors prefer or recommend.)
----
2019-06-27 12:47:51 UTC - Sijie Guo: @jia zhai @tuteng can you help with this 
question?
----
2019-06-27 12:50:01 UTC - Sijie Guo: > how many broker daemon we require

The minimum number of brokers is 1.

It depends on your throughput, the number of partitions, and many other factors 
of your workload. I would suggest doing a load test with your workload and 
estimating based on your business traffic.

> when should be specify the zk, book keeper configuration (in which daemon

I am not sure what the question is. Can you explain more?
----
2019-06-27 12:50:31 UTC - Sijie Guo: 0x80000000_0xc0000000 this is the bundle
----
2019-06-27 12:53:52 UTC - Alexandre DUVAL: what do you mean by bundle?
----
2019-06-27 12:53:58 UTC - Alexandre DUVAL: whole topics?
----
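A minimal sketch of how a bundle name like `0x80000000_0xc0000000` can be read, 
assuming a namespace's 32-bit hash space is split into equal ranges and each 
topic lands in the range that contains the hash of its name. The helper names 
are hypothetical, `zlib.crc32` is only a stand-in for whatever hash the broker 
actually uses, and treating the lower bound as inclusive is an assumption.

```python
import zlib

FULL_RANGE = 0x100000000  # size of the 32-bit hash space


def split_into_bundles(num_bundles):
    """Return equal-width bundle names; the last upper bound is 0xffffffff."""
    step = FULL_RANGE // num_bundles
    bounds = [i * step for i in range(num_bundles)] + [0xFFFFFFFF]
    return [f"0x{lo:08x}_0x{hi:08x}" for lo, hi in zip(bounds, bounds[1:])]


def parse_bundle(name):
    """Turn '0x80000000_0xc0000000' into a (lower, upper) pair of integers."""
    lo, hi = name.split("_")
    return int(lo, 16), int(hi, 16)


def bundle_for(topic, bundles):
    """Pick the bundle whose range contains the (stand-in) hash of the topic."""
    h = zlib.crc32(topic.encode("utf-8"))  # assumption: not the broker's hash
    for name in bundles:
        lo, hi = parse_bundle(name)
        if lo <= h < hi or (hi == 0xFFFFFFFF and h == hi):
            return name
    raise ValueError("hash outside every bundle range")


bundles = split_into_bundles(4)
print(bundles)
# ['0x00000000_0x40000000', '0x40000000_0x80000000',
#  '0x80000000_0xc0000000', '0xc0000000_0xffffffff']
print(bundle_for("persistent://my-tenant/my-ns/my-topic", bundles))
```

Under these assumptions the last path segment in the load report names a hash 
range that groups many topics of the namespace, not a single topic.
----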
2019-06-27 13:01:08 UTC - jia zhai: @Guillaume Rosauro Here is the link on how 
to set the Debezium parameters:
<http://pulsar.apache.org/docs/en/next/io-cdc-debezium/>
----
2019-06-27 13:02:16 UTC - jia zhai: By default, there are no differences from 
the original Debezium settings
----
2019-06-27 13:04:32 UTC - Guillaume Rosauro: @jia zhai: it is not true. If I 
add some parameters not known by pulsar verification process (for example 
tables.whitelist), I receive a configuration error...
----
2019-06-27 13:04:54 UTC - Sijie Guo: 
<http://pulsar.apache.org/docs/en/administration-load-balance/>
----
2019-06-27 13:05:02 UTC - Sijie Guo: I would suggest checking out this page 
first
----
2019-06-27 13:05:41 UTC - jia zhai: Are you using the latest version?
----
2019-06-27 13:06:09 UTC - jia zhai: try this 
one:<https://dist.apache.org/repos/dist/dev/pulsar/pulsar-2.4.0-candidate-2/connectors/>
----
2019-06-27 13:06:42 UTC - Guillaume Rosauro: I have tried with 2.4.0 RC1. 
Has something changed on this with RC2?
----
2019-06-27 13:07:09 UTC - jia zhai: I think no more changes between them
----
2019-06-27 13:07:21 UTC - Guillaume Rosauro: I think so too
----
2019-06-27 13:07:53 UTC - jia zhai: what parameter did you set?
----
2019-06-27 13:08:52 UTC - Guillaume Rosauro: "table.whitelist"
----
2019-06-27 13:09:13 UTC - Guillaume Rosauro: see 
<https://debezium.io/docs/connectors/postgresql/>
----
2019-06-27 13:14:05 UTC - jia zhai: In your earlier message, you mentioned you 
are using "tables.whitelist"
----
2019-06-27 13:14:17 UTC - jia zhai: not "table.whitelist"
----
2019-06-27 13:15:21 UTC - jia zhai: @Guillaume Rosauro Please help check if 
there is some typo.
----
2019-06-27 13:16:07 UTC - jia zhai: Pulsar does not check these parameters; it 
simply passes them to Debezium
----
2019-06-27 13:29:34 UTC - Guillaume Rosauro: OK thanks, I will re-check it. 
Maybe that was just a typo.
----
2019-06-27 13:43:22 UTC - Alexandre DUVAL: got it!
----
2019-06-27 13:43:22 UTC - Alexandre DUVAL: thx
----
2019-06-27 13:49:07 UTC - Richard Sherman: subscription names are unique per 
topic
----
2019-06-27 14:09:04 UTC - Guillaume Rosauro: @jia zhai I still have this error :
```apache-pulsar-2.4.0 bin/pulsar-admin source localrun  --sourceConfigFile 
debezium-postgres-charlie-config.yaml
while scanning a double-quoted scalar
 in 'reader', line 16, column 20:
      table.whitelist: "public\.(t_).*"```
----
2019-06-27 14:12:01 UTC - jia zhai: what is the whole content of  file 
debezium-postgres-charlie-config.yaml?
----
2019-06-27 14:12:17 UTC - Aaron: Is there a way to force delete a namespace 
with active topics/subscriptions?
----
2019-06-27 14:24:58 UTC - Chris Bartholomew: @Aaron Not that I know of.  I do 
this programmatically by force deleting all the topics in the namespace and then 
deleting the namespace. I think you can still run into issues if clients are 
connected and allowAutoTopicCreation is on. Then you have a race to delete the 
namespace before the client can recreate the topic.
----
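A rough sketch of the order of operations described above: force-delete every 
topic, then delete the namespace. It only builds the `pulsar-admin` command 
lines and runs nothing. The function name is hypothetical, and the `--force` 
flag spelling on `topics delete` is an assumption to check against the 
installed CLI version; partitioned topics may need a different subcommand.

```python
def plan_namespace_delete(namespace, topics, admin="bin/pulsar-admin"):
    """Return CLI invocations in order: all topics first, the namespace last."""
    commands = [
        # Assumption: `--force` disconnects clients before deleting the topic.
        [admin, "topics", "delete", "--force", topic]
        for topic in topics
    ]
    commands.append([admin, "namespaces", "delete", namespace])
    return commands


plan = plan_namespace_delete(
    "public/default",
    ["persistent://public/default/a", "persistent://public/default/b"],
)
for argv in plan:
    print(" ".join(argv))
```

Each entry can be handed to `subprocess.run(argv, check=True)`. The race 
mentioned above still applies: a connected client with auto topic creation 
enabled can recreate a topic between the last topic delete and the namespace 
delete.
----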
2019-06-27 14:26:43 UTC - Aaron: Okay, thanks.
----
2019-06-27 14:26:55 UTC - Sijie Guo: I would suggest avro.
----
2019-06-27 14:32:29 UTC - Sam Leung: :thumbsup:
----
2019-06-27 14:33:13 UTC - Ryan Samo: Hey guys, I’m playing with pulsar 
functions which are super sweet so thanks for all of the hard work! My question 
is when it comes to performance of your functions on bare metal hardware, are 
there other settings to tweak besides parallelism? I know on consumers you can 
change the receiverQueueSize, etc to help with backlog and things like that, 
and functions are just more producers and consumers. Is there a way to make 
further tweaks by specifying it via the admin cli or possibly the function yml? 
Just not seeing much on documentation around function performance.

Thanks!
----
2019-06-27 14:43:59 UTC - Ryan Samo: Maybe using the ConsumerConfig inputSpecs 
in the FunctionConfig class?
----
2019-06-27 14:48:56 UTC - Sijie Guo: I think it depends on the latency for 
executing your function. Each function instance invokes functions synchronously 
in one thread. So if your function takes time to process an event, you have to 
increase the number of instances to parallelize processing the events.
----
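As a back-of-the-envelope illustration of the point above: if one instance 
handles one event at a time, it completes at most `1000 / latency_ms` events 
per second, so the instance count has to cover the total busy time per second. 
The function name and the numbers are made up, and the estimate ignores 
consumer and producer overhead.

```python
def instances_needed(events_per_second, latency_ms):
    """Smallest instance count whose combined rate keeps up with the input.

    Each instance is assumed to process events one at a time, so it is busy
    for `latency_ms` milliseconds per event.
    """
    if events_per_second <= 0 or latency_ms <= 0:
        raise ValueError("rate and latency must be positive")
    busy_ms_per_second = events_per_second * latency_ms
    return max(1, -(-busy_ms_per_second // 1000))  # ceiling division


print(instances_needed(2000, 5))   # 10
print(instances_needed(100, 25))   # 3
print(instances_needed(50, 2))     # 1
```

A backlog that keeps growing at a given parallelism suggests the real 
per-event latency is higher than the figure used in the estimate.
----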
2019-06-27 14:51:05 UTC - Ryan Samo: Ok makes sense, I’m just trying out the 
exclamation function for now in general testing and seeing backlogs so I 
thought maybe I would tweak it a bit. Change the parallelism, make it a 
partitioned topic, etc.
----
2019-06-27 14:51:56 UTC - ishara: Hello I have a little project using Pulsar, 
not many people are going to use it. Can we use the standalone ver. or should 
we bring more machines ?
----
2019-06-27 14:53:12 UTC - Addison Higham: speaking of pulsar functions... what 
is the state of the pulsar functions being executed against k8s? I see options 
in the example config file for how to launch them in the cluster, I also see 
references to it in the pulsar function docs, but not much more than that
----
2019-06-27 14:56:23 UTC - David Kjerrumgaard: @ishara The standalone version is 
really intended for sandbox environments and dev work. It is fine for 
collaborating in those types of scenarios, but for any non-dev work I would 
recommend spinning up a small cluster with 2 or 3 nodes.
----
2019-06-27 14:57:11 UTC - ishara: Ok thanks for the answer 
:slightly_smiling_face:
----
2019-06-27 14:58:20 UTC - David Kjerrumgaard: @Addison Higham What type of 
details would you like to see covered in the documentation?  What is missing 
that we can add?  Thanks in advance for the feedback, as it will make our docs 
better
----
2019-06-27 14:58:46 UTC - Guillaume Rosauro: here it is :
```
tenant: "public"
namespace: "default"
name: "debezium-postgres-charlie"
topicName: "debezium-postgres-charlie-topic"
archive: "connectors/pulsar-io-debezium-postgres-2.4.0.nar"
parallelism: 1
configs:
  database.hostname: "localhost"
  database.port: "5432"
  database.user: "postgres"
  database.password: "postgres"
  database.dbname: "charlie_db"
  database.server.name: "dbserver1"
  schema.whitelist: "public"
  table.whitelist: "public\.(t_).*"
  pulsar.service.url: "pulsar://127.0.0.1:6650"
```
----
2019-06-27 15:01:47 UTC - Aaron: @Chris Bartholomew I was able to delete the 
namespace, but I am still getting IO warnings about topics in the deleted 
namespace. Do you know why these are showing up? The pulsar-admin cli shows 
there are no topics under this namespace now.
----
2019-06-27 15:02:04 UTC - Addison Higham: looking at the master docs, it 
appears to be quite a bit improved with this page: 
<https://pulsar.apache.org/docs/en/next/functions-runtime/>
however, it is still missing from the reference section: 
<https://pulsar.apache.org/docs/en/next/reference-configuration/>
----
2019-06-27 15:14:02 UTC - jia zhai: @Guillaume Rosauro It seems it failed at 
the YAML file format check
----
2019-06-27 15:14:58 UTC - jia zhai: “public\.(t_).*”
----
2019-06-27 15:20:35 UTC - Sijie Guo: @Addison Higham most of the settings in 
functions worker config are already self explained.
----
2019-06-27 15:24:42 UTC - Addison Higham: yeah, I just wasn't seeing any docs 
or any options because I was looking at the 2.3.2 page initially, switching to 
master docs greatly improves the situation :slightly_smiling_face:
+1 : David Kjerrumgaard
----
2019-06-27 15:27:06 UTC - Aaron: It appears that the broker is doing internal 
topic lookups on topics that don't exist anymore
----
2019-06-27 15:55:08 UTC - Baliles-Heroku: @Baliles-Heroku has joined the channel
----
2019-06-27 17:56:50 UTC - Benjamin.Hess: @Benjamin.Hess has joined the channel
----
2019-06-27 18:24:49 UTC - Chris Bartholomew: @Aaron Do you have clients that 
are still out there trying to produce and consume on the namespace? I did a 
quick test where I connected a consuming client, then ran my delete routine. I 
didn't see any IO warnings, but my delete routine also deletes the tenant, so 
it's not an apples to apples comparison.
----
2019-06-27 18:29:07 UTC - Aaron: Yes, that's what it was. Thanks for your help.
+1 : Chris Bartholomew
----
2019-06-27 19:30:55 UTC - Sergii Zhevzhyk: @Sergii Zhevzhyk has joined the 
channel
----
2019-06-27 19:58:16 UTC - Aaron: When authorization and authentication are 
turned on, is there a way to set the user-role for an unauthenticated user 
(i.e. one that connects via the regular 6650 port)? The anonymousUserRole field 
seems to only go into effect if authentication is disabled.
----
2019-06-27 19:59:35 UTC - Matteo Merli: the port you use (6650 for unencrypted 
or 6651 for TLS) is not strictly tied to whether the client is passing 
credentials
----
2019-06-27 20:01:15 UTC - Matteo Merli: `anonymousUserRole=anonymous` will 
treat every unauthenticated user as if they had passed credentials with 
principal `anonymous`. You can then grant permission to `anonymous` to perform 
certain actions (eg: produce/consume on certain namespaces)
----
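A conceptual model of the behaviour described in the message above, not broker 
code: a connection without credentials is mapped to the configured anonymous 
role, and that role is then checked against namespace grants like any other 
principal. All function names and the shape of the grants table are 
hypothetical.

```python
def resolve_role(principal, anonymous_user_role=None):
    """Role to authorize a connection with (conceptual model only)."""
    if principal is not None:
        return principal
    # No credentials: fall back to the configured anonymous role, if any.
    return anonymous_user_role


def is_allowed(role, action, namespace, grants):
    """True if `role` was granted `action` on `namespace`."""
    if role is None:
        return False
    return action in grants.get(namespace, {}).get(role, set())


grants = {"public/default": {"anonymous": {"produce", "consume"}}}

role = resolve_role(None, anonymous_user_role="anonymous")
print(role, is_allowed(role, "produce", "public/default", grants))

role = resolve_role(None)  # anonymous role not configured
print(role, is_allowed(role, "produce", "public/default", grants))
```

In this model a client without credentials is only authorized when both 
pieces are in place: the anonymous role is configured and that role holds a 
grant on the namespace.
----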
2019-06-27 20:04:41 UTC - David Fisher: I have a question about tenants and 
clusters. Suppose I create a private pulsar cluster for a particular tenant and 
then want to join that tenant to a much, much larger multi-tenant multi-cluster 
in the cloud. Does that just "work"?
----
2019-06-27 20:05:25 UTC - Aaron: I tried that, and I get lots of "Role null is 
not allowed to lookup topic" and "Failed to authorized null on cluster 
persistent://public/default/rando". The namespace public/default shows that 
anonymous has permission to produce/consume on the namespace.
----
2019-06-27 20:09:43 UTC - David Fisher: Or is a PIP/PR required?
----
2019-06-27 20:11:22 UTC - Guillaume Rosauro: ok I have found the problem: I 
must escape the string like this "public\\.(t_).*"
100 : Sijie Guo
+1 : Sijie Guo
----
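A small illustration of why the doubled backslash matters. Inside a 
double-quoted YAML scalar, `\.` is not a valid escape sequence, while `\\` 
decodes to a single backslash. The sketch uses Python's `json` module as a 
stand-in for the YAML parser, on the assumption that both reject `\.` and 
accept `\\` inside double quotes, and uses Python's `re` in place of the regex 
engine Debezium applies. The table names are made up.

```python
import json
import re

# The two spellings as they appear between double quotes in the config file.
broken = r'"public\.(t_).*"'    # backslash-dot is not a recognised escape
fixed = r'"public\\.(t_).*"'    # escaped backslash


def decode(quoted):
    """Decode a double-quoted string; return None if an escape is invalid."""
    try:
        return json.loads(quoted)
    except json.JSONDecodeError:
        return None


pattern = decode(fixed)
print(decode(broken))   # None
print(pattern)          # public\.(t_).*

# The decoded value is the regular expression the connector would receive.
print(bool(re.fullmatch(pattern, "public.t_orders")))   # True
print(bool(re.fullmatch(pattern, "publicXt_orders")))   # False
print(bool(re.fullmatch(pattern, "public.orders")))     # False
```

A single-quoted YAML scalar does not process backslash escapes, so writing the 
value as 'public\.(t_).*' is another way to get the same regular expression.
----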
2019-06-27 20:25:12 UTC - David Kjerrumgaard: @David Fisher There are a couple 
of options in such a scenario depending upon what you are looking to 
accomplish. One approach would be to add the "private" cluster and the larger 
pulsar cluster to the same Pulsar instance. Another approach would be to 
migrate the existing data into the larger cluster (and eventually 
retire/repurpose the smaller cluster)  and segment off access to that data via 
normal namespace-level access control policies
----
2019-06-27 20:25:57 UTC - David Kjerrumgaard: @David Fisher But it really 
depends on your envisioned "end state" and how isolated you want the private 
clusters data to be.
----
2019-06-27 20:27:57 UTC - David Fisher: That's not what I have in mind at all. 
I mean for a private cluster to be exchanging messages with the larger cloud 
cluster. The private cluster would be for Tenant0 and the cloud for 
Tenant0-1000.
----
2019-06-27 20:28:51 UTC - David Fisher: The second part of the use case is that 
the two clusters may become detached for days due to network disruptions like 
from a wildfire.
----
2019-06-27 20:30:25 UTC - David Fisher: An analogy would be how Lotus Notes 
worked 25 years ago with modem based synchronization
----
2019-06-27 20:31:02 UTC - David Kjerrumgaard: @David Fisher So the "private" 
cluster would act in a "store and forward" mode to the larger cluster? Would 
the communication be bi-directional?
----
2019-06-27 20:31:24 UTC - David Fisher: Yes and Ideally yes.
----
2019-06-27 20:33:21 UTC - David Kjerrumgaard: @David Fisher In that case, 
adding the "private" cluster and the larger pulsar cluster to the same Pulsar 
instance (where a Pulsar instance is just a collection of multiple Pulsar 
clusters) would be the best approach, as it would allow you to enable 
geo-replication of the topic data.
----
2019-06-27 20:35:47 UTC - David Fisher: @David Kjerrumgaard I thought so. Is 
the "private" cluster automatically limited to a particular tenant's namespaces 
and topics?
----
2019-06-27 20:38:32 UTC - David Kjerrumgaard: @David Fisher No, that is the 
other tricky part. So a pulsar instance has a separate metastore that tracks 
topics, namespaces and access policies across the multiple clusters.  This is 
what enables geo-replication of a topic, as the "instance" is aware of the 
topic and namespace, not just an individual cluster.
----
2019-06-27 20:39:20 UTC - David Kjerrumgaard: @David Fisher But if you define 
the access policies correctly, and use a central authentication provider, then 
you can limit access to the data.
----
2019-06-27 20:40:27 UTC - David Kjerrumgaard: give the "edge" private cluster 
user write only access on a topic and allow a different user that would access 
the data from the cloud based cluster read access, etc.
----
2019-06-27 20:42:11 UTC - David Fisher: @David Kjerrumgaard Even if the access 
policies limit things does all of the topic data (BK) flow back and forth or 
just the metastore (ZK)?
----
2019-06-27 20:45:38 UTC - David Kjerrumgaard: @David Fisher The meta store data 
for the Pulsar Instance would be stored in a single ZK quorum and BOTH clusters 
would have read access to that data. They would read the data to enforce client 
access and implement replication.  Anyone with admin credentials would be able 
to modify those policies using the pulsar-admin CLI, etc.
----
2019-06-27 20:47:04 UTC - David Kjerrumgaard: @David Fisher The message data 
would be replicated (copied) between the "edge" clusters storage layer and the 
cloud clusters storage layer. It would be under the same topic/namespace, so 
clients on the cloud can access it IF they have sufficient permission
----
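A toy in-memory model of the "store and forward" behaviour discussed above, 
assuming the edge side keeps a backlog while the link is down and drains it in 
order once the link returns. It is not how Pulsar's replication is 
implemented; the class and attribute names are hypothetical, and a real 
backlog would sit in durable storage on the edge cluster.

```python
from collections import deque


class StoreAndForward:
    """Buffer messages locally and forward them in order when the link is up."""

    def __init__(self):
        self.backlog = deque()   # messages waiting to be forwarded
        self.delivered = []      # what the remote cluster has received
        self.link_up = True

    def publish(self, message):
        """Always accept the message locally, then try to forward."""
        self.backlog.append(message)
        self.flush()

    def flush(self):
        """Forward as much of the backlog as the link allows, oldest first."""
        while self.link_up and self.backlog:
            self.delivered.append(self.backlog.popleft())


edge = StoreAndForward()
edge.publish("m1")

edge.link_up = False          # network outage
edge.publish("m2")
edge.publish("m3")
print(edge.delivered, list(edge.backlog))

edge.link_up = True           # link restored
edge.flush()
print(edge.delivered, list(edge.backlog))
```

Local producers are never blocked by the outage in this model; the cost is 
that the backlog grows for as long as the link stays down, so retention on the 
edge side has to be sized for the longest outage expected.
----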
2019-06-27 20:49:55 UTC - David Fisher: @David Kjerrumgaard Up here in wildfire 
country our networks will break. we lost internet for 3 days in the Tubbs fire 
in October 2017. So, I have concerns about the ZK portion.
----
2019-06-27 20:51:49 UTC - David Kjerrumgaard: @David Fisher That makes sense. I 
faced similar issues with a customer who had mining operations across 
Australia. These were very remote locations and the space for the entire 
hardware deployment on-site was no bigger than a coat closet.  :smiley:
----
2019-06-27 20:52:10 UTC - David Fisher: Several years ago we had bad issues 
with ZK breakage with outages of only <10 minutes.
----
2019-06-27 20:52:49 UTC - David Fisher: In the same datacenter (thank you 
VMWare migrations)
----
2019-06-27 20:53:35 UTC - David Kjerrumgaard: The key to making that work was 
having one local ZK node in the global quorum co-located at the same site. 
Then we just had to accept the fact that "eventually" consistent meant in 3-4 
days.....and plan for those scenarios where the data wasn't fresh for that 
period
----
2019-06-27 20:54:12 UTC - David Kjerrumgaard: Since the management of the 
policies was done at HQ, I suggested using a secondary communication channel to 
communicate the changes.....
----
2019-06-27 20:55:02 UTC - David Kjerrumgaard: the best, low-tech solution 
turned out to be a phone.......we just called the admin on-site and had them 
check for the replicated changes.  If they didn't see them, they were 
authorized to make them locally on the ZK node  :smiley:
----
2019-06-27 20:55:31 UTC - David Kjerrumgaard: not sexy, but effective
----
2019-06-27 20:57:06 UTC - David Kjerrumgaard: Bear in mind, these are policy 
changes, so they change very infrequently
----
2019-06-27 20:57:49 UTC - David Kjerrumgaard: We are NOT talking about BK 
placement data. That is kept in a separate ZK cluster on EACH cluster.
----
2019-06-27 20:58:04 UTC - David Fisher: @David Kjerrumgaard OK, A SOP would be 
provided to put in any needed ZK change, but that would be unlikely as in that 
scenario any changes would be not allowed, or could wait.
----
2019-06-27 20:59:00 UTC - David Kjerrumgaard: @David Fisher Correct, these 
changes were often a result of someone requesting access to the data, or a 
security audit finding an issue
----
2019-06-27 20:59:27 UTC - David Fisher: @David Kjerrumgaard Yeah I know BK is 
ZK + DistributedLog
+1 : David Kjerrumgaard
----
2019-06-27 21:01:31 UTC - David Fisher: Thanks for the discussion, it helps me 
understand how to architect this use case. About a year ago, someone asked for 
Pulsar to do something special for the edge. I think it is just a question of 
cooking up a scenario like we just discussed.
----
2019-06-27 21:01:50 UTC - David Kjerrumgaard: No problem, glad to 
help.......and good luck
----
2019-06-27 21:03:07 UTC - David Fisher: I don't know if the guy I was talking 
to will proceed, but this is certainly a common situation.
----
2019-06-27 21:03:51 UTC - Aaron: @Matteo Merli any ideas?
----
2019-06-27 21:16:39 UTC - Matteo Merli: > Role null

That shouldn’t happen if `anonymousUserRole=anonymous` is set in `broker.conf`
----
2019-06-27 21:27:51 UTC - Aaron: I have it set in both the broker.conf and 
standalone.conf, and I am running the standalone.
----
2019-06-28 03:25:39 UTC - Addison Higham: hrm... trying to understand PIP-36 
(<https://github.com/apache/pulsar/pull/4247>).

There is also PIP-37, which works by chunking messages into many pieces. 
PIP-36, however, works by making message size potentially unbounded, but I 
imagine it has a practical limit based on other constraints on the system (i.e. 
bookkeeper segment size, message receive buffer, etc). It may also have serious 
performance impacts.

Curious if there is any idea of what a practical limit is... like would 50 MB 
be reasonable?
----
2019-06-28 03:27:25 UTC - Sijie Guo: @Addison Higham PIP-36 is just to make 
settings configurable. It doesn't suggest setting a very high value, as that is 
not reasonable. PIP-37 should be the one for supporting large message size.
----
2019-06-28 03:27:39 UTC - Addison Higham: :thumbsup: that is what I assumed
----
2019-06-28 03:27:54 UTC - Addison Higham: curious if you have any thoughts 
though as to where we do get with PIP-36 though
----
2019-06-28 03:28:24 UTC - Addison Higham: I don't need anything huge... but 
thinking of a CDC use case where I can get some big rows that might be like... 
15 MB or so with some large values
----
2019-06-28 03:32:03 UTC - Sijie Guo: PIP-36 is for getting around issues in CDC
----
2019-06-28 03:32:26 UTC - Sijie Guo: there will be some changes that are very 
large
----
2019-06-28 03:32:27 UTC - Addison Higham: so 10 - 15 MB seem perhaps reasonable?
----
2019-06-28 03:32:42 UTC - Sijie Guo: yes 15MB is reasonable
----
2019-06-28 03:34:18 UTC - Addison Higham: okay, I am currently porting some 
stuff from my own debezium stuff (that writes to kinesis) and is getting around 
Kinesis limits by doing a claim check pattern to write the real object to s3, 
deciding if I should keep that or just dump it
----
2019-06-28 03:42:45 UTC - Sijie Guo: @Addison Higham if you are sure of the 
maximum cap, you can probably adjust. But if you are not very sure, I would 
suggest writing to s3.
----
2019-06-28 03:43:13 UTC - Addison Higham: that is kinda where I am leaning 
since I already have it implemented...
+1 : Sijie Guo
----
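A minimal sketch of the claim check pattern mentioned above, assuming payloads 
over a size threshold go to an external store and only a reference travels in 
the message. A plain dict stands in for S3, and the threshold, field names, 
and function names are all hypothetical.

```python
import hashlib

MAX_INLINE_BYTES = 1024 * 1024  # hypothetical cut-off for inline payloads


def to_message(payload, blob_store, max_inline=MAX_INLINE_BYTES):
    """Wrap a payload: inline if small, otherwise store it and send a key."""
    if len(payload) <= max_inline:
        return {"kind": "inline", "data": payload.hex()}
    key = hashlib.sha256(payload).hexdigest()  # content-addressed key
    blob_store[key] = payload                  # stand-in for an S3 upload
    return {"kind": "claim-check", "key": key, "size": len(payload)}


def from_message(message, blob_store):
    """Recover the original payload on the consumer side."""
    if message["kind"] == "inline":
        return bytes.fromhex(message["data"])
    return blob_store[message["key"]]          # stand-in for an S3 download


store = {}
small = b"row with small values"
large = b"x" * (2 * 1024 * 1024)

for payload in (small, large):
    msg = to_message(payload, store)
    print(msg["kind"], from_message(msg, store) == payload)
```

With this shape the message size stays bounded no matter how large a row 
gets, at the cost of a second round trip for consumers of oversized events.
----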
2019-06-28 03:45:57 UTC - Addison Higham: We have made a lot of contributions 
back into debezium (particularly psql) and while I think we will probably just 
build on top of the `KafkaConnectSource` adapter to enable things like the 
claim checking, I would be curious to see if there is anything that pulsar 
would be interested in upstreaming
----
2019-06-28 04:17:56 UTC - Ritesh Chandra Nailwal: Question: My question is not 
about the number of Pulsar brokers required; instead I was asking about the 
pulsar-daemon (apache-pulsar-2.3.2-bin.tar.gz). Suppose I have 3 ZK nodes: 
should I install the pulsar-daemon on each of the 3 ZooKeeper nodes and 
configure the ZK in the zookeeper.conf of each pulsar-daemon (each conf file 
having a single ZooKeeper server config), or should I have one installation of 
pulsar-daemon and specify all 3 ZooKeeper configs in one zookeeper.conf? I am 
not able to find this type of information.
----
2019-06-28 07:22:05 UTC - Zhenhao Li: @Zhenhao Li has joined the channel
----
