2019-06-27 09:48:51 UTC - Guillaume Rosauro: I am using Pulsar with a Debezium Postgres connector. Does anybody know how to set Debezium configuration options like "tables.whitelist", "transforms.xx" and so on?
----
2019-06-27 10:24:18 UTC - Alexandre DUVAL: Hi, in the output of `bin/pulsar-admin broker-stats load-report`, why do topics have this encoding: ```orga_858600a8-74f4-4d75-a8a3-f5b868be093c/app_3b65d74f-66f2-4218-bc91-6bce8d1e486a/0x80000000_0xc0000000```
----
2019-06-27 10:24:43 UTC - Alexandre DUVAL: What does the topic name `0x80000000_0xc0000000` mean, and why?
----
2019-06-27 12:06:11 UTC - Mate Varga: Hello! I know this is a _very_ generic question, but maybe ... so would you recommend using Avro schemas/messages over JSON or protobuf for a system that has not been using any of those so far (so there's no legacy, no investment in either of those)? We're just deploying Pulsar and starting to use it for the first few use cases, and we're trying to make a decision about what to use. (We have our own opinions on this but it'd be good to hear what the community/authors prefer or recommend.)
----
2019-06-27 12:47:51 UTC - Sijie Guo: @jia zhai @tuteng can you help with this question?
----
2019-06-27 12:50:01 UTC - Sijie Guo: > how many broker daemon we require
The minimum number of brokers is 1; beyond that it depends on your throughput, the number of partitions and many other factors of your workload. I would suggest doing a load test with your workload and estimating based on your business traffic.
> when should be specify the zk, book keeper configuration (in which daemon
I am not sure what the question is. Can you explain more?
----
2019-06-27 12:50:31 UTC - Sijie Guo: `0x80000000_0xc0000000` is the bundle.
----
2019-06-27 12:53:52 UTC - Alexandre DUVAL: what do you mean by bundle?
----
2019-06-27 12:53:58 UTC - Alexandre DUVAL: whole topics?
----
2019-06-27 13:01:08 UTC - jia zhai: @Guillaume Rosauro Here is the link on how to set the Debezium parameters: <http://pulsar.apache.org/docs/en/next/io-cdc-debezium/>
----
2019-06-27 13:02:16 UTC - jia zhai: By default, there is no difference from the original Debezium settings
----
2019-06-27 13:04:32 UTC - Guillaume Rosauro: @jia zhai: that is not true. If I add parameters that are not known to Pulsar's verification process (for example tables.whitelist), I receive a configuration error...
----
2019-06-27 13:04:54 UTC - Sijie Guo: <http://pulsar.apache.org/docs/en/administration-load-balance/>
----
2019-06-27 13:05:02 UTC - Sijie Guo: I would suggest checking out this page first
----
2019-06-27 13:05:41 UTC - jia zhai: Are you using the latest version?
----
2019-06-27 13:06:09 UTC - jia zhai: Try this one: <https://dist.apache.org/repos/dist/dev/pulsar/pulsar-2.4.0-candidate-2/connectors/>
----
2019-06-27 13:06:42 UTC - Guillaume Rosauro: I have tried with 2.4.0 RC1. Has something changed on this in RC2?
----
2019-06-27 13:07:09 UTC - jia zhai: I don't think there are any more changes between them
----
2019-06-27 13:07:21 UTC - Guillaume Rosauro: I think so too
----
2019-06-27 13:07:53 UTC - jia zhai: What parameter did you set?
----
2019-06-27 13:08:52 UTC - Guillaume Rosauro: "table.whitelist"
----
2019-06-27 13:09:13 UTC - Guillaume Rosauro: see <https://debezium.io/docs/connectors/postgresql/>
----
2019-06-27 13:14:05 UTC - jia zhai: In your earlier message, you mentioned you are using “tables.whitelist”
----
2019-06-27 13:14:17 UTC - jia zhai: not “table.whitelist”
----
2019-06-27 13:15:21 UTC - jia zhai: @Guillaume Rosauro Please check whether there is a typo.
----
2019-06-27 13:16:07 UTC - jia zhai: Pulsar does not check these parameters; it simply passes them through to Debezium
----
2019-06-27 13:29:34 UTC - Guillaume Rosauro: OK thanks, I will re-check it. Maybe that was just a typo.
----
2019-06-27 13:43:22 UTC - Alexandre DUVAL: got it!
----
2019-06-27 13:43:22 UTC - Alexandre DUVAL: thx
----
2019-06-27 13:49:07 UTC - Richard Sherman: subscription names are unique per topic
----
2019-06-27 14:09:04 UTC - Guillaume Rosauro: @jia zhai I still have this error:
```
apache-pulsar-2.4.0 bin/pulsar-admin source localrun --sourceConfigFile debezium-postgres-charlie-config.yaml
while scanning a double-quoted scalar
 in 'reader', line 16, column 20:
    table.whitelist: "public\.(t_).*"
```
----
2019-06-27 14:12:01 UTC - jia zhai: What is the whole content of the file debezium-postgres-charlie-config.yaml?
----
2019-06-27 14:12:17 UTC - Aaron: Is there a way to force delete a namespace with active topics/subscriptions?
----
2019-06-27 14:24:58 UTC - Chris Bartholomew: @Aaron Not that I know of. I do this programmatically by force deleting all the topics in the namespace and then deleting the namespace. I think you can still run into issues if clients are connected and allowAutoTopicCreation is on. Then you have a race to delete the namespace before the client can recreate the topic.
----
2019-06-27 14:26:43 UTC - Aaron: Okay, thanks.
----
2019-06-27 14:26:55 UTC - Sijie Guo: I would suggest Avro.
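The approach Chris Bartholomew describes for Aaron's question (force delete every topic, then delete the namespace) could be sketched roughly as below. This is only a sketch under assumptions: `PULSAR_ADMIN` and `delete_namespace_with_topics` are hypothetical names, it assumes `topics list` prints one topic per line (possibly wrapped in double quotes), and it assumes your `pulsar-admin` version accepts `--force` on `topics delete`.

```shell
# Sketch only: force-delete every topic in a namespace, then the namespace.
# PULSAR_ADMIN and delete_namespace_with_topics are hypothetical names.
PULSAR_ADMIN="${PULSAR_ADMIN:-bin/pulsar-admin}"

delete_namespace_with_topics() {
  ns="$1"   # tenant/namespace, for example public/default
  # Assumption: `topics list` prints one topic per line, possibly wrapped
  # in double quotes, which tr strips.
  "$PULSAR_ADMIN" topics list "$ns" | tr -d '"' | while read -r topic; do
    if [ -n "$topic" ]; then
      # --force closes connected producers/consumers before deleting.
      "$PULSAR_ADMIN" topics delete "$topic" --force </dev/null
    fi
  done
  # Runs after the loop, once the listed topics have been deleted.
  "$PULSAR_ADMIN" namespaces delete "$ns"
}
```

Calling `delete_namespace_with_topics public/default` would then run the three steps in order. Partitioned topics are not handled here, and the race Chris mentions still applies: a connected client can recreate a topic between the last topic delete and the namespace delete when auto topic creation is on.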
----
2019-06-27 14:32:29 UTC - Sam Leung: :thumbsup:
----
2019-06-27 14:33:13 UTC - Ryan Samo: Hey guys, I’m playing with Pulsar functions, which are super sweet, so thanks for all of the hard work! My question is about the performance of functions on bare metal hardware: are there other settings to tweak besides parallelism? I know on consumers you can change the receiverQueueSize, etc. to help with backlog and things like that, and functions are just more producers and consumers. Is there a way to make further tweaks by specifying them via the admin CLI or possibly the function yml? I'm just not seeing much in the documentation around function performance. Thanks!
----
2019-06-27 14:43:59 UTC - Ryan Samo: Maybe using the ConsumerConfig inputSpecs in the FunctionConfig class?
----
2019-06-27 14:48:56 UTC - Sijie Guo: I think it depends on the latency of executing your function. Each function instance invokes the function synchronously in one thread. So if your function takes time to process an event, you have to increase the number of instances to parallelize processing the events.
----
2019-06-27 14:51:05 UTC - Ryan Samo: OK, makes sense. I’m just trying out the exclamation function for now in general testing and seeing backlogs, so I thought maybe I would tweak it a bit: change the parallelism, make it a partitioned topic, etc.
----
2019-06-27 14:51:56 UTC - ishara: Hello, I have a little project using Pulsar, and not many people are going to use it. Can we use the standalone version or should we bring in more machines?
----
2019-06-27 14:53:12 UTC - Addison Higham: Speaking of Pulsar functions... what is the state of Pulsar functions being executed against k8s? I see options in the example config file for how to launch them in the cluster, and I also see references to it in the Pulsar function docs, but not much more than that
----
2019-06-27 14:56:23 UTC - David Kjerrumgaard: @ishara The standalone version is really intended for sandbox environments and dev work. It is fine for collaborating in those types of scenarios, but for any non-dev work I would recommend spinning up a small cluster with 2 or 3 nodes.
----
2019-06-27 14:57:11 UTC - ishara: Ok thanks for the answer :slightly_smiling_face:
----
2019-06-27 14:58:20 UTC - David Kjerrumgaard: @Addison Higham What type of details would you like to see covered in the documentation? What is missing that we can add? Thanks in advance for the feedback, as it will make our docs better
----
2019-06-27 14:58:46 UTC - Guillaume Rosauro: here it is:
```
tenant: "public"
namespace: "default"
name: "debezium-postgres-charlie"
topicName: "debezium-postgres-charlie-topic"
archive: "connectors/pulsar-io-debezium-postgres-2.4.0.nar"
parallelism: 1
configs:
  database.hostname: "localhost"
  database.port: "5432"
  database.user: "postgres"
  database.password: "postgres"
  database.dbname: "charlie_db"
  database.server.name: "dbserver1"
  schema.whitelist: "public"
  table.whitelist: "public\.(t_).*"
  pulsar.service.url: "pulsar://127.0.0.1:6650"
```
----
2019-06-27 15:01:47 UTC - Aaron: @Chris Bartholomew I was able to delete the namespace, but I am still getting IO warnings about topics in the deleted namespace. Do you know why these are showing up? The pulsar-admin CLI shows there are no topics under this namespace now.
----
2019-06-27 15:02:04 UTC - Addison Higham: Looking at the master docs, it appears to be quite a bit improved with this page: <https://pulsar.apache.org/docs/en/next/functions-runtime/> However, it is still missing from the reference section: <https://pulsar.apache.org/docs/en/next/reference-configuration/>
----
2019-06-27 15:14:02 UTC - jia zhai: @Guillaume Rosauro It seems it failed the YAML file format check
----
2019-06-27 15:14:58 UTC - jia zhai: “public\.(t_).*”
----
2019-06-27 15:20:35 UTC - Sijie Guo: @Addison Higham Most of the settings in the functions worker config are already self-explanatory.
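A note on the YAML error in this exchange: inside a double-quoted YAML scalar the backslash starts an escape sequence, and `\.` is not one that YAML defines, so the parser rejects the file before Debezium ever sees the value. Below is a minimal sketch of a config that should parse, assuming the same file name and settings Guillaume posted and changing only the `table.whitelist` line. The indentation under `configs:` is a guess at the original layout.

```shell
# Write the connector config with the regex backslash doubled, so YAML
# unescapes "\\." to the two characters \. that the regex needs.
# The quoted 'EOF' keeps the shell from touching the backslashes.
cat > debezium-postgres-charlie-config.yaml <<'EOF'
tenant: "public"
namespace: "default"
name: "debezium-postgres-charlie"
topicName: "debezium-postgres-charlie-topic"
archive: "connectors/pulsar-io-debezium-postgres-2.4.0.nar"
parallelism: 1
configs:
  database.hostname: "localhost"
  database.port: "5432"
  database.user: "postgres"
  database.password: "postgres"
  database.dbname: "charlie_db"
  database.server.name: "dbserver1"
  schema.whitelist: "public"
  table.whitelist: "public\\.(t_).*"
  pulsar.service.url: "pulsar://127.0.0.1:6650"
EOF
```

Single quotes (`'public\.(t_).*'`) should work as well, since YAML does not process escapes in single-quoted scalars. The file can then be passed to the same `bin/pulsar-admin source localrun --sourceConfigFile debezium-postgres-charlie-config.yaml` command shown earlier.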
----
2019-06-27 15:24:42 UTC - Addison Higham: Yeah, I just wasn't seeing any docs or any options because I was looking at the 2.3.2 page initially; switching to the master docs greatly improves the situation :slightly_smiling_face:
+1 : David Kjerrumgaard
----
2019-06-27 15:27:06 UTC - Aaron: It appears that the broker is doing internal topic lookups on topics that don't exist anymore
----
2019-06-27 18:24:49 UTC - Chris Bartholomew: @Aaron Do you have clients that are still out there trying to produce and consume on the namespace? I did a quick test where I connected a consuming client, then ran my delete routine. I didn't see any IO warnings, but my delete routine also deletes the tenant, so it's not an apples to apples comparison.
----
2019-06-27 18:29:07 UTC - Aaron: Yes, that's what it was. Thanks for your help.
+1 : Chris Bartholomew
----
2019-06-27 19:58:16 UTC - Aaron: When authorization and authentication are turned on, is there a way to set the user role for an unauthenticated user (i.e. one that connects via the regular 6650 port)? The anonymousUserRole field seems to only go into effect if authentication is disabled.
----
2019-06-27 19:59:35 UTC - Matteo Merli: The port you use, 6650 for unencrypted or 6651 for TLS, is not strictly tied to whether the client is passing credentials
----
2019-06-27 20:01:15 UTC - Matteo Merli: `anonymousUserRole=anonymous` will treat every unauthenticated user as if it had passed credentials with principal `anonymous`. You can then grant permission to `anonymous` to perform certain actions (e.g. produce/consume on certain namespaces)
----
2019-06-27 20:04:41 UTC - David Fisher: I have a question about tenants and clusters. Suppose I create a private Pulsar cluster for a particular tenant and then want to join that tenant to a much, much larger multi-tenant multi-cluster in the cloud. Does that just "work"?
----
2019-06-27 20:05:25 UTC - Aaron: I tried that, and I get lots of "Role null is not allowed to lookup topic" and "Failed to authorized null on cluster persistent://public/default/rando". The namespace public/default shows that anonymous has permission to produce/consume on the namespace.
----
2019-06-27 20:09:43 UTC - David Fisher: Or is a PIP/PR required?
----
2019-06-27 20:11:22 UTC - Guillaume Rosauro: OK, I have found the problem: I must escape the string like this “public\\.(t_).*”
100 : Sijie Guo
+1 : Sijie Guo
----
2019-06-27 20:25:12 UTC - David Kjerrumgaard: @David Fisher There are a couple of options in such a scenario, depending upon what you are looking to accomplish. One approach would be to add the "private" cluster and the larger Pulsar cluster to the same Pulsar instance. Another approach would be to migrate the existing data into the larger cluster (and eventually retire/repurpose the smaller cluster) and segment off access to that data via normal namespace-level access control policies
----
2019-06-27 20:25:57 UTC - David Kjerrumgaard: @David Fisher But it really depends on your envisioned "end state" and how isolated you want the private cluster's data to be.
----
2019-06-27 20:27:57 UTC - David Fisher: That's not what I have in mind at all. I mean for a private cluster to be exchanging messages with the larger cloud cluster. The private cluster would be for Tenant0 and the cloud for Tenant0-1000.
----
2019-06-27 20:28:51 UTC - David Fisher: The second part of the use case is that the two clusters may become detached for days due to network disruptions, like from a wildfire.
----
2019-06-27 20:30:25 UTC - David Fisher: An analogy would be how Lotus Notes worked 25 years ago with modem-based synchronization
----
2019-06-27 20:31:02 UTC - David Kjerrumgaard: @David Fisher So the "private" cluster would act in a "store and forward" mode to the larger cluster? Would the communication be bi-directional?
----
2019-06-27 20:31:24 UTC - David Fisher: Yes, and ideally yes.
----
2019-06-27 20:33:21 UTC - David Kjerrumgaard: @David Fisher In that case, adding the "private" cluster and the larger Pulsar cluster to the same Pulsar instance (where a Pulsar instance is just a collection of multiple Pulsar clusters) would be the best approach, as it would allow you to enable geo-replication of the topic data.
----
2019-06-27 20:35:47 UTC - David Fisher: @David Kjerrumgaard I thought so. Is the "private" cluster automatically limited to a particular tenant's namespaces and topics?
----
2019-06-27 20:38:32 UTC - David Kjerrumgaard: @David Fisher No, that is the other tricky part. A Pulsar instance has a separate metastore that tracks topics, namespaces and access policies across the multiple clusters. This is what enables geo-replication of a topic, as the "instance" is aware of the topic and namespace, not just an individual cluster.
----
2019-06-27 20:39:20 UTC - David Kjerrumgaard: @David Fisher But if you define the access policies correctly, and use a central authentication provider, then you can limit access to the data.
----
2019-06-27 20:40:27 UTC - David Kjerrumgaard: Give the "edge" private cluster user write-only access on a topic and allow a different user, who would access the data from the cloud-based cluster, read access, etc.
----
2019-06-27 20:42:11 UTC - David Fisher: @David Kjerrumgaard Even if the access policies limit things, does all of the topic data (BK) flow back and forth or just the metastore (ZK)?
----
2019-06-27 20:45:38 UTC - David Kjerrumgaard: @David Fisher The metastore data for the Pulsar instance would be stored in a single ZK quorum and BOTH clusters would have read access to that data. They would read the data to enforce client access and implement replication. Anyone with admin credentials would be able to modify those policies using the pulsar-admin CLI, etc.
----
2019-06-27 20:47:04 UTC - David Kjerrumgaard: @David Fisher The message data would be replicated (copied) between the "edge" cluster's storage layer and the cloud cluster's storage layer. It would be under the same topic/namespace, so clients on the cloud can access it IF they have sufficient permission
----
2019-06-27 20:49:55 UTC - David Fisher: @David Kjerrumgaard Up here in wildfire country our networks will break. We lost internet for 3 days in the Tubbs fire in October 2017. So, I have concerns about the ZK portion.
----
2019-06-27 20:51:49 UTC - David Kjerrumgaard: @David Fisher That makes sense. I faced similar issues with a customer who had mining operations across Australia. These were very remote locations and the space for the entire hardware deployment on-site was no bigger than a coat closet. :smiley:
----
2019-06-27 20:52:10 UTC - David Fisher: Several years ago we had bad issues with ZK breakage with outages only at the <10 minute level.
----
2019-06-27 20:52:49 UTC - David Fisher: In the same datacenter (thank you VMware migrations)
----
2019-06-27 20:53:35 UTC - David Kjerrumgaard: The key to making that work was having one local ZK node in the global quorum co-located at the same site. Then we just had to accept the fact that "eventually" consistent meant in 3-4 days.....and plan for those scenarios where the data wasn't fresh for that period
----
2019-06-27 20:54:12 UTC - David Kjerrumgaard: Since the management of the policies was done at HQ, I suggested using a secondary communication channel to communicate the changes.....
----
2019-06-27 20:55:02 UTC - David Kjerrumgaard: The best, low-tech solution turned out to be a phone.......we just called the admin on-site and had them check for the replicated changes. If they didn't see them, they were authorized to make them locally on the ZK node :smiley:
----
2019-06-27 20:55:31 UTC - David Kjerrumgaard: not sexy, but effective
----
2019-06-27 20:57:06 UTC - David Kjerrumgaard: Bear in mind, these are policy changes, so they change very infrequently
----
2019-06-27 20:57:49 UTC - David Kjerrumgaard: We are NOT talking about BK placement data. That is kept in a separate ZK cluster on EACH cluster.
----
2019-06-27 20:58:04 UTC - David Fisher: @David Kjerrumgaard OK, an SOP would be provided to put in any needed ZK change, but that would be unlikely, as in that scenario any changes would either not be allowed or could wait.
----
2019-06-27 20:59:00 UTC - David Kjerrumgaard: @David Fisher Correct, these changes were often a result of someone requesting access to the data, or a security audit finding an issue
----
2019-06-27 20:59:27 UTC - David Fisher: @David Kjerrumgaard Yeah I know BK is ZK + DistributedLog
+1 : David Kjerrumgaard
----
2019-06-27 21:01:31 UTC - David Fisher: Thanks for the discussion, it helps me understand how to architect this use case. About a year ago, someone asked for Pulsar to do something special for the edge. I think it is just a question of cooking up a scenario like we just discussed.
----
2019-06-27 21:01:50 UTC - David Kjerrumgaard: No problem, glad to help.......and good luck
----
2019-06-27 21:03:07 UTC - David Fisher: I don't know if the guy I was talking to will proceed, but this is certainly a common situation.
----
2019-06-27 21:03:51 UTC - Aaron: @Matteo Merli any ideas?
----
2019-06-27 21:16:39 UTC - Matteo Merli: > Role null
That shouldn’t happen if `anonymousUserRole=anonymous` is set in `broker.conf`
----
2019-06-27 21:27:51 UTC - Aaron: I have it set in both the broker.conf and standalone.conf, and I am running the standalone.
----
2019-06-28 03:25:39 UTC - Addison Higham: hrm... trying to understand PIP-36 (<https://github.com/apache/pulsar/pull/4247>). There is also PIP-37, which works by chunking messages into many pieces; PIP-36, however, works by making messages potentially unbounded, but I imagine it has a practical limit based on other constraints on the system (i.e. bookkeeper segment size, message receive buffer, etc.), and it may also have serious performance impacts. Curious if there is any idea of what a practical limit is... like would 50 MB be reasonable?
----
2019-06-28 03:27:25 UTC - Sijie Guo: @Addison Higham PIP-36 is just to make the settings configurable. It doesn’t suggest setting a very high value, as that is not reasonable. PIP-37 should be the one for supporting large message sizes.
----
2019-06-28 03:27:39 UTC - Addison Higham: :thumbsup: that is what I assumed
----
2019-06-28 03:27:54 UTC - Addison Higham: Curious if you have any thoughts as to how far we can get with PIP-36, though
----
2019-06-28 03:28:24 UTC - Addison Higham: I don't need anything huge... but thinking of a CDC use case where I can get some big rows that might be like... 15 MB or so with some large values
----
2019-06-28 03:32:03 UTC - Sijie Guo: PIP-36 is for getting around issues in CDC
----
2019-06-28 03:32:26 UTC - Sijie Guo: there will be some changes that are very large
----
2019-06-28 03:32:27 UTC - Addison Higham: so 10 - 15 MB seems perhaps reasonable?
----
2019-06-28 03:32:42 UTC - Sijie Guo: Yes, 15 MB is reasonable
----
2019-06-28 03:34:18 UTC - Addison Higham: Okay, I am currently porting some of my own Debezium code (which writes to Kinesis and gets around Kinesis limits with a claim check pattern that writes the real object to S3), and deciding if I should keep that or just dump it
----
2019-06-28 03:42:45 UTC - Sijie Guo: @Addison Higham If you are sure of the maximum cap, you can probably adjust it. But if you are not very sure, I would suggest writing to S3.
----
2019-06-28 03:43:13 UTC - Addison Higham: That is kinda where I am leaning since I already have it implemented...
+1 : Sijie Guo
----
2019-06-28 03:45:57 UTC - Addison Higham: We have made a lot of contributions back into Debezium (particularly psql), and while I think we will probably just build on top of the `KafkaConnectSource` adapter to enable things like the claim checking, I would be curious to see if there is anything that Pulsar would be interested in upstreaming
----
2019-06-28 04:17:56 UTC - Ritesh Chandra Nailwal: Question: My question is not about the number of Pulsar brokers required; instead I was asking about the pulsar-daemon (apache-pulsar-2.3.2-bin.tar.gz). Suppose I have 3 ZK nodes: should I install the pulsar-daemon on each of the 3 ZooKeeper nodes and configure ZK in the zookeeper.conf of each pulsar-daemon (each pulsar-daemon conf file having a single ZooKeeper server config), or should I have one installation of pulsar-daemon and specify all 3 ZooKeeper configs in one zookeeper.conf? This is the type of information I am not able to find.
----
