2019-02-08 09:18:33 UTC - Niketh Sabbineni: @Sijie Guo  Thank you :ok_hand:
----
2019-02-08 13:46:04 UTC - Stephen Dotz: @Stephen Dotz has joined the channel
----
2019-02-08 13:51:31 UTC - Stephen Dotz: Hi! We're using Apache Kafka for our 
project, and there are a number of features Pulsar offers which we really need: 
geo-replication, a protobuf schema registry, tiered storage. One thing I gathered 
from reading through the docs is that this is a very similar project to Kafka, 
but Kafka is seldom mentioned. Lots of features are 1:1 between them with 
different names. I am wondering if there is some reading I can do, or if 
somebody can elaborate on the primary differences, or why a separate project 
was created. Not trying to create a this vs. that, just trying to understand 
more about the two projects, the primary differences, and the motivations behind 
Pulsar. Thanks!
----
2019-02-08 13:54:52 UTC - Sijie Guo: Yahoo’s blog post is a good read on the 
motivation: 
<https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale>

For the differences, you can check out:

- <https://streaml.io/blog/pulsar-streaming-queuing>
- <https://streaml.io/blog/pulsar-segment-based-architecture>
- <https://streaml.io/blog/apache-pulsar-architecture-designing-for-streaming-performance-and-scalability>
----
2019-02-08 14:00:03 UTC - Stephen Dotz: Perfect! thanks! Will read these and 
get back with any questions
----
2019-02-08 14:06:59 UTC - Sijie Guo: @Stephen Dotz feel free to ping us any time
----
2019-02-08 14:29:49 UTC - Matan Shalit: @Matan Shalit has joined the channel
----
2019-02-08 14:38:38 UTC - Matan Shalit: hey :slightly_smiling_face:
I need to create a "statistics" service of sorts; this service needs to read 
all messages that were accumulated in the last week for a topic.
I have found a way to start a reader at a specific message ID, but the thing is 
that I have no way of knowing what the message ID was from 7 days back.

Is there a way to create a reader that starts consuming messages by timespan, 
or by timestamp?
Any help appreciated! :innocent:
----
2019-02-08 14:44:55 UTC - Stephen Dotz: This is a long shot, but one of the 
biggest things we struggle with when using Kafka is the difficulty of swapping 
out topics from underneath clients. E.g. if we want to re-write a topic to remove 
some data for a GDPR request, we could stream the topic into another and filter 
it, but we also have to coordinate with consumers to switch their topic to the 
new one and find their new offset. There are other scenarios where we want to 
re-write history too, just to clean the data up. Does Pulsar offer anything 
that could potentially make that easier?
----
2019-02-08 15:20:13 UTC - Sijie Guo: currently I don’t think there is a direct 
time-based “seek” API in either the consumer or the reader. We might consider 
adding one.

As a workaround, you can use the consumer API to subscribe at the latest 
position, then use the pulsar-admin tool to reset the cursor by time:

<http://pulsar.apache.org/docs/en/admin-api-persistent-topics/#reset-cursor>

or use the resetCursor() API in PulsarAdmin
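
For reference, a minimal sketch of that workaround with the Java admin client 
(the service URL, topic, and subscription names here are placeholders):
```
import org.apache.pulsar.client.admin.PulsarAdmin;

public class ResetCursorByTime {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Rewind the subscription to the first message published
            // after this timestamp (7 days ago, in epoch millis).
            long sevenDaysAgo = System.currentTimeMillis() - 7L * 24 * 3600 * 1000;
            admin.topics().resetCursor(
                    "persistent://public/default/stats-topic",  // placeholder topic
                    "stats-subscription",                       // placeholder subscription
                    sevenDaysAgo);
        }
    }
}
```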
----
2019-02-08 15:21:18 UTC - Stephen Dotz: Does anyone know if PulsarSQL can be 
used to query against protobuf schemas?
----
2019-02-08 15:24:24 UTC - Sijie Guo: currently PulsarSQL only supports JSON and 
Avro.

But Pulsar supports protobuf schemas, so it should be easy to add protobuf 
support in PulsarSQL. /cc @Jerry Peng (in case he has more to add)
----
2019-02-08 15:48:23 UTC - Stephen Dotz: I haven't tried to make this work, but 
we did look into it briefly w/ KSQL. I thought that since PBs map to and from 
JSON deterministically it would be a matter of plugging in a Ser/De, but it 
wasn't that simple (nothing ever is)
----
2019-02-08 15:52:54 UTC - Stephen Dotz: However :crossed_fingers: that it _is_ 
that simple w/ this
----
2019-02-08 15:59:10 UTC - Matan Shalit: Thx, I'll give the reset-cursor command 
a shot. :slightly_smiling_face:

Having this as part of the consumer/reader interface would indeed be great. 
Should I open a suggestion/ticket somewhere for it?
----
2019-02-08 16:29:05 UTC - Vincent Ngan: Is it possible to start the standalone 
Pulsar server programmatically? I need to run unit tests of my application with 
the Pulsar server running. Can I start an embedded Pulsar server from my unit 
test code and shut it down when the tests finish?
----
2019-02-08 16:30:15 UTC - Grant Wu: That sounds possible, yes.  I actually do 
that.
----
2019-02-08 16:31:31 UTC - Vincent Ngan: Can you show me how to do it?
----
2019-02-08 16:31:43 UTC - Grant Wu: It really depends on your CI setup
----
2019-02-08 16:31:55 UTC - Grant Wu: We use GitLab CI
----
2019-02-08 16:32:13 UTC - Grant Wu: We start the image using
```
  services:
    - name: ${PULSAR_TESTING_IMAGE}
      alias: pulsar-broker
```
(which is actually a deprecated legacy feature…)
----
2019-02-08 16:32:33 UTC - Grant Wu: ```
ARG DEV_REGISTRY
ARG BASE_IMAGE
FROM ${DEV_REGISTRY}/${BASE_IMAGE}
EXPOSE 6650 8080
RUN bin/apply-config-from-env.py conf/standalone.conf && bin/apply-config-from-env.py conf/pulsar_env.sh
CMD ["bin/pulsar", "standalone"]
```
----
2019-02-08 16:32:35 UTC - Grant Wu: This is our Dockerfile
----
2019-02-08 16:33:04 UTC - Grant Wu: ```
docker-ci:
        $(DOCKER) build --tag $(PACKAGE_REPOSITORY):$(GIT_HEAD_SHA1) --label 
wraps=$(PULSAR_IMAGE) --build-arg DEV_REGISTRY=$(DEV_REGISTRY) --build-arg 
BASE_IMAGE=$(PULSAR_IMAGE) ci
```
and the Makefile command we use to build it
----
2019-02-08 16:33:42 UTC - Grant Wu: DEV_REGISTRY is our internal Dockerhub, 
PACKAGE_REGISTRY is just the path for the name of this image internally, 
PULSAR_IMAGE is… pulsar:2.1.0 :man-facepalming:(I really need to update this)
----
2019-02-08 16:34:05 UTC - Grant Wu: Basically this is more of a roll-your-own 
thing
----
2019-02-08 16:36:22 UTC - Matteo Merli: @Stephen Dotz we have it in the roadmap 
though we haven’t got to it yet. 

The main difference with protobuf is that it’s based on having the generated 
code available. 

In the case of Pulsar SQL we won’t have the generated protobuf code in the 
engine, though we have added preliminary support in our schema definition to 
include a mapping between the field names and the proto field tags. 

That will allow one to “manually” parse the serialized protobuf and convert it 
into a generic map object that can be fed into Presto. 
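
To make that concrete, here is a rough sketch of the "generic map" idea using 
protobuf's UnknownFieldSet, which can decode a serialized message without 
generated classes (the name-to-tag mapping below is a hypothetical stand-in for 
what the schema definition would carry):
```
import com.google.protobuf.UnknownFieldSet;
import java.util.HashMap;
import java.util.Map;

public class ProtoToGenericMap {
    // Hypothetical field-name/tag mapping, standing in for what a Pulsar
    // protobuf schema definition would carry alongside the payload.
    static final Map<Integer, String> FIELD_NAMES = Map.of(1, "user_id", 2, "event");

    static Map<String, Object> toRow(byte[] serialized) throws Exception {
        Map<String, Object> row = new HashMap<>();
        // Without generated classes, every field is exposed as an
        // "unknown" field keyed by its tag number.
        UnknownFieldSet fields = UnknownFieldSet.parseFrom(serialized);
        fields.asMap().forEach((tag, field) -> {
            String name = FIELD_NAMES.getOrDefault(tag, "field_" + tag);
            // Only varint and length-delimited wire types are handled here;
            // a real decoder would also cover fixed32/fixed64 etc.
            if (!field.getVarintList().isEmpty()) {
                row.put(name, field.getVarintList().get(0));
            } else if (!field.getLengthDelimitedList().isEmpty()) {
                row.put(name, field.getLengthDelimitedList().get(0).toStringUtf8());
            }
        });
        return row;
    }
}
```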
----
2019-02-08 16:39:27 UTC - Vincent Ngan: Your suggestion is to configure the CI 
build scripts to start the server. I know that this should work, but I am 
looking for an even leaner approach: starting the Pulsar server as an embedded 
server from my Java unit test code during the setup stage of my test suite and 
shutting it down in the wrap-up stage.
----
2019-02-08 16:40:32 UTC - Grant Wu: oh, I see
----
2019-02-08 16:40:34 UTC - Grant Wu: I have no idea then
----
2019-02-08 16:41:07 UTC - Vincent Ngan: I am looking for something like this: 
<https://github.com/salesforce/kafka-junit>
----
2019-02-08 16:41:09 UTC - Grant Wu: Didn’t think about the implications of 
“from my unit test code”
----
2019-02-08 16:49:19 UTC - Vincent Ngan: For Kafka, there are quite a lot of 
embedded solutions, e.g. <https://dzone.com/articles/unit-testing-of-kafka>
<https://github.com/tuplejump/embedded-kafka>
<https://github.com/Mayvenn/embedded-kafka>
<https://objectpartners.com/2017/10/25/functional-testing-with-embedded-kafka/>
----
2019-02-08 16:50:52 UTC - Vincent Ngan: Is there anyone doing this for Pulsar?
----
2019-02-08 16:58:04 UTC - Sijie Guo: try 
<https://github.com/apache/pulsar/blob/133331c15587400b8d7846ff1ca84cf3a8802565/pulsar-broker/src/main/java/org/apache/pulsar/PulsarStandaloneStarter.java>
----
2019-02-08 16:58:32 UTC - Sijie Guo: if you can file an issue for that, it will 
be really great!
----
2019-02-08 17:29:44 UTC - Chris DiGiovanni: I'm doing some testing with Pulsar 
and I am using StatefulSets.  I bounced my two bookies in K8s and now things 
are just messed up, mostly due to the identification via IP, which is a problem 
with StatefulSets vs. DaemonSets.  I deleted the journal and ledgers PVCs so 
they are now clean for both bookies.  I also added the config option 
useHostNameAsBookieID: "true", though I'm still getting issues around the old 
cluster, etc. How do I get Pulsar/ZooKeeper to forget about the old nodes 
and continue w/ the new ones?
----
2019-02-08 17:32:24 UTC - Emma Pollum: That is the host bookie
----
2019-02-08 17:40:13 UTC - Ali Ahmed: @Vincent Ngan here is an example 
<https://github.com/streamlio/pulsar-embedded-tutorial/blob/master/src/test/java/org/apache/pulsar/PulsarEmbeddedTest.java>
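
For reference, a minimal sketch of what that looks like, assuming the 
PulsarStandaloneBuilder API from the pulsar-broker module (available in Pulsar 
2.2+; lifecycle details may vary by version):
```
import org.apache.pulsar.PulsarStandalone;
import org.apache.pulsar.PulsarStandaloneBuilder;

public class EmbeddedPulsarHarness {
    private PulsarStandalone standalone;

    // Call from the setup stage of the test suite.
    public void setUp() throws Exception {
        standalone = PulsarStandaloneBuilder.instance().build();
        standalone.start();  // broker on 6650/8080 by default
    }

    // Call from the wrap-up stage.
    public void tearDown() throws Exception {
        standalone.close();
    }
}
```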
----
2019-02-08 17:41:05 UTC - Chris DiGiovanni: It also looks like 
useHostNameAsBookieID: "true" in my configmap didn't have the desired effect 
after rebuilding the whole cluster.

```
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException: Cookie [4
bookieHost: "10.250.10.238:3181"
journalDir: "data/bookkeeper/journal"
ledgerDirs: "1\tdata/bookkeeper/ledgers"
instanceId: "6c5247f4-da4c-4414-bc85-cd9cf0246093"
] is not matching with [4
bookieHost: "10.250.10.237:3181"
journalDir: "data/bookkeeper/journal"
ledgerDirs: "1\tdata/bookkeeper/ledgers"
instanceId: "6c5247f4-da4c-4414-bc85-cd9cf0246093"
```
----
2019-02-08 17:43:58 UTC - Matteo Merli: @Stephen Dotz Not 100% sure of the 
context, but the fact that subscriptions are fully managed by Pulsar makes it 
easier for consumers: they don’t have to “think” about offsets, they just 
attach to a subscription on a topic and it will be positioned in the right place.
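
To illustrate, a minimal consumer sketch with the Java client (service URL, 
topic, and subscription names are placeholders); note there is no client-side 
offset handling anywhere:
```
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class AttachToSubscription {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
             Consumer<byte[]> consumer = client.newConsumer()
                     .topic("persistent://public/default/my-topic")  // placeholder
                     .subscriptionName("my-subscription")            // placeholder
                     .subscribe()) {
            // The broker resumes this subscription from wherever it left
            // off; there is no offset bookkeeping on the client side.
            Message<byte[]> msg = consumer.receive();
            consumer.acknowledge(msg);
        }
    }
}
```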
----
2019-02-08 17:59:20 UTC - Stephen Dotz: It does seem easy in the API to 
manipulate the consumer's offset, but is there any way to also manipulate their 
topic? Or the contents of the topic? Like being able to atomically rename a 
topic and have their subscription continue from a specified offset (we would 
probably handle translating the offset)
----
2019-02-08 18:00:01 UTC - Stephen Dotz: In essence, once we say to a consumer 
"consume your data from topic xyz", we can never re-write or clean that topic 
without coordinating with the consumers to start consuming some different topic
----
2019-02-08 18:02:00 UTC - Matteo Merli: I see, so the filtering can always 
happen after-the-fact, right?
----
2019-02-08 18:03:30 UTC - Matteo Merli: I don’t think there’s a direct way to 
do exactly that right now, though some time back we were thinking of adding a 
way to force producers/consumers to reconnect to a different topic, or even a 
different serviceURL.
----
2019-02-08 18:03:51 UTC - Stephen Dotz: Yeah that would be a really interesting 
feature to me
----
2019-02-08 18:04:05 UTC - Matteo Merli: Probably, this kind of mechanism will 
allow for that operation
----
2019-02-08 18:07:51 UTC - Stephen Dotz: So far we've considered doing sort of a 
topic-name translation inside of a proxy layer that sits in front of Kafka. 
It would also handle mapping their offset in one topic to a sensible offset in 
the other. It's a super rough POC and I'm not sure how well it will pan out yet.
----
2019-02-08 18:14:58 UTC - Matteo Merli: Thinking out loud, one other way could 
be to have some sort of “custom” compaction filter, to filter out the messages 
in place.
----
2019-02-08 18:22:15 UTC - Matteo Merli: @Chris DiGiovanni Yes, setting up 
stateful services in K8S is a fun experience :slightly_smiling_face:

Bookies need to advertise a specific “address” because the data they contain is 
already set in the metadata as available at that specific address. That’s the 
reason for the cookie checks.

To wipe out a bookie, you can use:

```
bin/bookkeeper shell bookieformat -force -deleteCookie
```
----
2019-02-08 18:22:27 UTC - Matteo Merli: (From the bookie you want to wipe)
----
2019-02-08 18:23:54 UTC - Matteo Merli: You can also check an example of bookie 
deployment with stateful sets in 
<https://github.com/apache/pulsar/blob/master/deployment/kubernetes/google-kubernetes-engine/bookie.yaml>
----
2019-02-08 18:30:45 UTC - Stephen Dotz: Haha, we have a plan to do something 
like that for GDPR requests. Basically we uniquely key every single message, so 
if we need to remove one, we can re-use the key and have it compacted away
----
2019-02-08 18:31:57 UTC - Stephen Dotz: What is a custom compaction filter 
though? That sounds a bit cooler
----
2019-02-08 18:32:35 UTC - Matteo Merli: Pass application defined logic to 
decide whether to keep or drop a particular message
----
2019-02-08 18:32:52 UTC - Matteo Merli: So that you don’t have to keep a unique 
key to refer to these messages
----
2019-02-08 18:33:45 UTC - Matteo Merli: Notice: this is not something that’s 
available today, though it’s not a long stretch either
----
2019-02-08 18:34:03 UTC - Stephen Dotz: ah okay I was searching around for it. 
That sounds like a cool idea
----
2019-02-08 18:35:02 UTC - Stephen Dotz: I wonder if this mechanism could be 
used to edit the data in place, rather than tombstoning it?
----
2019-02-08 18:35:47 UTC - Matteo Merli: Since the data gets rewritten when 
compaction runs, I guess that would be possible
----
2019-02-08 18:36:57 UTC - Matteo Merli: Right now, compaction in Pulsar can be 
run with a CLI tool, even from outside a Pulsar broker. That’s why I’m 
thinking that passing a custom Filter/Modifier class would be easy to do.
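
For what it's worth, compaction can also be triggered through the Java admin 
client; a minimal sketch (topic name is a placeholder):
```
import org.apache.pulsar.client.admin.PulsarAdmin;

public class TriggerCompaction {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            String topic = "persistent://public/default/my-topic";  // placeholder
            // Ask the broker to compact this topic, then report the
            // long-running process status.
            admin.topics().triggerCompaction(topic);
            System.out.println(admin.topics().compactionStatus(topic).status);
        }
    }
}
```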
----
2019-02-08 18:38:07 UTC - Stephen Dotz: Interesting. does the compactor work 
directly with bookkeeper API? Or perhaps the physical log format?
----
2019-02-08 18:47:32 UTC - Matteo Merli: Yes, it reads from the broker and 
stores the compacted data in BookKeeper, sending back a pointer to the 
compacted ledger
----
2019-02-08 19:07:50 UTC - Chris DiGiovanni: Thanks I'll take a look.
----
2019-02-08 20:38:37 UTC - Stephen Dotz: That makes sense. Thanks. Is there any 
GH issue or something I can follow? I'd like to try it as soon as it is ready, 
whenever that is.
----
2019-02-08 20:52:35 UTC - Matteo Merli: It was only on internal roadmaps 
:slightly_smiling_face: Just created issue: 
<https://github.com/apache/pulsar/issues/3554>
+1 : Sijie Guo
----
2019-02-08 22:04:30 UTC - Grant Wu: Is there a way to disable logging from the 
C++ client (from within the Python client wrapper)?
----
2019-02-08 22:04:49 UTC - Grant Wu: Apparently its log format causes Apache 
Livy to misbehave :man-facepalming:
----
2019-02-08 22:05:45 UTC - Matteo Merli: Not from Python unfortunately
----
2019-02-08 22:06:02 UTC - Matteo Merli: It’s possible to pass a logger function 
in C++ and Go clients
----
2019-02-08 22:06:28 UTC - Matteo Merli: it would be easy to also add for 
Python, but it’s not there at this moment
----
