Slack digest for #general - 2018-10-02

Apache Pulsar Slack Tue, 02 Oct 2018 02:12:02 -0700

2018-10-01 11:30:42 UTC - Jean-Bernard van Zuylen: @Sijie Guo Pull request 
available: <https://github.com/apache/pulsar/pull/2690>
----
2018-10-01 14:54:19 UTC - Sijie Guo: :+1: 
----
2018-10-01 15:23:02 UTC - Grant Wu: I am confused as to why people are worried 
about the size of the docker image.
----
2018-10-01 15:23:21 UTC - Grant Wu: There is no docker image for Functions, as 
I understand it.  There’s a docker image for Pulsar.  But Pulsar does not 
invoke a Docker image for functions.
----
2018-10-01 15:23:30 UTC - Grant Wu: I am also confused as to why people are 
concerned about the overhead of function calls.
----
2018-10-01 15:23:46 UTC - Grant Wu: I feel like any such overhead is dwarfed by 
wanting to use higher level languages like JavaScript
----
2018-10-01 15:26:14 UTC - Grant Wu: Or, hell, the overhead from doing IPC
----
2018-10-01 15:30:01 UTC - Matteo Merli: There’s no IPC involved though, the 
runtime has a consumer/producer instance in the function process. It’s the same 
as using the pub-sub API directly.
----
2018-10-01 15:30:39 UTC - Grant Wu: Er, doesn’t that still need to talk to 
Pulsar
----
2018-10-01 15:30:52 UTC - Grant Wu: Over the binary protocol
----
2018-10-01 15:31:15 UTC - Grant Wu: Like, directly using the pub-sub API 
requires talking over a protocol
----
2018-10-01 15:31:50 UTC - Matteo Merli: Sure, but that’s already optimized for 
high throughput through batching
----
2018-10-01 15:32:08 UTC - Grant Wu: Sure, I’m just saying that the overhead of 
_invoking a function_ is likely to be minimal still
----
2018-10-01 15:32:30 UTC - Matteo Merli: And the use of flow control to push 
many messages to a consumer instance
----
2018-10-01 15:34:25 UTC - Grant Wu: Although, I don’t know much about the 
Python/Node.js interpreters, maybe invoking a function is relatively expensive. 
I doubt it, though, especially in the case of invoking the same function over 
and over
----
2018-10-01 15:44:16 UTC - Grant Wu: Is a docker container started to run the 
Pulsar function?  That’s not the impression I got, but, correct me if my 
assumption is wrong
----
2018-10-01 16:35:38 UTC - Sanjeev Kulkarni: The overhead of invoking is fairly 
small even for interpreted languages like Python.
----
2018-10-01 16:36:36 UTC - Grant Wu: I believe that too, I’m just… willing to 
entertain the possibility :stuck_out_tongue: My philosophy re: performance is 
“pls bring measurements”
----
2018-10-01 16:36:47 UTC - Sanjeev Kulkarni: However, what could be large is any 
kind of overhead associated with the function’s own logic. For instance, if the 
function is looking up a value in a database, then we need to make sure that 
the connection setup doesn’t happen on a per-message basis. This is where 
initializing those connections in constructors and reusing them in function 
code makes sense
----
2018-10-01 16:36:54 UTC - Grant Wu: That’s a good point!
----
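A minimal sketch of the "initialize once, reuse per message" point above. It 
assumes a Java function written against `java.util.function.Function`; 
`LookupClient` and `EnrichFunction` are hypothetical names, and the in-memory 
map is only a stand-in for a real database connection.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for an expensive resource such as a database connection.
final class LookupClient {
    // Counts how many times a "connection" has been opened.
    static final AtomicInteger OPENED = new AtomicInteger();

    private final Map<String, String> table = new ConcurrentHashMap<>();

    LookupClient() {
        OPENED.incrementAndGet();
        table.put("us", "United States");
        table.put("fr", "France");
    }

    String lookup(String key) {
        return table.getOrDefault(key, "unknown");
    }
}

// The client is created once, in the constructor, and reused by every
// apply() call instead of being reopened for each message.
final class EnrichFunction implements Function<String, String> {
    private final LookupClient client;

    EnrichFunction() {
        this.client = new LookupClient();
    }

    @Override
    public String apply(String countryCode) {
        return countryCode + "=" + client.lookup(countryCode);
    }
}
```

Calling `apply` many times on one `EnrichFunction` opens the stand-in 
connection only once, which is the behaviour described above.
----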
2018-10-01 16:39:00 UTC - Sanjeev Kulkarni: WRT docker, I think the point is, 
when pulsar supports submitting functions to kubernetes 
(<https://github.com/apache/pulsar/pull/1950>), then every function submission 
starts a kubernetes job which will need to download a docker image from 
somewhere to init the pods. These pods only need the stuff for running 
functions and need not have anything for running pulsar itself. Thus, it might 
make sense to have a functions-only docker image, which could be made very 
small.
----
2018-10-01 16:39:44 UTC - Grant Wu: Ah, okay, I didn’t realize that was done
----
2018-10-01 16:40:09 UTC - Grant Wu: Wouldn’t the docker images already be 
downloaded to run Pulsar though
----
2018-10-01 16:40:17 UTC - Grant Wu: i.e. wouldn’t they be available locally
----
2018-10-01 16:40:27 UTC - Grant Wu: Or am I misunderstanding the architecture
----
2018-10-01 16:41:30 UTC - Grant Wu: I guess it’s not necessarily the case
----
2018-10-01 16:41:43 UTC - Grant Wu: Because it might not be on a machine which 
has downloaded the images before
----
2018-10-01 16:42:14 UTC - Sanjeev Kulkarni: It might. I’m not sure if the 
kubernetes master caches some images locally so that they need not be 
downloaded from the internet. But even if there is caching, imagine starting a 
function with parallelism of 100, and suddenly you see 100 copies of this large 
image being copied within a local network. That might starve other 
applications, even if only momentarily.
----
2018-10-01 16:42:30 UTC - Grant Wu: I see, makes sense.
----
2018-10-01 16:45:45 UTC - Dave Southwell: Noob question.  Last week I set up a 
staging pulsar instance and today I came in to find my /tmp was full of 
./librocksdbjni956141922235128626.so type files.  Did I mis-configure 
something?  I mean clearly I have, but can someone point me in the right 
direction?
----
2018-10-01 16:47:46 UTC - Matteo Merli: Did you have any repeated crashes 
there? The `librocksdbjnixxxx.so` files are automatically extracted by the 
RocksDB jar. Under normal conditions, these files should be removed on shutdown
----
2018-10-01 16:47:58 UTC - Grant Wu: Is that related to 
<https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1538348632000100>
----
2018-10-01 16:48:52 UTC - Dave Southwell: I'll check the logs for crashes.
----
2018-10-01 17:26:06 UTC - Guillem: @Grant Wu i think a lot of questions here 
are coming from seeing the parallelism between pulsar and serverless
----
2018-10-01 17:26:32 UTC - Grant Wu: I'm seeing no parallelism
----
2018-10-01 17:26:34 UTC - Guillem: in serverless, you usually instantiate a 
container when you need to run a function, and destroy it if it hasn't been 
used for a while
----
2018-10-01 17:26:49 UTC - Guillem: as such, you want to minimize the footprint 
of the image for a number of reasons
----
2018-10-01 17:27:14 UTC - Guillem: disk usage, that potential download/caching 
of the image, time to start a container itself
----
2018-10-01 17:28:29 UTC - Grant Wu: Pulsar functions and serverless, sure
----
2018-10-01 17:28:33 UTC - Guillem: also not sure what other services may run in 
a pulsar container with the current image
----
2018-10-01 17:28:45 UTC - Guillem: so maybe there's something running even if 
you want that container for functions only
----
2018-10-01 17:28:51 UTC - Guillem: and that may be using some memory
----
2018-10-01 17:28:57 UTC - Grant Wu: Presumably nothing else would be turned on
----
2018-10-01 17:29:03 UTC - Guillem: (although this is just an assumption, i 
don't know the inner details)
----
2018-10-01 17:29:04 UTC - Matteo Merli: Nope, nothing is running in the 
container — if you don’t start it
----
2018-10-01 17:29:09 UTC - Guillem: ok, cool
----
2018-10-01 17:29:27 UTC - Guillem: then it would only be a matter of optimizing 
disk space and container startup times i guess
----
2018-10-01 17:30:04 UTC - Matteo Merli: > in serverless, you usually 
instantiate a container when you need to run a function, and destroy it if it 
hasn’t been used for a while

This is achieved through the parallelism setting of the function. It controls 
how many instances you have active. It’s still a manual setting for now.
----
2018-10-01 17:30:48 UTC - Guillem: yep, i'm aware of it @Matteo Merli and i 
think in the world of pulsar, it makes sense to keep those containers running 
unless you explicitly kill them
----
2018-10-01 17:31:10 UTC - Guillem: it's not really 1:1 with serverless, so 
keeping those containers alive for stream processing makes a lot of sense to me
----
2018-10-01 17:31:23 UTC - Guillem: it's probably more an issue of optimizing 
resources and also managing the scaling easily
----
2018-10-01 17:36:51 UTC - Guillem: so in terms of how functions work, can 
somebody tell me if my current understanding of how pulsar does it is correct?
- the worker container (i think this happens in the brokers?) will use the 
client library specific to the runtime to connect to the pulsar queue (source 
as input, sink as output)
- then the container will instantiate the class that you embed your function 
into so it does 'preload' things like what was discussed before around DB 
connections and so on
- then, when a new message is received in the pulsar source, it will be sent to 
the process() method of the instantiated class and the output piped to the 
output
- at the end, the class will still be instantiated and waiting for new messages 
to arrive to call the process() method again
----
2018-10-01 17:40:48 UTC - Sanjeev Kulkarni: It’s a little different from that. 
This is a brief summary of the workflow
----
2018-10-01 17:41:55 UTC - Sanjeev Kulkarni: 1. User submits a pulsar function 
using the REST API. The REST call can be serviced by a broker that has the 
functions_worker config enabled. Or it could go to a server that is dedicated 
to handling function requests
----
2018-10-01 17:43:16 UTC - Sanjeev Kulkarni: 2. Function workers are configured 
to use some kind of runtime: the thread runtime (applicable to Java functions 
only), the process runtime, or the kubernetes runtime. Depending on the runtime 
and the function, the worker ensemble collectively starts the requested number 
of function instances among its members
----
2018-10-01 17:44:08 UTC - Sanjeev Kulkarni: 3. The runtime decides what kind of 
action to take. The thread runtime just starts a new thread to service the 
function. The process runtime starts a new process, and the kubernetes runtime 
launches a k8s job
----
2018-10-01 17:44:49 UTC - Sanjeev Kulkarni: 4. What is started by these 
runtimes is a function instance that is nothing but a wrapped (producer -> 
function -> consumer) application.
----
2018-10-01 17:46:20 UTC - Sanjeev Kulkarni: As such, the producer/consumer of 
the function instance gets the data directly from Pulsar using the Pulsar API
----
2018-10-01 17:46:52 UTC - Sanjeev Kulkarni: that is different from the usual 
serverless architecture where the producer sits outside the serverless function 
and pipes the data to it
----
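A rough sketch of the "wrapped" function instance described in the workflow 
above, in the style of the thread runtime. This is not Pulsar's actual code: 
`FunctionInstance` is a hypothetical name, and the two in-memory queues are 
stand-ins for the consumer on the input topic and the producer on the output 
topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// One function instance: read a message, apply the user function, write the result.
final class FunctionInstance<I, O> implements Runnable {
    private final BlockingQueue<I> input;   // stand-in for the input-topic consumer
    private final BlockingQueue<O> output;  // stand-in for the output-topic producer
    private final Function<I, O> function;  // the user-supplied function

    FunctionInstance(BlockingQueue<I> input, BlockingQueue<O> output, Function<I, O> function) {
        this.input = input;
        this.output = output;
        this.function = function;
    }

    @Override
    public void run() {
        try {
            // The instance stays alive between messages; it is not torn down per call.
            while (!Thread.currentThread().isInterrupted()) {
                I message = input.take();
                O result = function.apply(message);
                if (result != null) {
                    output.put(result);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop requested
        }
    }

    // Demo helper: runs one instance on its own thread, feeds it the given
    // messages, and collects one result per message (assumes non-null results).
    static <I, O> List<O> process(List<I> messages, Function<I, O> function) {
        BlockingQueue<I> in = new LinkedBlockingQueue<>(messages);
        BlockingQueue<O> out = new LinkedBlockingQueue<>();
        Thread worker = new Thread(new FunctionInstance<>(in, out, function));
        worker.start();
        List<O> results = new ArrayList<>();
        try {
            for (int i = 0; i < messages.size(); i++) {
                O result = out.poll(5, TimeUnit.SECONDS);
                if (result == null) {
                    break; // timed out waiting for the instance
                }
                results.add(result);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            worker.interrupt(); // shut the instance down
        }
        return results;
    }
}
```

The point of the sketch is that the reading and writing live inside the same 
unit as the function, rather than in an outside component that pipes data in.
----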
2018-10-01 18:41:32 UTC - Nicolas Ha: Is there a web healthcheck endpoint for 
Pulsar? Or just an endpoint that responds when the broker is alive?
This would be useful for CI/Healthcheck. It would be awesome if it conformed 
to the Kubernetes livenessProbe too
----
2018-10-01 18:41:59 UTC - Nicolas Ha: Pretty sure I asked a while back and 
there wasn’t one - if that’s not the case should I create a ticket?
----
2018-10-01 18:46:17 UTC - Ali Ahmed: @Nicolas Ha yes, there is; you can use 
“http://{broker-host}:8080/admin/brokers/configuration”
----
2018-10-01 18:46:29 UTC - Ali Ahmed: and check for 200 Ok
----
2018-10-01 18:46:57 UTC - Nicolas Ha: that would work for me yes, thank you 
:slightly_smiling_face: Do you know by any chance if it requires authentication?
----
2018-10-01 18:52:14 UTC - Nicolas Ha: (I’ll try and see)
----
2018-10-01 19:05:09 UTC - Nicolas Ha: no need for auth it seems 
:slightly_smiling_face: thanks Ahmed
----
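The check suggested above (request the broker configuration endpoint and look 
for 200 OK) could be sketched as follows. `BrokerHealthCheck` and `isAlive` 
are hypothetical names, the two-second timeouts are arbitrary, and the sketch 
assumes the endpoint needs no authentication, as observed in this thread; a 
secured deployment may behave differently.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class BrokerHealthCheck {
    // Returns true only if GET {baseUrl}/admin/brokers/configuration answers 200.
    // baseUrl is expected to look like "http://{broker-host}:8080".
    static boolean isAlive(String baseUrl) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(baseUrl + "/admin/brokers/configuration");
            connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(2000); // milliseconds, arbitrary choice
            connection.setReadTimeout(2000);
            return connection.getResponseCode() == 200;
        } catch (IOException e) {
            // Malformed URL, connection refused, timeout: treat all as "not alive".
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
```

A CI step or probe wrapper would call `BrokerHealthCheck.isAlive(...)` with 
the broker's base URL and fail when it returns false.
----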
2018-10-02 05:05:59 UTC - Nathanial Murphy: So I'm trying to backfill my pulsar 
instance with data from a datasource on a single partition. What are the common 
bottlenecks with pulsar that I can avoid to speed this up? I need to keep this 
topic to a single partition for the strict/total ordering guarantees
----
2018-10-02 05:18:48 UTC - Nathanial Murphy: also, second question - are there 
any plans to support distributed transactions across topics like kafka 
currently does?
----
2018-10-02 05:32:25 UTC - Matteo Merli: Make sure you publish asynchronously, 
to pipeline messages from client to broker and achieve higher throughput 
----
2018-10-02 05:33:01 UTC - Matteo Merli: Yes, there are plans to get into that 
as well
----
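A minimal sketch of the asynchronous, pipelined publishing advice above. To 
stay self-contained it does not use the Pulsar client: `AsyncSender` is a 
hypothetical stand-in for a producer whose asynchronous send returns a 
`CompletableFuture`, and `PipelinedPublisher` and the `maxInFlight` cap are 
illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for a client producer with an asynchronous send.
interface AsyncSender {
    CompletableFuture<Long> sendAsync(byte[] payload);
}

final class PipelinedPublisher {
    // Sends every payload without waiting for the previous acknowledgement,
    // keeping at most maxInFlight sends outstanding. Order is preserved because
    // sends are issued sequentially from one thread. Returns the number of
    // acknowledged messages; join() throws if any send failed.
    static int publishAll(AsyncSender sender, List<byte[]> payloads, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<Long>> pending = new ArrayList<>();
        for (byte[] payload : payloads) {
            permits.acquireUninterruptibly();               // back-pressure on the caller
            CompletableFuture<Long> ack = sender.sendAsync(payload);
            ack.whenComplete((id, error) -> permits.release());
            pending.add(ack);
        }
        // Wait once, at the end, for everything still in flight.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return pending.size();
    }
}
```

The contrast is with a loop that blocks on each send before issuing the next 
one, which pays a full client-to-broker round trip per message.
----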
2018-10-02 05:34:32 UTC - Nathanial Murphy: you're the real mvp @Matteo Merli
last question - what's the easiest way to get the last published message to a 
topic
----
2018-10-02 05:50:31 UTC - Matteo Merli: from a producer’s perspective? you mean 
after a crash or when publishing?
----
2018-10-02 05:57:14 UTC - Nathanial Murphy: After a crash. I need to be able to 
know where to resume on both the Pulsar and the mysql binlog side.
----
2018-10-02 05:59:41 UTC - Matteo Merli: Take a look at 
<http://pulsar.apache.org/docs/en/cookbooks-deduplication/>
----
2018-10-02 06:00:31 UTC - Matteo Merli: and 
<https://streaml.io/blog/pulsar-effectively-once> for a more prosaic version
----
2018-10-02 06:01:17 UTC - Matteo Merli: Once the deduplication is enabled, you 
can use `long lastSequenceId = producer.getLastSequenceId();` to fetch what was 
the last message published by a particular producer
----
2018-10-02 06:05:18 UTC - Nathanial Murphy: can I use that sequence ID to look 
up a message ID?
----
2018-10-02 06:05:40 UTC - Nathanial Murphy: I'm trying to resume another stream 
from a separate system to get data into pulsar - in this case, a (filename, 
byte offset) tuple from a mysql binlog. This information is available in the 
last message my producer published to a given topic. I can guarantee that my 
producer is the only one writing to this topic, and that my topic only covers 
one partition.
----
2018-10-02 06:27:50 UTC - Matteo Merli: You can assign any meaning to the 
sequence id, as long as it’s monotonically increasing (jumping ahead is fine)
----
2018-10-02 06:28:32 UTC - Matteo Merli: I mean, when you publish a message, you 
can specify the sequence id.
----
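One way to read the suggestion above is to pack the resume position into the 
sequence id itself. The sketch below is only an illustration under stated 
assumptions: binlog files are identified by a numeric index (for example the 
suffix of a name like `mysql-bin.000123`), byte offsets fit in 40 bits, and 
`BinlogSequence` is a hypothetical helper, not part of any client library. 
The packed value grows whenever the (file index, offset) pair moves forward, 
which matches the monotonically increasing requirement.

```java
final class BinlogSequence {
    private static final int OFFSET_BITS = 40;                                  // low bits: byte offset
    private static final long MAX_OFFSET = (1L << OFFSET_BITS) - 1;
    private static final long MAX_FILE_INDEX = (1L << (63 - OFFSET_BITS)) - 1;  // keeps the result non-negative

    // Packs (fileIndex, byteOffset) into one long that preserves their ordering.
    static long encode(long fileIndex, long byteOffset) {
        if (fileIndex < 0 || fileIndex > MAX_FILE_INDEX) {
            throw new IllegalArgumentException("file index out of range: " + fileIndex);
        }
        if (byteOffset < 0 || byteOffset > MAX_OFFSET) {
            throw new IllegalArgumentException("byte offset out of range: " + byteOffset);
        }
        return (fileIndex << OFFSET_BITS) | byteOffset;
    }

    // Recovers the file index from a sequence id produced by encode().
    static long fileIndex(long sequenceId) {
        return sequenceId >>> OFFSET_BITS;
    }

    // Recovers the byte offset from a sequence id produced by encode().
    static long byteOffset(long sequenceId) {
        return sequenceId & MAX_OFFSET;
    }
}
```

After a restart, the value returned by `producer.getLastSequenceId()` could 
then be decoded back into a file index and offset to resume from, provided 
every message was published with an id built this way.
----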
2018-10-02 06:33:21 UTC - Nathanial Murphy: Okay, sure. I'd rather not encode a 
filepath in a sequenceID though. Is there any way of getting the last published 
message on a topic and/or partition?
----
2018-10-02 06:36:24 UTC - Matteo Merli: Not directly. You could use a Reader, 
but it can only be positioned on the oldest message, a specific message, or the 
latest message (excluded). There’s currently no option to position on the 
latest message “included” :confused:
----
2018-10-02 06:40:48 UTC - Nathanial Murphy: hm. I could apply an incredibly 
aggressive compaction scheme to minimise the number of reads as it doesn't make 
sense to "compact" this topic
----
2018-10-02 06:41:04 UTC - Nathanial Murphy: idk. You've given me a lot to 
ruminate on. Thanks.
----
