I have provided a `MessagePackSerializer` implementation for improving multi-lang performance (PR: https://github.com/apache/storm/pull/1136). You can take a look at it; it's not merged yet.
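Until a binary serializer like that is merged, the stock multi-lang protocol frames every message as JSON followed by a line containing only "end", so binary tuple fields have to be base64-encoded on the way out and decoded on the way in. A minimal Python sketch of that workaround (the function names and single-field payload are illustrative, not from the PR):

```python
import base64
import json

def encode_multilang_emit(binary_payload, stream=None):
    """Frame an 'emit' command for Storm's JSON multi-lang protocol.

    Raw bytes are not valid JSON, so the payload is base64-encoded
    here; the receiving shell component must decode it again.
    """
    msg = {
        "command": "emit",
        "tuple": [base64.b64encode(binary_payload).decode("ascii")],
    }
    # A multi-lang message is the JSON text, a newline, then "end\n".
    frame = json.dumps(msg) + "\nend\n"
    if stream is not None:
        stream.write(frame)
    return frame

def decode_multilang_field(field):
    """Recover the original bytes from a base64-encoded tuple field."""
    return base64.b64decode(field)
```

base64 inflates the payload by roughly a third, which is one reason a MessagePack serializer is attractive for large binary chunks.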
Thanks,
Xin

2016-03-22 18:37 GMT+08:00 Brian Candler <b.cand...@pobox.com>:

> Hello,
>
> I have some questions about external workers and the multi-lang protocol.
> We have a bunch of existing C code for running processing steps over binary
> data and I'm looking to see how feasible it is to hook it into Storm.
>
> (1) Is it possible to handle binary data with multi-lang? Or is there
> existing support for hooking C into Storm?
>
> The multi-lang protocol is JSON, so that implies either base64-encoding
> everything or passing round a URL to where the binary data is stored.
>
> But looking at the source I see that topology.multilang.serializer is
> pluggable, so perhaps it's possible to make a version using (e.g.) MsgPack?
> Ah yes: https://github.com/pystorm/pystorm/issues/5
>
> So maybe there's a C library comparable to pystorm? Or I can use this
> serializer to talk msgpack to a spawned C process?
>
> (2) Is there a practical maximum size to a tuple? In some cases we have
> chunks of around 50MB to pass from step to step. Is it reasonable to pass
> these directly? Or should they be written into some intermediate store like
> an NFS server?
>
> (3) http://storm.apache.org/documentation/Multilang-protocol.html
> "The shell bolt protocol is asynchronous. You will receive tuples on STDIN
> as soon as they are available."
>
> So just to be clear: it's fine for me to write a multi-threaded external
> process which handles multiple overlapping requests?
>
> Furthermore: if all the threads are busy, can I simply stop reading from
> stdin and let the sender block until I'm ready to receive more tuples?
>
> I also have some general questions about the Storm architecture.
>
> (4) http://storm.apache.org/documentation/Concepts.html
> "Shuffle grouping: Tuples are randomly distributed across the bolt's
> tasks in a way such that each bolt is guaranteed to get an equal number of
> tuples."
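On question (3) above: because the shell bolt protocol is asynchronous, an external process has to keep draining framed messages from stdin even while earlier tuples are still being processed. A hedged Python sketch of the frame reader (the threading/handoff strategy is up to you; a C implementation would follow the same framing):

```python
import json

def read_multilang_messages(stream):
    """Yield decoded JSON messages from a multi-lang stream.

    Each message is JSON text followed by a line containing only
    "end". A multi-threaded consumer would hand each yielded message
    to a worker and immediately loop back for the next frame. If all
    workers are busy, simply not reading lets the OS pipe buffer fill
    up, which back-pressures the sending side.
    """
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(lines))
            lines = []
        else:
            lines.append(line)
```

For example, two frames arriving back-to-back are parsed independently, so the consumer can dispatch the first while the second is already being read.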
>
> Suppose the bolt's tasks are split across two servers, one of which is
> slower than the other. Does this mean that the slower server will be 100%
> utilised while the faster servers will have idle periods? Or is there some
> flow-control mechanism which kicks into play and gives a larger share to
> the faster servers?
>
> Specifically I am thinking of:
> - A heterogeneous cluster, where some servers are older and slower than
>   others
> - A cluster where one server happens to be busier than another (e.g. it is
>   also working on a different topology)
>
> Through googling I found topology.max.spout.pending, so I see there is an
> overall control of the number of in-flight (unacked) tuples, except for
> unreliable spouts:
> http://stackoverflow.com/questions/24413088/storm-max-spout-pending
>
> But other than that, will the shuffle grouping deal them out as fast as
> possible into the downstream bolts?
>
> (5) http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
>
> This says that a single thread (executor) can run multiple task instances
> of the same component.
>
> How does that work? That is, if those multiple tasks are in the same
> thread, how do they run concurrently? Or if they can't run concurrently,
> what is the benefit of having multiple tasks in a thread instead of just
> one task?
>
> (6) How does Storm distribute tasks over workers and servers? For example,
> suppose spout A connects to bolt B. I have two servers, and I run a
> topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get 4A on
> one server and 4B on the other, or 2A+2B on both, or something else?
>
> Many thanks,
>
> Brian Candler.
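On question (5): tasks inside one executor do not run concurrently. The executor thread invokes each task's execute() serially, in arrival order; the benefit of having more tasks than executors is that a topology's task count is fixed for its lifetime, while the executor count can later be changed with `storm rebalance` to spread those same tasks over more threads. A toy Python simulation of that serial multiplexing (the class and method names are illustrative, not Storm's API):

```python
class Task:
    """Stand-in for one task instance of a bolt component."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.processed = []

    def execute(self, tup):
        self.processed.append(tup)

class Executor:
    """A single 'thread' that serially drives several task instances.

    Tuples routed to different task ids are still handled one at a
    time, in order -- there is no intra-executor concurrency.
    """
    def __init__(self, tasks):
        self.tasks = {t.task_id: t for t in tasks}

    def run(self, incoming):
        # incoming: iterable of (task_id, tuple) pairs, e.g. what a
        # grouping would route to this executor's task ids
        for task_id, tup in incoming:
            self.tasks[task_id].execute(tup)
```

So with two tasks in one executor, routing tuples to task 1 and task 2 alternately still processes them strictly one after another on the same thread.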