Re: Multilang, C and binary data

Xiang Wang Tue, 22 Mar 2016 22:12:31 -0700

Hi,

You may want to look at: http://demeter.inf.ed.ac.uk/cross/stormcpp.html
It may help you run Storm with binary C code.

I am a Storm beginner and cannot help with your other questions.

-------------------------------
Xiang Wang PhD Candidate
Database Research Group
School of Computer Science and Engineering
The University of New South Wales
Sydney, Australia

On Wed, Mar 23, 2016 at 2:37 PM, Xin Wang <data.xinw...@gmail.com> wrote:

> I have provided an implementation, `MessagePackSerializer`, for improving
> multi-lang performance (PR: https://github.com/apache/storm/pull/1136).
> You can take a look at it. It's not merged yet.
>
>
> Thanks,
> Xin
>
>
> 2016-03-22 18:37 GMT+08:00 Brian Candler <b.cand...@pobox.com>:
>
>> Hello,
>>
>> I have some questions about external workers and the multi-lang protocol.
>> We have a bunch of existing C code for running processing steps over binary
>> data and I'm looking to see how feasible it is to hook it into Storm.
>>
>>
>> (1) Is it possible to handle binary data with multi-lang? Or is there
>> existing support for hooking C into Storm?
>>
>> The multi-lang protocol is JSON, so that implies either base64-encoding
>> everything or passing round a URL to where the binary data is stored.
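
To put a rough number on the base64 option mentioned above, here is a minimal Python sketch. The message shape and the function name `json_message_size` are made up for illustration and are not taken from the Storm sources; only the base64 arithmetic (4 output bytes for every 3 input bytes) is the point.

```python
import base64
import json
import os


def json_message_size(payload: bytes) -> int:
    """Size in bytes of a JSON message that carries payload as base64 text."""
    encoded = base64.b64encode(payload).decode("ascii")
    # Hypothetical message shape; a real multi-lang message has more fields.
    message = json.dumps({"tuple": [encoded]})
    return len(message.encode("utf-8"))


payload = os.urandom(3 * 1024 * 1024)  # 3 MiB of random binary data
size = json_message_size(payload)
print(f"raw: {len(payload)} bytes, as JSON/base64: {size} bytes")
print(f"overhead: {size / len(payload):.2f}x")
```

So a 50MB chunk would travel as roughly 67MB of text, before counting the cost of encoding and decoding on each side.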
>>
>> But looking at the source I see that topology.multilang.serializer is
>> pluggable, so perhaps it's possible to make a version using (e.g.) MsgPack?
>> Ah yes:
>> https://github.com/pystorm/pystorm/issues/5
>>
>> So maybe there's a C library comparable to pystorm? Or I can use this
>> serializer to talk msgpack to a spawned C process?
>>
>>
>> (2) Is there a practical maximum size to a tuple? In some cases we have
>> chunks of around 50MB to pass from step to step. Is it reasonable to pass
>> these directly? Or should they be written into some intermediate store like
>> an NFS server?
>>
>>
>> (3) http://storm.apache.org/documentation/Multilang-protocol.html
>> "The shell bolt protocol is asynchronous. You will receive tuples on
>> STDIN as soon as they are available"
>>
>> So just to be clear: it's fine for me to write a multi-threaded external
>> process which handles multiple overlapping requests?
>>
>> Furthermore: if all the threads are busy, can I simply stop reading from
>> stdin and let the sender block until I'm ready to receive more tuples?
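
As a concrete picture of the scheme described in (3), here is a minimal Python sketch of an external process that reads framed messages and hands them to worker threads through a bounded queue. It relies on the multi-lang framing rule that each JSON message is followed by a line containing only `end`. The names `read_message`, `reader` and `worker`, the queue size, and the demo input are all hypothetical, the initial handshake is left out, and whether Storm copes well with a reader that stops consuming stdin is exactly the open question, which this sketch does not answer.

```python
import io
import json
import queue
import threading


def read_message(stream):
    """Read one message: JSON text terminated by a line containing only 'end'."""
    lines = []
    for line in stream:
        if line.rstrip("\n") == "end":
            return json.loads("".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def reader(stream, work):
    """Feed messages into a bounded queue; put() blocks while the queue is full."""
    while (msg := read_message(stream)) is not None:
        work.put(msg)
    work.put(None)  # sentinel: no more input


def worker(work, results):
    """Process messages until the sentinel is seen, then pass it on."""
    while (msg := work.get()) is not None:
        results.append(msg["id"])  # stand-in for the real processing step
    work.put(None)  # let the other workers see the sentinel too


# Demo input standing in for sys.stdin.
demo = io.StringIO(
    '{"id": "1", "tuple": ["a"]}\nend\n'
    '{"id": "2", "tuple": ["b"]}\nend\n'
)
work = queue.Queue(maxsize=1)  # small bound so the reader stalls when workers are busy
results = []
threads = [threading.Thread(target=worker, args=(work, results)) for _ in range(2)]
for t in threads:
    t.start()
reader(demo, work)
for t in threads:
    t.join()
print(sorted(results))
```

While the queue is full, `reader` is stuck in `put()` and nothing is read from the input stream, which is the "stop reading from stdin" behaviour asked about.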
>>
>>
>> I also have some general questions about the Storm architecture.
>>
>> (4) http://storm.apache.org/documentation/Concepts.html
>>
>> "Shuffle grouping: Tuples are randomly distributed across the bolt's
>> tasks in a way such that each bolt is guaranteed to get an equal number of
>> tuples."
>>
>> Suppose the bolt's tasks are split across two servers, one of which is
>> slower than the other. Does this mean that the slower server will be 100%
>> utilised while the faster server will have idle periods? Or is there some
>> flow-control mechanism which kicks into play and gives a larger share to
>> the faster server?
>>
>> Specifically I am thinking of:
>> - A heterogeneous cluster, where some servers are older and slower than
>> others
>> - A cluster where one server happens to be busier than another (e.g. it
>> is also working on a different topology)
>>
>> Through googling I found topology.max.spout.pending, so I see there is an
>> overall control of the number of in-flight (unacked) tuples, except for
>> unreliable spouts:
>> http://stackoverflow.com/questions/24413088/storm-max-spout-pending
>>
>> But other than that, will the shuffle grouping deal them out as fast as
>> possible into the downstream bolts?
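
The concern in (4) can be made concrete with a toy calculation, assuming tuples really are split equally between tasks and nothing else throttles the flow. This is a simplified model, not a description of Storm's scheduler or buffering; the function name `even_split_makespan` and the rates are invented for illustration.

```python
def even_split_makespan(n_tuples, rates):
    """Drain time and idle fractions when n_tuples are split equally.

    rates holds each task's processing speed in tuples per second.
    """
    share = n_tuples / len(rates)
    finish = [share / r for r in rates]      # time each task needs for its share
    total = max(finish)                      # the slowest task sets the pace
    idle = [1 - f / total for f in finish]   # fraction of time each task waits
    return total, idle


# One task on a fast server (100 tuples/s), one on a slow server (50 tuples/s).
total, idle = even_split_makespan(1000, [100.0, 50.0])
print(total, idle)  # 10.0 [0.5, 0.0]
```

Under these assumptions the fast task sits idle half the time, which is the situation the question describes.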
>>
>>
>> (5)
>> http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
>>
>> This says that a single thread (executor) can run multiple task instances
>> of the same component.
>>
>> How does that work? That is, if those multiple tasks are in the same
>> thread, how do they run concurrently? Or if they can't run concurrently,
>> what is the benefit of having multiple tasks in a thread instead of just
>> one task?
>>
>>
>> (6) How does Storm distribute tasks over workers and servers? For
>> example, suppose spout A connects to bolt B. I have two servers, and I run
>> a topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get 4A on
>> one server and 4B on the other, or 2A+2B on both, or something else?
>>
>>
>> Many thanks,
>>
>> Brian Candler.
>>
>>
>
