I have provided a `MessagePackSerializer` implementation for improving multi-lang performance (PR: https://github.com/apache/storm/pull/1136). You can take a look at it; it's not merged yet.
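Until a binary serializer like that is merged, the stock multi-lang protocol frames every message as JSON followed by a line containing only "end", so binary tuple fields have to be base64-encoded on the way out and decoded on the way in. A minimal Python sketch of that workaround (the function names and single-field payload are illustrative, not from the PR):

```python
import base64
import json

def encode_multilang_emit(binary_payload, stream=None):
    """Frame an 'emit' command for Storm's JSON multi-lang protocol.

    Raw bytes are not valid JSON, so the payload is base64-encoded
    here; the receiving shell component must decode it again.
    """
    msg = {
        "command": "emit",
        "tuple": [base64.b64encode(binary_payload).decode("ascii")],
    }
    # A multi-lang message is the JSON text, a newline, then "end\n".
    frame = json.dumps(msg) + "\nend\n"
    if stream is not None:
        stream.write(frame)
    return frame

def decode_multilang_field(field):
    """Recover the original bytes from a base64-encoded tuple field."""
    return base64.b64decode(field)
```

base64 inflates the payload by roughly a third, which is one reason a MessagePack serializer is attractive for large binary chunks.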
Thanks,
Xin

2016-03-22 18:37 GMT+08:00 Brian Candler <b.cand...@pobox.com>:

> Hello,
>
> I have some questions about external workers and the multi-lang protocol.
> We have a bunch of existing C code for running processing steps over binary
> data and I'm looking to see how feasible it is to hook it into Storm.
>
> (1) Is it possible to handle binary data with multi-lang? Or is there
> existing support for hooking C into Storm?
>
> The multi-lang protocol is JSON, so that implies either base64-encoding
> everything or passing round a URL to where the binary data is stored.
>
> But looking at the source I see that topology.multilang.serializer is
> pluggable, so perhaps it's possible to make a version using (e.g.) MsgPack?
> Ah yes: https://github.com/pystorm/pystorm/issues/5
>
> So maybe there's a C library comparable to pystorm? Or I can use this
> serializer to talk msgpack to a spawned C process?
>
> (2) Is there a practical maximum size to a tuple? In some cases we have
> chunks of around 50MB to pass from step to step. Is it reasonable to pass
> these directly? Or should they be written into some intermediate store like
> an NFS server?
>
> (3) http://storm.apache.org/documentation/Multilang-protocol.html
> "The shell bolt protocol is asynchronous. You will receive tuples on STDIN
> as soon as they are available."
>
> So just to be clear: it's fine for me to write a multi-threaded external
> process which handles multiple overlapping requests?
>
> Furthermore: if all the threads are busy, can I simply stop reading from
> stdin and let the sender block until I'm ready to receive more tuples?
>
> I also have some general questions about the Storm architecture.
>
> (4) http://storm.apache.org/documentation/Concepts.html
> "Shuffle grouping: Tuples are randomly distributed across the bolt's
> tasks in a way such that each bolt is guaranteed to get an equal number of
> tuples."
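On question (3) above: because the shell bolt protocol is asynchronous, an external process has to keep draining framed messages from stdin even while earlier tuples are still being processed. A hedged Python sketch of the frame reader (the threading/handoff strategy is up to you; a C implementation would follow the same framing):

```python
import json

def read_multilang_messages(stream):
    """Yield decoded JSON messages from a multi-lang stream.

    Each message is JSON text followed by a line containing only
    "end". A multi-threaded consumer would hand each yielded message
    to a worker and immediately loop back for the next frame. If all
    workers are busy, simply not reading lets the OS pipe buffer fill
    up, which back-pressures the sending side.
    """
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(lines))
            lines = []
        else:
            lines.append(line)
```

For example, two frames arriving back-to-back are parsed independently, so the consumer can dispatch the first while the second is already being read.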
>
> Suppose the bolt's tasks are split across two servers, one of which is
> slower than the other. Does this mean that the slower server will be 100%
> utilised while the faster servers will have idle periods? Or is there some
> flow-control mechanism which kicks into play and gives a larger share to
> the faster servers?
>
> Specifically I am thinking of:
> - A heterogeneous cluster, where some servers are older and slower than
>   others
> - A cluster where one server happens to be busier than another (e.g. it is
>   also working on a different topology)
>
> Through googling I found topology.max.spout.pending, so I see there is an
> overall control of the number of in-flight (unacked) tuples, except for
> unreliable spouts:
> http://stackoverflow.com/questions/24413088/storm-max-spout-pending
>
> But other than that, will the shuffle grouping deal them out as fast as
> possible into the downstream bolts?
>
> (5) http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
>
> This says that a single thread (executor) can run multiple task instances
> of the same component.
>
> How does that work? That is, if those multiple tasks are in the same
> thread, how do they run concurrently? Or if they can't run concurrently,
> what is the benefit of having multiple tasks in a thread instead of just
> one task?
>
> (6) How does Storm distribute tasks over workers and servers? For example,
> suppose spout A connects to bolt B. I have two servers, and I run a
> topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get 4A on
> one server and 4B on the other, or 2A+2B on both, or something else?
>
> Many thanks,
>
> Brian Candler.
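On question (5): tasks inside one executor do not run concurrently. The executor thread invokes each task's execute() serially, in arrival order; the benefit of having more tasks than executors is that a topology's task count is fixed for its lifetime, while the executor count can later be changed with `storm rebalance` to spread those same tasks over more threads. A toy Python simulation of that serial multiplexing (the class and method names are illustrative, not Storm's API):

```python
class Task:
    """Stand-in for one task instance of a bolt component."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.processed = []

    def execute(self, tup):
        self.processed.append(tup)

class Executor:
    """A single 'thread' that serially drives several task instances.

    Tuples routed to different task ids are still handled one at a
    time, in order -- there is no intra-executor concurrency.
    """
    def __init__(self, tasks):
        self.tasks = {t.task_id: t for t in tasks}

    def run(self, incoming):
        # incoming: iterable of (task_id, tuple) pairs, e.g. what a
        # grouping would route to this executor's task ids
        for task_id, tup in incoming:
            self.tasks[task_id].execute(tup)
```

So with two tasks in one executor, routing tuples to task 1 and task 2 alternately still processes them strictly one after another on the same thread.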