OP here, thanks to everyone for your thoughts. Replies inline:

In reply to Jakob, thanks for the pointer to SAMZA-18, that's good to know. I probably won't hold my breath waiting for it though :)

In reply to Martin:

> Were you envisaging that such API calls could be made over a protocol
> that uses stdout and stdin as its transport? I'm sure it can be done, but
> the protocol would not be totally straightforward, as it would have to do
> things like match a request from the child process (on stdout) with a
> response to the API call (on stdin).

Samza currently has the nice property of having one message in flight at a time per partition. I think we'd want to preserve that and avoid adding any pipelining by having multiple messages in flight. So that means we could use a synchronous protocol over stdin/stdout that doesn't need to match requests to responses. Please correct me if I'm wrong on this, I'm still very new to Samza.

In reply to Jay, I take your point about Hadoop Streaming. Adding lifecycle/control messages to the protocol seems beneficial. We could also allow the external process to access the KV store in the JVM via the stdin/stdout protocol. We'd also want to use an encoding scheme that doesn't have problems with separators (Hadoop Streaming has trouble with data that contains internal tabs and newlines). I think I'd prefer protobufs over JSON for performance reasons, but it could be configurable.

So it seems like this is doable. There are some design questions but maybe a proof of concept would be a good next step. We'll be in touch if/when that happens. If anyone else wants to try it, we'd welcome that also :)

Cheers,
Dave

On Tue, Mar 11, 2014 at 8:58 AM, Jay Kreps <[email protected]> wrote:

> Yes, I think all I am saying is that stdin/stdout aren't so bad. The
> mistake made by Hadoop streaming I think was to not specify a more detailed
> extensible control protocol for the client process. (or at least that
> seemed to be true a long-ass time ago when I last used Hadoop streaming).
> You would presumably model commands like commit as outputs and config and
> such as input and you would need some predetermined data format to read and
> write the topic/partition/key/value pairs. Sounds like Storm did a better
> job here.
>
> As you say this protocol needn't go over stdin/out but it is fairly cheap
> and I think unix domain sockets are not well supported in Java.
>
> As you say you definitely don't get the full power out of the box. For
> example the key-value store would have to be in the child process and all
> we could provide would be the backing changelog stream.
>
> Basically I am saying the maintenance burden of doing this seems low--it's
> just a simple samza job that manages a native subprocess and feeds it
> formatted input--so there is no harm in pursuing both approaches...
>
> -Jay
>
>
> On Mon, Mar 10, 2014 at 6:10 PM, Martin Kleppmann <[email protected]> wrote:
>
> > If we want to make non-JVM languages first-class citizens in Samza (which
> > I think we should), they will need access to all the Samza APIs, including
> > reading and writing the key-value store and the like.
> >
> > Were you envisaging that such API calls could be made over a protocol that
> > uses stdout and stdin as its transport? I'm sure it can be done, but the
> > protocol would not be totally straightforward, as it would have to do
> > things like match a request from the child process (on stdout) with a
> > response to the API call (on stdin).
> >
> > Then there is a separate question of what transport the non-JVM process
> > uses to talk to the Samza container. stdout/stdin streams is an option, or
> > it could use Unix sockets, or TCP. Whatever method of transport is used, a
> > protocol will need to encode the Samza API calls in raw bytes.
> >
> > For reference, Storm uses a JSON-based protocol over stdin/stdout:
> > https://github.com/nathanmarz/storm/wiki/Multilang-protocol -- however,
> > it doesn't support all Storm features (
> > https://issues.apache.org/jira/browse/STORM-151).
> >
> > Martin
> >
> > On 10 Mar 2014, at 23:21, Jay Kreps <[email protected]> wrote:
> > > FWIW, I actually think the Hadoop streaming approach has some benefits. It
> > > is less efficient than writing and embedding a C library but also much much
> > > easier to implement and with less duplicate logic. I think we should be
> > > open to both of these--the streaming approach is so easy, it seems to me
> > > like there is not a huge downside to having that available.
> > >
> > > I think the mistake that Hadoop streaming might have made was
> > > over-simplifying the interaction with the client process. You probably need
> > > a richer protocol than just the data (though I haven't thought this
> > > through).
> > >
> > > -Jay
> > >
> > >
> > > On Mon, Mar 10, 2014 at 12:26 PM, Jakob Homan <[email protected]> wrote:
> > >
> > >> Hey Dave-
> > >> Thanks for taking a look at Samza. No one in the community is currently
> > >> working on this at the moment, to our knowledge. SAMZA-18 (
> > >> https://issues.apache.org/jira/browse/SAMZA-18) has the beginnings of a
> > >> discussion about creating a single C library to help provide multilanguage
> > >> support in Samza (which I believe would be accessible to Go as well).
> > >> There's currently no JIRA for Hadoop-style streaming, but one could
> > >> certainly be created and it would be something we'd be interested in.
> > >> Thanks,
> > >> Jakob
> > >>
> > >>
> > >>
> > >> On Mon, Mar 10, 2014 at 12:19 PM, Dave Revell <[email protected]> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> We're considering using Samza for our high-throughput stream processing
> > >>> workload, but we don't want to rewrite all of our existing Go code.
> > >>> We're considering writing something analogous to Hadoop Streaming, where the
> > >>> Samza consumer would start an external process and communicate with it by
> > >>> passing protobufs via stdin/stdout. We like Samza's fault tolerance, state
> > >>> management, and load balancing features and don't want to rewrite them.
> > >>>
> > >>> This possibility is mentioned in the documentation (
> > >>> http://samza.incubator.apache.org/learn/documentation/0.7.0/comparisons/storm.html,
> > >>> search for "stdin") as something that might exist some day. My questions
> > >>> are:
> > >>>
> > >>> 1. Is anyone working on this, or planning to? I couldn't find any related
> > >>> JIRAs.
> > >>> 2. Any advice for implementing this? Are there any challenges that might
> > >>> not be obvious?
> > >>> 3. Should we try to merge this upstream?
> > >>>
> > >>> Thanks a bunch,
> > >>> Dave
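
To make the synchronous stdin/stdout idea from the top reply a bit more concrete, here is a rough sketch of what the child-process side might look like in Go. Everything in it is an assumption for illustration and nothing here is defined by Samza: the 4-byte big-endian length prefix, the 16 MiB frame cap, and the names writeFrame, readFrame, serve and demoRoundTrip are all hypothetical. Payloads are treated as opaque bytes (in practice they might be serialized protobufs), so tabs and newlines inside the data cause no trouble, and because only one message is in flight at a time, no request IDs are needed to match replies to requests.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// maxFrame is an assumed sanity limit on payload size (hypothetical value).
const maxFrame = 16 << 20

// writeFrame writes a 4-byte big-endian length followed by the payload.
// The framing format is an assumption for this sketch.
func writeFrame(w io.Writer, payload []byte) error {
	if len(payload) > maxFrame {
		return fmt.Errorf("frame too large: %d bytes", len(payload))
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one length-prefixed payload. It returns io.EOF only
// when the stream ends cleanly between frames.
func readFrame(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrame {
		return nil, fmt.Errorf("frame too large: %d bytes", n)
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// serve is the child's main loop: read one message, handle it, write
// exactly one reply, flush. One message in flight means the parent can
// pair each reply with the request it just sent.
func serve(in io.Reader, out io.Writer, handle func([]byte) []byte) error {
	r := bufio.NewReader(in)
	w := bufio.NewWriter(out)
	for {
		msg, err := readFrame(r)
		if err == io.EOF {
			return nil // parent closed the pipe
		}
		if err != nil {
			return err
		}
		if err := writeFrame(w, handle(msg)); err != nil {
			return err
		}
		if err := w.Flush(); err != nil {
			return err
		}
	}
}

// demoRoundTrip plays the parent's role with in-memory buffers standing
// in for the two pipes: send one message, run the child loop, read the reply.
func demoRoundTrip(payload []byte, handle func([]byte) []byte) ([]byte, error) {
	var toChild, fromChild bytes.Buffer
	if err := writeFrame(&toChild, payload); err != nil {
		return nil, err
	}
	if err := serve(&toChild, &fromChild, handle); err != nil {
		return nil, err
	}
	return readFrame(&fromChild)
}

func main() {
	// A real child process would call serve(os.Stdin, os.Stdout, handler).
	reply, err := demoRoundTrip([]byte("key\tvalue\nwith separators"),
		func(m []byte) []byte { return append([]byte("ack:"), m...) })
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", reply)
}
```

Lifecycle/control messages (init, commit, shutdown) and KV-store calls could ride on the same framing by putting a message type inside the payload; that part is left out of the sketch.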
