Yes, I think all I am saying is that stdin/stdout aren't so bad. I think the mistake Hadoop streaming made was not specifying a more detailed, extensible control protocol for the client process (or at least that seemed to be true a long-ass time ago when I last used Hadoop streaming). You would presumably model commands like commit as outputs and config and such as inputs, and you would need some predetermined data format to read and write the topic/partition/key/value pairs. Sounds like Storm did a better job here.
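To make that concrete, here is a minimal sketch of what such a protocol could look like from the child's side, assuming one JSON object per line. The message types ("config", "message", "send", "commit") and field names are made up for illustration; they are not an existing Samza or Hadoop streaming format, and real keys and values would be bytes that need some encoding (base64, say) rather than plain strings.

```python
import io
import json


def write_msg(out, msg):
    """Write one message as a single line of JSON and flush."""
    out.write(json.dumps(msg) + "\n")
    out.flush()


def read_msgs(inp):
    """Yield one decoded message per non-empty input line."""
    for line in inp:
        line = line.strip()
        if line:
            yield json.loads(line)


def run_child(inp, out, process):
    """Child-side loop: config arrives as input, sends and commits leave as output."""
    config = {}
    for msg in read_msgs(inp):
        if msg["type"] == "config":
            config = msg["values"]
        elif msg["type"] == "message":
            for topic, key, value in process(config, msg):
                write_msg(out, {"type": "send", "topic": topic,
                                "key": key, "value": value})
            # Hypothetical commit command: asks the parent to checkpoint this offset.
            write_msg(out, {"type": "commit", "topic": msg["topic"],
                            "partition": msg["partition"],
                            "offset": msg["offset"]})


def upper_case(config, msg):
    """Example task: upper-case each value and send it to the configured topic."""
    yield config["output.topic"], msg["key"], msg["value"].upper()


if __name__ == "__main__":
    # Simulate the parent's side of the pipe with in-memory streams.
    parent_to_child = io.StringIO()
    write_msg(parent_to_child, {"type": "config",
                                "values": {"output.topic": "out"}})
    write_msg(parent_to_child, {"type": "message", "topic": "in", "partition": 0,
                                "offset": 7, "key": "k1", "value": "hello"})
    parent_to_child.seek(0)
    child_to_parent = io.StringIO()
    run_child(parent_to_child, child_to_parent, upper_case)
    print(child_to_parent.getvalue(), end="")
```

In a real job the two streams would be sys.stdin and sys.stdout, and the parent would be the Samza task that owns the subprocess. Committing after every message is only for brevity; the child would normally decide when to emit a commit.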
As you say, this protocol needn't go over stdin/stdout, but that is fairly cheap and I think unix domain sockets are not well supported in Java. And you're right that you definitely don't get the full power out of the box. For example, the key-value store would have to be in the child process and all we could provide would be the backing changelog stream. Basically I am saying the maintenance burden of doing this seems low--it's just a simple Samza job that manages a native subprocess and feeds it formatted input--so there is no harm in pursuing both approaches...

-Jay

On Mon, Mar 10, 2014 at 6:10 PM, Martin Kleppmann <[email protected]> wrote:

> If we want to make non-JVM languages first-class citizens in Samza (which
> I think we should), they will need access to all the Samza APIs, including
> reading and writing the key-value store and the like.
>
> Were you envisaging that such API calls could be made over a protocol that
> uses stdout and stdin as its transport? I'm sure it can be done, but the
> protocol would not be totally straightforward, as it would have to do
> things like match a request from the child process (on stdout) with a
> response to the API call (on stdin).
>
> Then there is a separate question of what transport the non-JVM process
> uses to talk to the Samza container. stdout/stdin streams is an option, or
> it could use Unix sockets, or TCP. Whatever method of transport is used, a
> protocol will need to encode the Samza API calls in raw bytes.
>
> For reference, Storm uses a JSON-based protocol over stdin/stdout:
> https://github.com/nathanmarz/storm/wiki/Multilang-protocol -- however,
> it doesn't support all Storm features
> (https://issues.apache.org/jira/browse/STORM-151).
>
> Martin
>
> On 10 Mar 2014, at 23:21, Jay Kreps <[email protected]> wrote:
>
> > FWIW, I actually think the Hadoop streaming approach has some benefits.
> > It is less efficient than writing and embedding a C library but also
> > much much easier to implement and with less duplicate logic. I think we
> > should be open to both of these--the streaming approach is so easy, it
> > seems to me like there is not a huge downside to having that available.
> >
> > I think the mistake that Hadoop streaming might have made was
> > over-simplifying the interaction with the client process. You probably
> > need a richer protocol than just the data (though I haven't thought this
> > through).
> >
> > -Jay
> >
> > On Mon, Mar 10, 2014 at 12:26 PM, Jakob Homan <[email protected]> wrote:
> >
> >> Hey Dave-
> >> Thanks for taking a look at Samza. No one in the community is currently
> >> working on this at the moment, to our knowledge. SAMZA-18
> >> (https://issues.apache.org/jira/browse/SAMZA-18) has the beginnings of a
> >> discussion about creating a single C library to help provide
> >> multilanguage support in Samza (which I believe would be accessible to
> >> Go as well). There's currently no JIRA for Hadoop-style streaming, but
> >> one could certainly be created and it would be something we'd be
> >> interested in.
> >> Thanks,
> >> Jakob
> >>
> >> On Mon, Mar 10, 2014 at 12:19 PM, Dave Revell <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> We're considering using Samza for our high-throughput stream processing
> >>> workload, but we don't want to rewrite all of our existing Go code.
> >>> We're considering writing something analogous to Hadoop Streaming,
> >>> where the Samza consumer would start an external process and
> >>> communicate with it by passing protobufs via stdin/stdout. We like
> >>> Samza's fault tolerance, state management, and load balancing features
> >>> and don't want to rewrite them.
> >>>
> >>> This possibility is mentioned in the documentation
> >>> (http://samza.incubator.apache.org/learn/documentation/0.7.0/comparisons/storm.html,
> >>> search for "stdin") as something that might exist some day. My
> >>> questions are:
> >>>
> >>> 1. Is anyone working on this, or planning to? I couldn't find any
> >>> related JIRAs.
> >>> 2. Any advice for implementing this? Are there any challenges that
> >>> might not be obvious?
> >>> 3. Should we try to merge this upstream?
> >>>
> >>> Thanks a bunch,
> >>> Dave
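
On the point in the quoted thread about matching a request from the child process (on stdout) with a response to the API call (on stdin): one rough way to picture it is to tag each request with an id and have the child set aside any ordinary input that arrives while it waits for the reply. The sketch below assumes one JSON object per line; the "kv.get" method name and the "id" and "reply_to" fields are invented for illustration and are not part of any Samza or Storm API.

```python
import collections
import io
import itertools
import json


class ApiClient:
    """Child-side helper: sends requests on `out`, reads replies from `inp`."""

    def __init__(self, inp, out):
        self.inp = inp
        self.out = out
        self.ids = itertools.count(1)
        self.backlog = collections.deque()  # non-reply lines seen while waiting

    def call(self, method, **params):
        """Send one request and block until the reply carrying its id arrives."""
        req_id = next(self.ids)
        request = {"id": req_id, "method": method, "params": params}
        self.out.write(json.dumps(request) + "\n")
        self.out.flush()
        # readline() returns "" at end of stream, which ends the loop.
        for line in iter(self.inp.readline, ""):
            msg = json.loads(line)
            if msg.get("reply_to") == req_id:
                return msg["result"]
            self.backlog.append(msg)  # ordinary input; handle after the call
        raise EOFError("stream closed before reply to request %d" % req_id)


if __name__ == "__main__":
    # The parent's stream interleaves an ordinary input message with the reply.
    stdin = io.StringIO(
        json.dumps({"type": "message", "key": "k1", "value": "hello"}) + "\n"
        + json.dumps({"reply_to": 1, "result": "42"}) + "\n"
    )
    stdout = io.StringIO()
    client = ApiClient(stdin, stdout)
    print(client.call("kv.get", key="count"))
    print(list(client.backlog))
```

This is the blocking, single-threaded version. A child that wants several API calls in flight at once would need a reader thread and a table of pending ids instead of the simple backlog.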
