Cool! A prototype would be good, but for something like this, where
potentially many client languages would integrate, I think it would make
sense to write a quick wiki page on how the protocol might work and hash
that out up-front or in parallel...

-Jay


On Tue, Mar 11, 2014 at 9:54 AM, Dave Revell <[email protected]> wrote:

> OP here, thanks to everyone for your thoughts. Replies inline:
>
> In reply to Jakob, thanks for the pointer to SAMZA-18, that's good to know.
> I probably won't hold my breath waiting for it though :)
>
> In reply to Martin:
>
> > Were you envisaging that such API calls could be made over a protocol
> > that uses stdout and stdin as its transport? I'm sure it can be done, but
> > the protocol would not be totally straightforward, as it would have to do
> > things like match a request from the child process (on stdout) with a
> > response to the API call (on stdin).
>
> Samza currently has the nice property of having one message in flight
> at a time per partition. I think we'd want to preserve that and avoid
> introducing pipelining (multiple messages in flight). That means we
> could use a synchronous protocol over stdin/stdout that doesn't need
> to match requests to responses. Please correct me if I'm wrong on
> this; I'm still very new to Samza.
>
> In reply to Jay, I take your point about Hadoop Streaming. Adding
> lifecycle/control messages to the protocol seems beneficial. We could also
> allow the external process to access the KV store in the JVM via the
> stdin/stdout protocol. We'd also want to use an encoding scheme that
> doesn't have problems with separators (Hadoop Streaming has trouble with
> data that contains internal tabs and newlines). I think I'd prefer
> protobufs over JSON for performance reasons, but it could be configurable.
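Length-prefixed framing would sidestep the separator problem entirely,
since the payload bytes are never scanned for delimiters. A minimal
sketch (the 4-byte big-endian prefix is an assumed convention, and
either protobuf or JSON bytes could ride inside the frame):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeFrame prepends a 4-byte big-endian length, so the payload is
// never scanned for delimiters: embedded tabs and newlines are safe.
func encodeFrame(payload []byte) []byte {
	out := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(out, uint32(len(payload)))
	copy(out[4:], payload)
	return out
}

// decodeFrame consumes one frame from b and returns the payload and
// any remaining bytes (the start of the next frame).
func decodeFrame(b []byte) (payload, rest []byte, err error) {
	if len(b) < 4 {
		return nil, nil, errors.New("short frame header")
	}
	n := binary.BigEndian.Uint32(b)
	if uint32(len(b)-4) < n {
		return nil, nil, errors.New("truncated frame body")
	}
	return b[4 : 4+n], b[4+n:], nil
}

func main() {
	// A payload full of the separators that trip up Hadoop Streaming:
	msg := []byte("key\twith\ttabs\nand\nnewlines")
	payload, _, err := decodeFrame(encodeFrame(msg))
	fmt.Println(err == nil && string(payload) == string(msg)) // true
}
```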
>
> So it seems like this is doable. There are some design questions but maybe
> a proof of concept would be a good next step. We'll be in touch if/when
> that happens. If anyone else wants to try it, we'd welcome that also :)
>
> Cheers,
> Dave
>
>
> On Tue, Mar 11, 2014 at 8:58 AM, Jay Kreps <[email protected]> wrote:
>
> > Yes, I think all I am saying is that stdin/stdout aren't so bad. The
> > mistake made by Hadoop streaming, I think, was to not specify a more
> > detailed, extensible control protocol for the client process (or at
> > least that seemed to be true a long-ass time ago when I last used
> > Hadoop streaming). You would presumably model commands like commit as
> > outputs, and config and such as inputs, and you would need some
> > predetermined data format to read and write the
> > topic/partition/key/value pairs. Sounds like Storm did a better job
> > here.
> >
> > As you say, this protocol needn't go over stdin/stdout, but it is
> > fairly cheap, and I think Unix domain sockets are not well supported
> > in Java.
> >
> > As you say, you definitely don't get the full power out of the box.
> > For example, the key-value store would have to be in the child
> > process, and all we could provide would be the backing changelog
> > stream.
> >
> > Basically I am saying the maintenance burden of doing this seems
> > low--it's just a simple Samza job that manages a native subprocess
> > and feeds it formatted input--so there is no harm in pursuing both
> > approaches...
> >
> > -Jay
> >
> >
> > On Mon, Mar 10, 2014 at 6:10 PM, Martin Kleppmann
> > <[email protected]> wrote:
> >
> > > If we want to make non-JVM languages first-class citizens in Samza
> > > (which I think we should), they will need access to all the Samza
> > > APIs, including reading and writing the key-value store and the
> > > like.
> > >
> > > Were you envisaging that such API calls could be made over a
> > > protocol that uses stdout and stdin as its transport? I'm sure it
> > > can be done, but the protocol would not be totally straightforward,
> > > as it would have to do things like match a request from the child
> > > process (on stdout) with a response to the API call (on stdin).
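If the protocol did allow more than one outstanding call, the usual
fix for the matching problem Martin describes is a correlation ID on
every request, echoed back in the response. A minimal sketch of the
matching logic (field and method names are invented, and this
single-goroutine version omits the locking a real client would need):

```go
package main

import "fmt"

// Request and Response carry a correlation ID so replies arriving in
// any order can be matched to their originating call.
type Request struct {
	ID     uint64
	Method string // e.g. a hypothetical "kv-get" or "kv-put" API call
	Body   []byte
}

type Response struct {
	ID   uint64
	Body []byte
}

// Dispatcher matches responses to pending requests by ID.
type Dispatcher struct {
	nextID  uint64
	pending map[uint64]chan Response
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{pending: make(map[uint64]chan Response)}
}

// Send registers a pending call and returns the wire request to
// transmit plus the channel its response will be delivered on.
func (d *Dispatcher) Send(method string, body []byte) (Request, chan Response) {
	d.nextID++
	ch := make(chan Response, 1)
	d.pending[d.nextID] = ch
	return Request{ID: d.nextID, Method: method, Body: body}, ch
}

// Deliver routes an incoming response to whoever is waiting on it.
func (d *Dispatcher) Deliver(r Response) {
	if ch, ok := d.pending[r.ID]; ok {
		delete(d.pending, r.ID)
		ch <- r
	}
}

func main() {
	d := NewDispatcher()
	req1, ch1 := d.Send("kv-get", []byte("key-a"))
	req2, ch2 := d.Send("kv-get", []byte("key-b"))
	// Responses may come back out of order:
	d.Deliver(Response{ID: req2.ID, Body: []byte("value-b")})
	d.Deliver(Response{ID: req1.ID, Body: []byte("value-a")})
	fmt.Println(string((<-ch1).Body), string((<-ch2).Body)) // value-a value-b
}
```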
> > >
> > > Then there is a separate question of what transport the non-JVM
> > > process uses to talk to the Samza container. stdin/stdout streams
> > > are an option, or it could use Unix sockets, or TCP. Whatever
> > > transport is used, a protocol will need to encode the Samza API
> > > calls in raw bytes.
> > >
> > > For reference, Storm uses a JSON-based protocol over stdin/stdout:
> > > https://github.com/nathanmarz/storm/wiki/Multilang-protocol --
> > > however, it doesn't support all Storm features
> > > (https://issues.apache.org/jira/browse/STORM-151).
> > >
> > > Martin
> > >
> > > On 10 Mar 2014, at 23:21, Jay Kreps <[email protected]> wrote:
> > > > FWIW, I actually think the Hadoop streaming approach has some
> > > > benefits. It is less efficient than writing and embedding a C
> > > > library, but also much, much easier to implement, with less
> > > > duplicated logic. I think we should be open to both of these--the
> > > > streaming approach is so easy, it seems to me like there is not a
> > > > huge downside to having that available.
> > > >
> > > > I think the mistake that Hadoop streaming might have made was
> > > > over-simplifying the interaction with the client process. You
> > > > probably need a richer protocol than just the data (though I
> > > > haven't thought this through).
> > > >
> > > > -Jay
> > > >
> > > >
> > > > On Mon, Mar 10, 2014 at 12:26 PM, Jakob Homan
> > > > <[email protected]> wrote:
> > > >
> > > >> Hey Dave-
> > > >>   Thanks for taking a look at Samza.  To our knowledge, no one
> > > >> in the community is working on this at the moment.  SAMZA-18
> > > >> (https://issues.apache.org/jira/browse/SAMZA-18) has the
> > > >> beginnings of a discussion about creating a single C library to
> > > >> help provide multilanguage support in Samza (which I believe
> > > >> would be accessible to Go as well).  There's currently no JIRA
> > > >> for Hadoop-style streaming, but one could certainly be created,
> > > >> and it would be something we'd be interested in.
> > > >> Thanks,
> > > >> Jakob
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Mar 10, 2014 at 12:19 PM, Dave Revell
> > > >> <[email protected]> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> We're considering using Samza for our high-throughput stream
> > > >>> processing workload, but we don't want to rewrite all of our
> > > >>> existing Go code. We're considering writing something analogous
> > > >>> to Hadoop Streaming, where the Samza consumer would start an
> > > >>> external process and communicate with it by passing protobufs
> > > >>> via stdin/stdout. We like Samza's fault tolerance, state
> > > >>> management, and load balancing features and don't want to
> > > >>> rewrite them.
> > > >>>
> > > >>> This possibility is mentioned in the documentation
> > > >>> (http://samza.incubator.apache.org/learn/documentation/0.7.0/comparisons/storm.html,
> > > >>> search for "stdin") as something that might exist some day. My
> > > >>> questions are:
> > > >>>
> > > >>> 1. Is anyone working on this, or planning to? I couldn't find
> > > >>> any related JIRAs.
> > > >>> 2. Any advice for implementing this? Are there any challenges
> > > >>> that might not be obvious?
> > > >>> 3. Should we try to merge this upstream?
> > > >>>
> > > >>> Thanks a bunch,
> > > >>> Dave
> > > >>>
