Re: Thoughts and observations on Samza

Jordan Shaw Mon, 13 Jul 2015 11:27:13 -0700

Jay,
I think doing this iteratively in smaller chunks is a better way to go as
new issues arise. As Navina said Kafka is a "stream system" and Samza is a
"stream processor" and those two ideas should be mutually exclusive.


-Jordan

On Mon, Jul 13, 2015 at 10:06 AM, Jay Kreps <[email protected]> wrote:

> Hmm, thought about this more. Maybe this is just too much too quick.
> Overall I think there is some enthusiasm for the proposal but it's not
> really unanimous enough to make any kind of change this big cleanly. The
> board doesn't really like the merging stuff, users are concerned about
> compatibility, I didn't feel there was unanimous agreement on dropping
> SystemConsumer, etc. Even if this is the right end state to get to,
> probably trying to push all this through at once isn't the right way to do
> it.
>
> So let me propose a kind of fifth (?) option which I think is less dramatic
> and lets things happen gradually. I think this is kind of like combining
> the first part of Yi's proposal and Jakob's third option, leaving the rest
> to be figured out incrementally:
>
> Option 5: We continue the prototype I shared and propose that as a kind of
> "transformer" client API in Kafka. This isn't really a full-fledged stream
> processing layer, more like a souped-up consumer api for munging topics.
> This would let us figure out some of the technical bits, how to do this on
> Kafka's group management features, how to integrate the txn feature to do
> the exactly-once stuff in these transformations, and get all this stuff
> solid. This api would have valid uses in its own right, especially when
> your transformation will be embedded inside an existing service or
> application which isn't possible with Samza (or other existing systems that
> I know of).
>
> Independently we can iterate on some of the ideas of the original proposal
> individually and figure out how (if at all) to make use of this
> functionality. This can be done bit-by-bit:
> - Could be that the existing StreamTask API ends up wrapping this
> - Could end up exposed directly in Samza as Yi proposed
> - Could be that just the lower-level group-management stuff gets used, and
> in this case it could be either just for standalone mode, or always
> - Could be that it stays as-is
>
> The advantage of this is it is lower risk...we basically don't have to make
> 12 major decisions all at once that kind of hinge on what amounts to a
> pretty aggressive rewrite. The disadvantage of this is it is a bit more
> confusing as all this is getting figured out.
>
> As with some of the other stuff, this would require a further discussion in
> the Kafka community if people do like this approach.
>
> Thoughts?
>
> -Jay
>
>
>
>
> On Sun, Jul 12, 2015 at 10:52 PM, Jay Kreps <[email protected]> wrote:
>
> > Hey Chris,
> >
> > Yeah, I'm obviously in favor of this.
> >
> > The sub-project approach seems the ideal way to take a graceful step in
> > this direction, so I will ping the board folks and see why they are
> > discouraged, it would be good to understand that. If we go that route we
> > would need to do a similar discussion in the Kafka list (but makes sense to
> > figure out first if it is what Samza wants).
> >
> > Irrespective of how it's implemented, though, to me the important things
> > are the following:
> > 1. Unify the website, config, naming, docs, metrics, etc--basically fix
> > the product experience so the "stream" and the "processing" feel like a
> > single user experience and brand. This seems minor but I think is a really
> > big deal.
> > 2. Make "standalone" mode a first class citizen and have a real technical
> > plan to be able to support cluster managers other than YARN.
> > 3. Make the config and out-of-the-box experience more usable
> >
> > I think that prototype gives a practical example of how 1-3 could be done
> > and we should pursue it. This is a pretty radical change, so I wouldn't be
> > shocked if people didn't want to take a step like that.
> >
> > Maybe it would make sense to see if people are on board with that general
> > idea, and then try to get some advice on sub-projects in parallel and nail
> > down those details?
> >
> > -Jay
> >
> > On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <[email protected]>
> > wrote:
> >
> >> Hey all,
> >>
> >> I want to start by saying that I'm absolutely thrilled to be a part of
> >> this community. The amount of level-headed, thoughtful, educated
> >> discussion that's gone on over the past ~10 days is overwhelming.
> >> Wonderful.
> >>
> >> It seems like discussion is waning a bit, and we've reached some
> >> conclusions. There are several key emails in this thread, which I want to
> >> call out:
> >>
> >> 1. Jakob's summary of the three potential ways forward.
> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
> >> 2. Julian's call out that we should be focusing on community over code.
> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
> >> 3. Martin's summary about the benefits of merging communities.
> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
> >> 4. Jakob's comments about the distinction between community and code
> >> paths.
> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E
> >>
> >> I agree with the comments on all of these emails. I think Martin's summary
> >> of his position aligns very closely with my own. To that end, I think we
> >> should get concrete about what the proposal is, and call a vote on it.
> >> Given that Jay, Martin, and I seem to be aligning fairly closely, I think
> >> we should start with:
> >>
> >> 1. [community] Make Samza a subproject of Kafka.
> >> 2. [community] Make all Samza PMC/committers committers of the subproject.
> >> 3. [community] Migrate Samza's website/documentation into Kafka's.
> >> 4. [code] Have the Samza community and the Kafka community start a
> >> from-scratch reboot together in the new Kafka subproject. We can
> >> borrow/copy & paste significant chunks of code from Samza's code base.
> >> 5. [code] The subproject would intentionally eliminate support for both
> >> other streaming systems and all deployment systems.
> >> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26
> >> (copy cat)
> >> 7. [code] Attempt to provide a bridge from the new subproject's processor
> >> interface to our legacy StreamTask interface.
> >> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka
> >> subproject that has a fault-tolerant container with state management.
> >>
> >> It's likely that (6) and (7) won't be fully drop-in. Still, the closer we
> >> can get, the better it's going to be for our existing community.
> >>
> >> One thing that I didn't touch on with (2) is whether any Samza PMC members
> >> should be rolled into Kafka PMC membership as well (though, Jay and Jakob
> >> are already PMC members on both). I think that Samza's community deserves
> >> a voice on the PMC, so I'd propose that we roll at least a few PMC members
> >> into the Kafka PMC, but I don't have a strong framework for which people
> >> to pick.
> >>
> >> Before (8), I think that Samza's TLP can continue to commit bug fixes and
> >> patches as it sees fit, provided that we openly communicate that we won't
> >> necessarily migrate new features to the new subproject, and that the TLP
> >> will be shut down after the migration to the Kafka subproject occurs.
> >>
> >> Jakob, I could use your guidance here about how to achieve this from
> >> an Apache process perspective (sorry).
> >>
> >> * Should I just call a vote on this proposal?
> >> * Should it happen on dev or private?
> >> * Do committers have binding votes, or just PMC?
> >>
> >> Having trouble finding much detail on the Apache wikis. :(
> >>
> >> Cheers,
> >> Chris
> >>
> >> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <[email protected]> wrote:
> >>
> >> > Thanks, Jay. This argument persuaded me actually. :)
> >> >
> >> > Fang, Yan
> >> > [email protected]
> >> >
> >> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <[email protected]> wrote:
> >> >
> >> > > Hey Yan,
> >> > >
> >> > > Yeah philosophically I think the argument is that you should capture
> >> > > the stream in Kafka independent of the transformation. This is
> >> > > obviously a Kafka-centric view point.
> >> > >
> >> > > Advantages of this:
> >> > > - In practice I think this is what e.g. Storm people often end up
> >> > > doing anyway. You usually need to throttle any access to a live
> >> > > serving database.
> >> > > - Can have multiple subscribers and they get the same thing without
> >> > > additional load on the source system.
> >> > > - Applications can tap into the stream if need be by subscribing.
> >> > > - You can debug your transformation by tailing the Kafka topic with
> >> > > the console consumer
> >> > > - Can tee off the same data stream for batch analysis or Lambda arch
> >> > > style re-processing
> >> > >
> >> > > The disadvantage is that it will use Kafka resources. But the idea is
> >> > > eventually you will have multiple subscribers to any data source (at
> >> > > least for monitoring) so you will end up there soon enough anyway.
> >> > >
> >> > > Down the road the technical benefit is that I think it gives us a good
> >> > > path towards end-to-end exactly once semantics from source to
> >> > > destination. Basically the connectors need to support idempotence when
> >> > > talking to Kafka and we need the transactional write feature in Kafka to
> >> > > make the transformation atomic. This is actually pretty doable if you
> >> > > separate the connector=>kafka problem from the generic transformations,
> >> > > which are always kafka=>kafka. However I think it is quite impossible to
> >> > > do in an all_things => all_things environment. Today you can say "well
> >> > > the semantics of the Samza APIs depend on the connectors you use" but it
> >> > > is actually worse than that because the semantics actually depend on the
> >> > > pairing of connectors--so not only can you probably not get a usable
> >> > > "exactly once" guarantee end-to-end, it can actually be quite hard to
> >> > > reverse engineer what property (if any) your end-to-end flow has if you
> >> > > have heterogeneous systems.
> >> > >
> >> > > -Jay
> >> > >
> >> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <[email protected]>
> >> wrote:
> >> > >
> >> > > > {quote}
> >> > > > maintained in a separate repository and retaining the existing
> >> > > > committership but sharing as much else as possible (website, etc)
> >> > > > {quote}
> >> > > >
> >> > > > Overall, I agree on this idea. Now the question is more about "how to
> >> > > > do it".
> >> > > >
> >> > > > On the other hand, one thing I want to point out is that, if we decide
> >> > > > to go this way, how do we want to support the
> >> > > > otherSystem-transformation-otherSystem use case?
> >> > > >
> >> > > > Basically, there are four user groups here:
> >> > > >
> >> > > > 1. Kafka-transformation-Kafka
> >> > > > 2. Kafka-transformation-otherSystem
> >> > > > 3. otherSystem-transformation-Kafka
> >> > > > 4. otherSystem-transformation-otherSystem
> >> > > >
> >> > > > For group 1, they can easily use the new Samza library to achieve this.
> >> > > > For group 2 and 3, they can use copyCat -> transformation -> Kafka or
> >> > > > Kafka -> transformation -> copyCat.
> >> > > >
> >> > > > The problem is for group 4. Do we want to abandon this or still support
> >> > > > it? Of course, this use case can be achieved by using copyCat ->
> >> > > > transformation -> Kafka -> transformation -> copyCat; the thing is how
> >> > > > we persuade them to do this long chain. If yes, it will be a win for
> >> > > > Kafka too. Or if there is no one in this community actually doing this
> >> > > > so far, maybe it is ok to not support group 4 directly.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Fang, Yan
> >> > > > [email protected]
> >> > > >
> >> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <[email protected]>
> >> wrote:
> >> > > >
> >> > > > > Yeah I agree with this summary. I think there are kind of two
> >> > > > > questions here:
> >> > > > > 1. Technically does alignment/reliance on Kafka make sense
> >> > > > > 2. Branding wise (naming, website, concepts, etc) does alignment
> >> > > > > with Kafka make sense
> >> > > > >
> >> > > > > Personally I do think both of these things would be really
> >> > > > > valuable, and would dramatically alter the trajectory of the
> >> > > > > project.
> >> > > > >
> >> > > > > My preference would be to see if people can mostly agree on a
> >> > > > > direction rather than splintering things off. From my point of view
> >> > > > > the ideal outcome of all the options discussed would be to make
> >> > > > > Samza a closely aligned subproject, maintained in a separate
> >> > > > > repository and retaining the existing committership but sharing as
> >> > > > > much else as possible (website, etc). No idea about how these things
> >> > > > > work; Jakob, you probably know more.
> >> > > > >
> >> > > > > No discussion amongst the Kafka folks has happened on this, but
> >> > > > > likely we should figure out what the Samza community actually wants
> >> > > > > first.
> >> > > > >
> >> > > > > I admit that this is a fairly radical departure from how things are.
> >> > > > >
> >> > > > > If that doesn't fly, I think, yeah we could leave Samza as it is and
> >> > > > > do the more radical reboot inside Kafka. From my point of view that
> >> > > > > does leave things in a somewhat confusing state since now there are
> >> > > > > two stream processing systems more or less coupled to Kafka in large
> >> > > > > part made by the same people. But, arguably that might be a cleaner
> >> > > > > way to make the cut-over and perhaps less risky for Samza community
> >> > > > > since if it works people can switch and if it doesn't nothing will
> >> > > > > have changed. Dunno, how do people feel about this?
> >> > > > >
> >> > > > > -Jay
> >> > > > >
> >> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <
> [email protected]>
> >> > > wrote:
> >> > > > >
> >> > > > > > > This leads me to thinking that merging projects and
> >> > > > > > > communities might be a good idea: with the union of
> >> > > > > > > experience from both communities, we will probably build a
> >> > > > > > > better system that is better for users.
> >> > > > > > Is this what's being proposed though? Merging the projects seems
> >> > > > > > like a consequence of at most one of the three directions under
> >> > > > > > discussion:
> >> > > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka for
> >> > > > > > configuration, etc. (to a greater or lesser extent to be
> >> > > > > > determined) but the Samza community would not automatically merge
> >> > > > > > with the Kafka community (the Phoenix/HBase example is a good one
> >> > > > > > here).
> >> > > > > > 2) Samza Reboot: The Samza community continues to exist with a
> >> > > > > > limited project scope, but similarly would not need to be part of
> >> > > > > > the Kafka community (ie given committership) to progress. Here,
> >> > > > > > maybe the Samza team would become a subproject of Kafka (the Board
> >> > > > > > frowns on subprojects at the moment, so I'm not sure if that's even
> >> > > > > > feasible), but that would not be required.
> >> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the
> >> > > > > > Kafka team builds its own streaming library, possibly off of Jay's
> >> > > > > > prototype, which has no direct lineage to the Samza team. There's
> >> > > > > > no reason for the Kafka team to bring in the Samza team.
> >> > > > > >
> >> > > > > > Is the Kafka community on board with this?
> >> > > > > >
> >> > > > > > To be clear, all three options under discussion are interesting,
> >> > > > > > technically valid and likely healthy directions for the project.
> >> > > > > > Also, they are not mutually exclusive. The Samza community could
> >> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka community went
> >> > > > > > forward with 'Hey Samza!' My points above are directed entirely at
> >> > > > > > the community aspect of these choices.
> >> > > > > > -Jakob
> >> > > > > >
> >> > > > > > On 10 July 2015 at 09:10, Roger Hoover <
> [email protected]>
> >> > > wrote:
> >> > > > > > > That's great.  Thanks, Jay.
> >> > > > > > >
> >> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <
> [email protected]>
> >> > > wrote:
> >> > > > > > >
> >> > > > > > >> Yeah totally agree. I think you have this issue even today,
> >> > > > > > >> right? I.e. if you need to make a simple config change and
> >> > > > > > >> you're running in YARN today you end up bouncing the job which
> >> > > > > > >> then rebuilds state. I think the fix is exactly what you
> >> > > > > > >> described which is to have a long timeout on partition movement
> >> > > > > > >> for stateful jobs so that if a job is just getting bounced, and
> >> > > > > > >> the cluster manager (or admin) is smart enough to restart it on
> >> > > > > > >> the same host when possible, it can optimistically reuse any
> >> > > > > > >> existing state it finds on disk (if it is valid).
> >> > > > > > >>
> >> > > > > > >> So in this model the charter of the CM is to place processes as
> >> > > > > > >> stickily as possible and to restart or re-place failed
> >> > > > > > >> processes. The charter of the partition management system is to
> >> > > > > > >> control the assignment of work to these processes. The nice
> >> > > > > > >> thing about this is that the work assignment, timeouts,
> >> > > > > > >> behavior, configs, and code will all be the same across all
> >> > > > > > >> cluster managers.
> >> > > > > > >>
> >> > > > > > >> So I think that prototype would actually give you exactly what
> >> > > > > > >> you want today for any cluster manager (or manual placement +
> >> > > > > > >> restart script) that was sticky in terms of host placement
> >> > > > > > >> since there is already a configurable partition movement
> >> > > > > > >> timeout and task-by-task state reuse with a check on state
> >> > > > > > >> validity.
> >> > > > > > >>
> >> > > > > > >> -Jay
> >> > > > > > >>
> >> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <
> >> > > > [email protected]
> >> > > > > >
> >> > > > > > >> wrote:
> >> > > > > > >>
> >> > > > > > >> > That would be great to let Kafka do as much heavy lifting as
> >> > > > > > >> > possible and make it easier for other languages to implement
> >> > > > > > >> > Samza apis.
> >> > > > > > >> >
> >> > > > > > >> > One thing to watch out for is the interplay between Kafka's
> >> > > > > > >> > group management and the external scheduler/process manager's
> >> > > > > > >> > fault tolerance. If a container dies, the Kafka group
> >> > > > > > >> > membership protocol will try to assign its tasks to other
> >> > > > > > >> > containers while at the same time the process manager is
> >> > > > > > >> > trying to relaunch the container. Without some consideration
> >> > > > > > >> > for this (like a configurable amount of time to wait before
> >> > > > > > >> > Kafka alters the group membership), there may be thrashing
> >> > > > > > >> > going on which is especially bad for containers with large
> >> > > > > > >> > amounts of local state.
> >> > > > > > >> >
> >> > > > > > >> > Someone else pointed this out already but I thought it might
> >> > > > > > >> > be worth calling out again.
> >> > > > > > >> >
> >> > > > > > >> > Cheers,
> >> > > > > > >> >
> >> > > > > > >> > Roger
> >> > > > > > >> >
> >> > > > > > >> >
> >> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <
> >> [email protected]>
> >> > > > > wrote:
> >> > > > > > >> >
> >> > > > > > >> > > Hey Roger,
> >> > > > > > >> > >
> >> > > > > > >> > > I couldn't agree more. We spent a bunch of time talking
> >> to
> >> > > > people
> >> > > > > > and
> >> > > > > > >> > that
> >> > > > > > >> > > is exactly the stuff we heard time and again. What
> makes
> >> it
> >> > > > hard,
> >> > > > > of
> >> > > > > > >> > > course, is that there is some tension between
> >> compatibility
> >> > > with
> >> > > > > > what's
> >> > > > > > >> > > there now and making things better for new users.
> >> > > > > > >> > >
> >> > > > > > >> > > I also strongly agree with the importance of
> >> multi-language
> >> > > > > > support. We
> >> > > > > > >> > are
> >> > > > > > >> > > talking now about Java, but for application development
> >> use
> >> > > > cases
> >> > > > > > >> people
> >> > > > > > >> > > want to work in whatever language they are using
> >> elsewhere.
> >> > I
> >> > > > > think
> >> > > > > > >> > moving
> >> > > > > > >> > > to a model where Kafka itself does the group
> membership,
> >> > > > lifecycle
> >> > > > > > >> > control,
> >> > > > > > >> > > and partition assignment has the advantage of putting
> all
> >> > that
> >> > > > > > complex
> >> > > > > > >> > > stuff behind a clean api that the clients are already
> >> going
> >> > to
> >> > > > be
> >> > > > > > >> > > implementing for their consumer, so the added
> >> functionality
> >> > > for
> >> > > > > > stream
> >> > > > > > >> > > processing beyond a consumer becomes very minor.
> >> > > > > > >> > >
> >> > > > > > >> > > -Jay
> >> > > > > > >> > >
> >> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <
> >> > > > > > [email protected]>
> >> > > > > > >> > > wrote:
> >> > > > > > >> > >
> >> > > > > > >> > > > Metamorphosis...nice. :)
> >> > > > > > >> > > >
> >> > > > > > >> > > > This has been a great discussion.  As a user of Samza
> >> > who's
> >> > > > > > recently
> >> > > > > > >> > > > integrated it into a relatively large organization, I
> >> just
> >> > > > want
> >> > > > > to
> >> > > > > > >> add
> >> > > > > > >> > > > support to a few points already made.
> >> > > > > > >> > > >
> >> > > > > > >> > > > The biggest hurdles to adoption of Samza as it
> >> currently
> >> > > > exists
> >> > > > > > that
> >> > > > > > >> > I've
> >> > > > > > >> > > > experienced are:
> >> > > > > > >> > > > 1) YARN - YARN is overly complex in many environments
> >> > where
> >> > > > > Puppet
> >> > > > > > >> > would
> >> > > > > > >> > > do
> >> > > > > > >> > > > just fine but it was the only mechanism to get fault
> >> > > > tolerance.
> >> > > > > > >> > > > 2) Configuration - I think I like the idea of
> >> configuring
> >> > > most
> >> > > > > of
> >> > > > > > the
> >> > > > > > >> > job
> >> > > > > > >> > > > in code rather than config files.  In general, I
> think
> >> the
> >> > > > goal
> >> > > > > > >> should
> >> > > > > > >> > be
> >> > > > > > >> > > > to make it harder to make mistakes, especially of the
> >> kind
> >> > > > where
> >> > > > > > the
> >> > > > > > >> > code
> >> > > > > > >> > > > expects something and the config doesn't match.  The
> >> > current
> >> > > > > > config
> >> > > > > > >> is
> >> > > > > > >> > > > quite intricate and error-prone.  For example, the
> >> > > application
> >> > > > > > logic
> >> > > > > > >> > may
> >> > > > > > >> > > > depend on bootstrapping a topic but rather than
> >> asserting
> >> > > that
> >> > > > > in
> >> > > > > > the
> >> > > > > > >> > > code,
> >> > > > > > >> > > > you have to rely on getting the config right.
> Likewise
> >> > with
> >> > > > > > serdes,
> >> > > > > > >> > the
> >> > > > > > >> > > > Java representations produced by various serdes
> (JSON,
> >> > Avro,
> >> > > > > etc.)
> >> > > > > > >> are
> >> > > > > > >> > > not
> >> > > > > > >> > > > equivalent so you cannot just reconfigure a serde
> >> without
> >> > > > > changing
> >> > > > > > >> the
> >> > > > > > >> > > > code.   It would be nice for jobs to be able to
> assert
> >> > what
> >> > > > they
> >> > > > > > >> expect
> >> > > > > > >> > > > from their input topics in terms of partitioning.
> >> This is
> >> > > > > > getting a
> >> > > > > > >> > > little
> >> > > > > > >> > > > off topic but I was even thinking about creating a
> >> "Samza
> >> > > > config
> >> > > > > > >> > linter"
> >> > > > > > >> > > > that would sanity check a set of configs.  Especially
> >> in
> >> > > > > > >> organizations
> >> > > > > > >> > > > where config is managed by a different team than the
> >> > > > application
> >> > > > > > >> > > developer,
> >> > > > > > >> > > > it's very hard to avoid config mistakes.
> >> > > > > > >> > > > 3) Java/Scala centric - for many teams (especially
> >> > > DevOps-type
> >> > > > > > >> folks),
> >> > > > > > >> > > the
> >> > > > > > >> > > > pain of the Java toolchain (maven, slow builds, weak
> >> > command
> >> > > > > line
> >> > > > > > >> > > support,
> >> > > > > > >> > > > configuration over convention) really inhibits
> >> > productivity.
> >> > > > As
> >> > > > > > more
> >> > > > > > >> > and
> >> > > > > > >> > > > more high-quality clients become available for
> Kafka, I
> >> > hope
> >> > > > > > they'll
> >> > > > > > >> > > follow
> >> > > > > > >> > > > Samza's model.  Not sure how much it affects the
> >> proposals
> >> > > in
> >> > > > > this
> >> > > > > > >> > thread
> >> > > > > > >> > > > but please consider other languages in the ecosystem
> as
> >> > > well.
> >> > > > > > From
> >> > > > > > >> > what
> >> > > > > > >> > > > I've heard, Spark has more Python users than
> >> Java/Scala.
> >> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
> >> > > > > > >> > > > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> >> > > > > > >> > > > and are working on a Yeoman generator
> >> > > > > > >> > > > https://github.com/Quantiply/generator-rico for Jython/Samza
> >> > > > > > >> > > > projects to alleviate some of the pain)
> >> > > > > > >> > > >
> >> > > > > > >> > > > I also want to underscore Jay's point about improving
> >> the
> >> > > user
> >> > > > > > >> > > experience.
> >> > > > > > >> > > > That's a very important factor for adoption.  I think
> >> the
> >> > > goal
> >> > > > > > should
> >> > > > > > >> > be
> >> > > > > > >> > > to
> >> > > > > > >> > > > make Samza as easy to get started with as something
> >> like
> >> > > > > Logstash.
> >> > > > > > >> > > > Logstash is vastly inferior in terms of capabilities
> to
> >> > > Samza
> >> > > > > but
> >> > > > > > >> it's
> >> > > > > > >> > > easy
> >> > > > > > >> > > > to get started and that makes a big difference.
> >> > > > > > >> > > >
> >> > > > > > >> > > > Cheers,
> >> > > > > > >> > > >
> >> > > > > > >> > > > Roger
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > >
> >> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De
> Francisci
> >> > > > Morales <
> >> > > > > > >> > > > [email protected]> wrote:
> >> > > > > > >> > > >
> >> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka
> >> Metamorphosis
> >> > > is
> >> > > > a
> >> > > > > > clear
> >> > > > > > >> > > > winner
> >> > > > > > >> > > > > :)
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > --
> >> > > > > > >> > > > > Gianmarco
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci
> >> Morales
> >> > <
> >> > > > > > >> > > [email protected]
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > wrote:
> >> > > > > > >> > > > >
> >> > > > > > >> > > > > > Hi,
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > @Martin, thanks for your comments.
> >> > > > > > >> > > > > > Maybe I'm missing some important point, but I
> think
> >> > > > coupling
> >> > > > > > the
> >> > > > > > >> > > > releases
> >> > > > > > >> > > > > > is actually a *good* thing.
> >> > > > > > >> > > > > > To make an example, would it be better if the MR
> >> and
> >> > > HDFS
> >> > > > > > >> > components
> >> > > > > > >> > > of
> >> > > > > > >> > > > > > Hadoop had different release schedules?
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > Actually, keeping the discussion in a single
> place
> >> > would
> >> > > > > make
> >> > > > > > >> > > agreeing
> >> > > > > > >> > > > on
> >> > > > > > >> > > > > > releases (and backwards compatibility) much
> >> easier, as
> >> > > > > > everybody
> >> > > > > > >> > > would
> >> > > > > > >> > > > be
> >> > > > > > >> > > > > > responsible for the whole codebase.
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > That said, I like the idea of absorbing
> samza-core
> >> as
> >> > a
> >> > > > > > >> > sub-project,
> >> > > > > > >> > > > and
> >> > > > > > >> > > > > > leave the fancy stuff separate.
> >> > > > > > >> > > > > > It probably gives 90% of the benefits we have
> been
> >> > > > > discussing
> >> > > > > > >> here.
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > Cheers,
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > --
> >> > > > > > >> > > > > > Gianmarco
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <
> >> > [email protected]
> >> > > >
> >> > > > > > wrote:
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > >> Hey Martin,
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> I agree coupling release schedules is a
> downside.
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> Definitely we can try to solve some of the
> >> > integration
> >> > > > > > problems
> >> > > > > > >> in
> >> > > > > > >> > > > > >> Confluent Platform or in other distributions.
> But
> >> I
> >> > > think
> >> > > > > > this
> >> > > > > > >> > ends
> >> > > > > > >> > > up
> >> > > > > > >> > > > > >> being really shallow. I guess I feel to really
> >> get a
> >> > > good
> >> > > > > > user
> >> > > > > > >> > > > > experience
> >> > > > > > >> > > > > >> the two systems have to kind of feel like part
> of
> >> the
> >> > > > same
> >> > > > > > thing
> >> > > > > > >> > and
> >> > > > > > >> > > > you
> >> > > > > > >> > > > > >> can't really add that in later--you can put both
> >> in
> >> > the
> >> > > > > same
> >> > > > > > >> > > > > downloadable
> >> > > > > > >> > > > > >> tar file but it doesn't really give a very
> >> cohesive
> >> > > > > feeling.
> >> > > > > > I
> >> > > > > > >> > agree
> >> > > > > > >> > > > > that
> >> > > > > > >> > > > > >> ultimately any of the project stuff is as much
> >> social
> >> > > and
> >> > > > > > naming
> >> > > > > > >> > as
> >> > > > > > >> > > > > >> anything else--theoretically two totally
> >> independent
> >> > > > > projects
> >> > > > > > >> > could
> >> > > > > > >> > > > work
> >> > > > > > >> > > > > >> to
> >> > > > > > >> > > > > >> tightly align. In practice this seems to be
> quite
> >> > > > difficult
> >> > > > > > >> > though.
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> For the frameworks--totally agree it would be
> >> good to
> >> > > > > > maintain
> >> > > > > > >> the
> >> > > > > > >> > > > > >> framework support with the project. In some
> cases
> >> > there
> >> > > > may
> >> > > > > > not
> >> > > > > > >> be
> >> > > > > > >> > > too
> >> > > > > > >> > > > > >> much
> >> > > > > > >> > > > > >> there since the integration gets lighter but I
> >> think
> >> > > > > whatever
> >> > > > > > >> > stubs
> >> > > > > > >> > > > you
> >> > > > > > >> > > > > >> need should be included. So no I definitely
> wasn't
> >> > > trying
> >> > > > > to
> >> > > > > > >> imply
> >> > > > > > >> > > > > >> dropping
> >> > > > > > >> > > > > >> support for these frameworks, just making the
> >> > > integration
> >> > > > > > >> lighter
> >> > > > > > >> > by
> >> > > > > > >> > > > > >> separating process management from partition
> >> > > management.
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> You raise two good points we would have to
> figure
> >> out
> >> > > if
> >> > > > we
> >> > > > > > went
> >> > > > > > >> > > down
> >> > > > > > >> > > > > the
> >> > > > > > >> > > > > >> alignment path:
> >> > > > > > >> > > > > >> 1. With respect to the name, yeah I think the
> >> first
> >> > > > > question
> >> > > > > > is
> >> > > > > > >> > > > whether
> >> > > > > > >> > > > > >> some "re-branding" would be worth it. If so
> then I
> >> > > think
> >> > > > we
> >> > > > > > can
> >> > > > > > >> > > have a
> >> > > > > > >> > > > > big
> >> > > > > > >> > > > > >> thread on the name. I'm definitely not set on
> >> Kafka
> >> > > > > > Streaming or
> >> > > > > > >> > > Kafka
> >> > > > > > >> > > > > >> Streams I was just using them to be kind of
> >> > > > illustrative. I
> >> > > > > > >> agree
> >> > > > > > >> > > with
> >> > > > > > >> > > > > >> your
> >> > > > > > >> > > > > >> critique of these names, though I think people
> >> would
> >> > > get
> >> > > > > the
> >> > > > > > >> idea.
> >> > > > > > >> > > > > >> 2. Yeah you also raise a good point about how to
> >> > > "factor"
> >> > > > > it.
> >> > > > > > >> Here
> >> > > > > > >> > > are
> >> > > > > > >> > > > > the
> >> > > > > > >> > > > > >> options I see (I could get enthusiastic about
> any
> >> of
> >> > > > them):
> >> > > > > > >> > > > > >>    a. One repo for both Kafka and Samza
> >> > > > > > >> > > > > >>    b. Two repos, retaining the current
> separation
> >> > > > > > >> > > > > >>    c. Two repos, the equivalent of samza-api and
> >> > > > samza-core
> >> > > > > > is
> >> > > > > > >> > > > absorbed
> >> > > > > > >> > > > > >> almost like a third client
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> Cheers,
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> -Jay
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin
> Kleppmann <
> >> > > > > > >> > > > [email protected]>
> >> > > > > > >> > > > > >> wrote:
> >> > > > > > >> > > > > >>
> >> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few
> >> > > follow-up
> >> > > > > > >> > comments.
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka or
> >> > becoming
> >> > > a
> >> > > > > > >> > subproject:
> >> > > > > > >> > > > the
> >> > > > > > >> > > > > >> > reasons you mention are good. The risk I see
> is
> >> > that
> >> > > > > > release
> >> > > > > > >> > > > schedules
> >> > > > > > >> > > > > >> > become coupled to each other, which can slow
> >> > everyone
> >> > > > > down,
> >> > > > > > >> and
> >> > > > > > >> > > > large
> >> > > > > > >> > > > > >> > projects with many contributors are harder to
> >> > manage.
> >> > > > > > (Jakob,
> >> > > > > > >> > can
> >> > > > > > >> > > > you
> >> > > > > > >> > > > > >> speak
> >> > > > > > >> > > > > >> > from experience, having seen a wider range of
> >> > Hadoop
> >> > > > > > ecosystem
> >> > > > > > >> > > > > >> projects?)
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > Some of the goals of a better unified
> developer
> >> > > > > experience
> >> > > > > > >> could
> >> > > > > > >> > > > also
> >> > > > > > >> > > > > be
> >> > > > > > >> > > > > >> > solved by integrating Samza nicely into a
> Kafka
> >> > > > > > distribution
> >> > > > > > >> > (such
> >> > > > > > >> > > > as
> >> > > > > > >> > > > > >> > Confluent's). I'm not against merging projects
> >> if
> >> > we
> >> > > > > decide
> >> > > > > > >> > that's
> >> > > > > > >> > > > the
> >> > > > > > >> > > > > >> way
> >> > > > > > >> > > > > >> > to go, just pointing out the same goals can
> >> perhaps
> >> > > > also
> >> > > > > be
> >> > > > > > >> > > achieved
> >> > > > > > >> > > > > in
> >> > > > > > >> > > > > >> > other ways.
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > - With regard to dropping the YARN dependency:
> >> are
> >> > > you
> >> > > > > > >> proposing
> >> > > > > > >> > > > that
> >> > > > > > >> > > > > >> > Samza doesn't give any help to people wanting
> to
> >> > run
> >> > > on
> >> > > > > > >> > > > > >> YARN/Mesos/AWS/etc?
> >> > > > > > >> > > > > >> > So the docs would basically have a link to
> >> Slider
> >> > and
> >> > > > > > nothing
> >> > > > > > >> > > else?
> >> > > > > > >> > > > Or
> >> > > > > > >> > > > > >> > would we maintain integrations with a bunch of
> >> > > popular
> >> > > > > > >> > deployment
> >> > > > > > >> > > > > >> methods
> >> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to
> >> make
> >> > > > Samza
> >> > > > > > work
> >> > > > > > >> > with
> >> > > > > > >> > > > > >> Slider)?
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > I absolutely think it's a good idea to have
> the
> >> > "as a
> >> > > > > > library"
> >> > > > > > >> > and
> >> > > > > > >> > > > > "as a
> >> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for
> >> people
> >> > who
> >> > > > > want
> >> > > > > > >> them,
> >> > > > > > >> > > > but I
> >> > > > > > >> > > > > >> > think there should also be a low-friction path
> >> for
> >> > > > common
> >> > > > > > "as
> >> > > > > > >> a
> >> > > > > > >> > > > > service"
> >> > > > > > >> > > > > >> > deployment methods, for which we probably need
> >> to
> >> > > > > maintain
> >> > > > > > >> > > > > integrations.
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to
> >> me,
> >> > > > > because
> >> > > > > > >> Kafka
> >> > > > > > >> > > is
> >> > > > > > >> > > > > all
> >> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka
> >> Transformers"
> >> > > or
> >> > > > > > "Kafka
> >> > > > > > >> > > > Filters"
> >> > > > > > >> > > > > >> > would be more apt?
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza
> >> (stream
> >> > > > > > >> transformation
> >> > > > > > >> > > > with
> >> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a
> >> library"
> >> > > bit)
> >> > > > > > could
> >> > > > > > >> > > become
> >> > > > > > >> > > > > >> part of
> >> > > > > > >> > > > > >> > Kafka, while higher-level tools such as
> >> streaming
> >> > SQL
> >> > > > and
> >> > > > > > >> > > > integrations
> >> > > > > > >> > > > > >> with
> >> > > > > > >> > > > > >> > deployment frameworks remain in a separate
> >> project?
> >> > > In
> >> > > > > > other
> >> > > > > > >> > > words,
> >> > > > > > >> > > > > >> Kafka
> >> > > > > > >> > > > > >> > would absorb the proven, stable core of Samza,
> >> > which
> >> > > > > would
> >> > > > > > >> > become
> >> > > > > > >> > > > the
> >> > > > > > >> > > > > >> > "third Kafka client" mentioned early in this
> >> > thread.
> >> > > > The
> >> > > > > > Samza
> >> > > > > > >> > > > project
> >> > > > > > >> > > > > >> > would then target that third Kafka client as
> its
> >> > base
> >> > > > > API,
> >> > > > > > and
> >> > > > > > >> > the
> >> > > > > > >> > > > > >> project
> >> > > > > > >> > > > > >> > would be freed up to explore more experimental
> >> new
> >> > > > > > horizons.
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > Martin
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
> >> > > > [email protected]>
> >> > > > > > >> wrote:
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > > Hey Martin,
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually
> >> > don't
> >> > > > > think
> >> > > > > > it
> >> > > > > > >> > ties
> >> > > > > > >> > > > our
> >> > > > > > >> > > > > >> > hands
> >> > > > > > >> > > > > >> > > at all, all it does is refactor things. The
> >> > > division
> >> > > > of
> >> > > > > > >> > > > > >> responsibility is
> >> > > > > > >> > > > > >> > > that Samza core is responsible for task
> >> > lifecycle,
> >> > > > > state,
> >> > > > > > >> and
> >> > > > > > >> > > > > >> partition
> >> > > > > > >> > > > > >> > > management (using the Kafka co-ordinator)
> but
> >> it
> >> > is
> >> > > > NOT
> >> > > > > > >> > > > responsible
> >> > > > > > >> > > > > >> for
> >> > > > > > >> > > > > >> > > packaging, configuration, deployment or
> >> execution
> >> > of
> >> > > > > > >> processes.
> >> > > > > > >> > > The
> >> > > > > > >> > > > > >> > problem
> >> > > > > > >> > > > > >> > > of packaging and starting these processes is
> >> > > > > > >> > > > > >> > > framework/environment-specific. This leaves
> >> > > > individual
> >> > > > > > >> > > frameworks
> >> > > > > > >> > > > to
> >> > > > > > >> > > > > >> be
> >> > > > > > >> > > > > >> > as
> >> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you can
> get
> >> > > simple
> >> > > > > > >> stateless
> >> > > > > > >> > > > > >> support in
> >> > > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf
> app
> >> > > > > framework
> >> > > > > > >> > > (Slider,
> >> > > > > > >> > > > > >> > Marathon,
> >> > > > > > >> > > > > >> > > etc). These are well known by people and
> have
> >> > nice
> >> > > > UIs
> >> > > > > > and a
> >> > > > > > >> > lot
> >> > > > > > >> > > > of
> >> > > > > > >> > > > > >> > > flexibility. I don't think they have node
> >> > affinity
> >> > > > as a
> >> > > > > > >> built
> >> > > > > > >> > in
> >> > > > > > >> > > > > >> option
> >> > > > > > >> > > > > >> > > (though I could be wrong). So if we want
> that
> >> we
> >> > > can
> >> > > > > > either
> >> > > > > > >> > wait
> >> > > > > > >> > > > for
> >> > > > > > >> > > > > >> them
> >> > > > > > >> > > > > >> > > to add it or do a custom framework to add
> that
> >> > > > feature
> >> > > > > > (as
> >> > > > > > >> > now).
> >> > > > > > >> > > > > >> > Obviously
> >> > > > > > >> > > > > >> > > if you manage things with old-school ops
> tools
> >> > > > > > >> > (puppet/chef/etc)
> >> > > > > > >> > > > you
> >> > > > > > >> > > > > >> get
> >> > > > > > >> > > > > >> > > locality easily. The nice thing, though, is
> >> that
> >> > > all
> >> > > > > the
> >> > > > > > >> Samza
> >> > > > > > >> > > > > >> "business
> >> > > > > > >> > > > > >> > > logic" around partition management and fault
> >> > > > tolerance
> >> > > > > > is in
> >> > > > > > >> > > Samza
> >> > > > > > >> > > > > >> core
> >> > > > > > >> > > > > >> > so
> >> > > > > > >> > > > > >> > > it is shared across frameworks and the
> >> framework
> >> > > > > specific
> >> > > > > > >> bit
> >> > > > > > >> > is
> >> > > > > > >> > > > > just
> >> > > > > > >> > > > > >> > > whether it is smart enough to try to get the
> >> same
> >> > > > host
> >> > > > > > when
> >> > > > > > >> a
> >> > > > > > >> > > job
> >> > > > > > >> > > > is
> >> > > > > > >> > > > > >> > > restarted.
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I
> >> think
> >> > > the
> >> > > > > > goal
> >> > > > > > >> > would
> >> > > > > > >> > > > be
> >> > > > > > >> > > > > >> (a)
> >> > > > > > >> > > > > >> > > actually get better alignment in user
> >> experience,
> >> > > and
> >> > > > > (b)
> >> > > > > > >> > > express
> >> > > > > > >> > > > > >> this in
> >> > > > > > >> > > > > >> > > the naming and project branding.
> Specifically:
> >> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the
> >> > > > > > "transformation"
> >> > > > > > >> api
> >> > > > > > >> > > to
> >> > > > > > >> > > > be
> >> > > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be
> >> able
> >> > > to
> >> > > > > > explain
> >> > > > > > >> > > when
> >> > > > > > >> > > > to
> >> > > > > > >> > > > > >> use
> >> > > > > > >> > > > > >> > > the consumer and when to use the stream
> >> > processing
> >> > > > > > >> > functionality
> >> > > > > > >> > > > and
> >> > > > > > >> > > > > >> lead
> >> > > > > > >> > > > > >> > > people into that experience.
> >> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2
> >> (or
> >> > > > > > whatever)
> >> > > > > > >> > that
> >> > > > > > >> > > > has
> >> > > > > > >> > > > > >> both
> >> > > > > > >> > > > > >> > > Kafka and the stream processing part and
> they
> >> > > > actually
> >> > > > > > work
> >> > > > > > >> > > > > together.
> >> > > > > > >> > > > > >> > > 3. Unify the programming experience so the
> >> client
> >> > > and
> >> > > > > > Samza
> >> > > > > > >> > api
> >> > > > > > >> > > > > share
> >> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > I think sub-projects keep separate
> committers
> >> and
> >> > > can
> >> > > > > > have a
> >> > > > > > >> > > > > separate
> >> > > > > > >> > > > > >> > repo,
> >> > > > > > >> > > > > >> > > but I'm actually not really sure (I can't
> >> find a
> >> > > > > > definition
> >> > > > > > >> > of a
> >> > > > > > >> > > > > >> > subproject
> >> > > > > > >> > > > > >> > > in Apache).
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > Basically at a high-level you want the
> >> experience
> >> > > to
> >> > > > > > "feel"
> >> > > > > > >> > > like a
> >> > > > > > >> > > > > >> single
> >> > > > > > >> > > > > >> > > system, not two relatively independent things
> >> that
> >> > > are
> >> > > > > > kind
> >> > > > > > >> of
> >> > > > > > >> > > > > >> awkwardly
> >> > > > > > >> > > > > >> > > glued together.
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > I think if we did that then having naming or
> >> > > branding
> >> > > > > > like
> >> > > > > > >> > > "kafka
> >> > > > > > >> > > > > >> > > streaming" or "kafka streams" or something
> >> like
> >> > > that
> >> > > > > > would
> >> > > > > > >> > > > actually
> >> > > > > > >> > > > > >> do a
> >> > > > > > >> > > > > >> > > good job of conveying what it is. I do think that
> >> this
> >> > > > would
> >> > > > > > help
> >> > > > > > >> > > > adoption
> >> > > > > > >> > > > > >> > quite
> >> > > > > > >> > > > > >> > > a lot as it would correctly convey that
> using
> >> > Kafka
> >> > > > > > >> Streaming
> >> > > > > > >> > > with
> >> > > > > > >> > > > > >> Kafka
> >> > > > > > >> > > > > >> > is
> >> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka is
> >> pretty
> >> > > > > heavily
> >> > > > > > >> > adopted
> >> > > > > > >> > > > at
> >> > > > > > >> > > > > >> this
> >> > > > > > >> > > > > >> > > point.
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > Fwiw we actually considered this model
> >> originally
> >> > > > when
> >> > > > > > open
> >> > > > > > >> > > > sourcing
> >> > > > > > >> > > > > >> > Samza,
> >> > > > > > >> > > > > >> > > however at that time Kafka was relatively
> >> unknown
> >> > > and
> >> > > > > we
> >> > > > > > >> > decided
> >> > > > > > >> > > > not
> >> > > > > > >> > > > > >> to
> >> > > > > > >> > > > > >> > do
> >> > > > > > >> > > > > >> > > it since we felt it would be limiting. From
> my
> >> > > point
> >> > > > of
> >> > > > > > view
> >> > > > > > >> > the
> >> > > > > > >> > > > > three
> >> > > > > > >> > > > > >> > > things have changed (1) Kafka is now really
> >> > heavily
> >> > > > > used
> >> > > > > > for
> >> > > > > > >> > > > stream
> >> > > > > > >> > > > > >> > > processing, (2) we learned that abstracting
> >> out
> >> > the
> >> > > > > > stream
> >> > > > > > >> > well
> >> > > > > > >> > > is
> >> > > > > > >> > > > > >> > > basically impossible, (3) we learned it is
> >> really
> >> > > > hard
> >> > > > > to
> >> > > > > > >> keep
> >> > > > > > >> > > the
> >> > > > > > >> > > > > two
> >> > > > > > >> > > > > >> > > things feeling like a single product.
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > -Jay
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin
> >> Kleppmann
> >> > <
> >> > > > > > >> > > > > >> [email protected]>
> >> > > > > > >> > > > > >> > > wrote:
> >> > > > > > >> > > > > >> > >
> >> > > > > > >> > > > > >> > >> Hi all,
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> Lots of good thoughts here.
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> I agree with the general philosophy of
> tying
> >> > Samza
> >> > > > > more
> >> > > > > > >> > firmly
> >> > > > > > >> > > to
> >> > > > > > >> > > > > >> Kafka.
> >> > > > > > >> > > > > >> > >> After I spent a while looking at
> integrating
> >> > other
> >> > > > > > message
> >> > > > > > >> > > > brokers
> >> > > > > > >> > > > > >> (e.g.
> >> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the
> >> > > > conclusion
> >> > > > > > that
> >> > > > > > >> > > > > >> > SystemConsumer
> >> > > > > > >> > > > > >> > >> tacitly assumes a model so much like
> Kafka's
> >> > that
> >> > > > > pretty
> >> > > > > > >> much
> >> > > > > > >> > > > > nobody
> >> > > > > > >> > > > > >> but
> >> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus is
> >> > perhaps
> >> > > an
> >> > > > > > >> > exception,
> >> > > > > > >> > > > but
> >> > > > > > >> > > > > >> it
> >> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.)
> Thus,
> >> > > making
> >> > > > > > Samza
> >> > > > > > >> > > fully
> >> > > > > > >> > > > > >> > dependent
> >> > > > > > >> > > > > >> > >> on Kafka acknowledges that the
> >> > system-independence
> >> > > > was
> >> > > > > > >> never
> >> > > > > > >> > as
> >> > > > > > >> > > > > real
> >> > > > > > >> > > > > >> as
> >> > > > > > >> > > > > >> > we
> >> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of
> code
> >> > reuse
> >> > > > are
> >> > > > > > >> real.
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has
> >> also
> >> > > > always
> >> > > > > > been
> >> > > > > > >> > > > > >> appealing to
> >> > > > > > >> > > > > >> > >> me, for various reasons already mentioned
> in
> >> > this
> >> > > > > > thread.
> >> > > > > > >> > > > Although
> >> > > > > > >> > > > > >> > making
> >> > > > > > >> > > > > >> > >> Samza jobs deployable on anything
> >> > > > (YARN/Mesos/AWS/etc)
> >> > > > > > >> seems
> >> > > > > > >> > > > > >> laudable,
> >> > > > > > >> > > > > >> > I am
> >> > > > > > >> > > > > >> > >> a little concerned that it will restrict us
> >> to a
> >> > > > > lowest
> >> > > > > > >> > common
> >> > > > > > >> > > > > >> > denominator.
> >> > > > > > >> > > > > >> > >> For example, would host affinity
> (SAMZA-617)
> >> > still
> >> > > > be
> >> > > > > > >> > possible?
> >> > > > > > >> > > > For
> >> > > > > > >> > > > > >> jobs
> >> > > > > > >> > > > > >> > >> with large amounts of state, I think
> >> SAMZA-617
> >> > > would
> >> > > > > be
> >> > > > > > a
> >> > > > > > >> big
> >> > > > > > >> > > > boon,
> >> > > > > > >> > > > > >> > since
> >> > > > > > >> > > > > >> > >> restoring state off the changelog on every
> >> > single
> >> > > > > > restart
> >> > > > > > >> is
> >> > > > > > >> > > > > painful,
> >> > > > > > >> > > > > >> > due
> >> > > > > > >> > > > > >> > >> to long recovery times. It would be a shame
> >> if
> >> > the
> >> > > > > > >> decoupling
> >> > > > > > >> > > > from
> >> > > > > > >> > > > > >> YARN
> >> > > > > > >> > > > > >> > >> made host affinity impossible.
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> Jay, a question about the proposed API for
> >> > > > > > instantiating a
> >> > > > > > >> > job
> >> > > > > > >> > > in
> >> > > > > > >> > > > > >> code
> >> > > > > > >> > > > > >> > >> (rather than a properties file): when
> >> > submitting a
> >> > > > job
> >> > > > > > to a
> >> > > > > > >> > > > > cluster,
> >> > > > > > >> > > > > >> is
> >> > > > > > >> > > > > >> > the
> >> > > > > > >> > > > > >> > >> idea that the instantiation code runs on a
> >> > client
> >> > > > > > >> somewhere,
> >> > > > > > >> > > > which
> >> > > > > > >> > > > > >> then
> >> > > > > > >> > > > > >> > >> pokes the necessary endpoints on
> >> > > YARN/Mesos/AWS/etc?
> >> > > > > Or
> >> > > > > > >> does
> >> > > > > > >> > > that
> >> > > > > > >> > > > > >> code
> >> > > > > > >> > > > > >> > run
> >> > > > > > >> > > > > >> > >> on each container that is part of the job
> (in
> >> > > which
> >> > > > > > case,
> >> > > > > > >> how
> >> > > > > > >> > > > does
> >> > > > > > >> > > > > >> the
> >> > > > > > >> > > > > >> > job
> >> > > > > > >> > > > > >> > >> submission to the cluster work)?
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel
> >> right to
> >> > > > make
> >> > > > > a
> >> > > > > > 1.0
> >> > > > > > >> > > > release
> >> > > > > > >> > > > > >> > with a
> >> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete. So
> if
> >> > this
> >> > > > is
> >> > > > > > going
> >> > > > > > >> > to
> >> > > > > > >> > > > > >> happen, I
> >> > > > > > >> > > > > >> > >> think it would be more honest to stick with
> >> 0.*
> >> > > > > version
> >> > > > > > >> > numbers
> >> > > > > > >> > > > > until
> >> > > > > > >> > > > > >> > the
> >> > > > > > >> > > > > >> > >> library-ified Samza has been implemented,
> is
> >> > > stable
> >> > > > > and
> >> > > > > > >> > widely
> >> > > > > > >> > > > > used.
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of
> >> Kafka?
> >> > > There
> >> > > > > is
> >> > > > > > >> > > precedent
> >> > > > > > >> > > > > for
> >> > > > > > >> > > > > >> > >> tight coupling between different Apache
> >> projects
> >> > > > (e.g.
> >> > > > > > >> > Curator
> >> > > > > > >> > > > and
> >> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think
> >> > > remaining
> >> > > > > > >> separate
> >> > > > > > >> > > > would
> >> > > > > > >> > > > > >> be
> >> > > > > > >> > > > > >> > ok.
> >> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka,
> >> there
> >> > > is
> >> > > > > > enough
> >> > > > > > >> > > > > substance
> >> > > > > > >> > > > > >> in
> >> > > > > > >> > > > > >> > >> Samza that it warrants being a separate
> >> project.
> >> > > An
> >> > > > > > >> argument
> >> > > > > > >> > in
> >> > > > > > >> > > > > >> favour
> >> > > > > > >> > > > > >> > of
> >> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a
> much
> >> > > > stronger
> >> > > > > > >> "brand
> >> > > > > > >> > > > > >> presence"
> >> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If
> >> the
> >> > > Kafka
> >> > > > > > >> project
> >> > > > > > >> > is
> >> > > > > > >> > > > > >> willing
> >> > > > > > >> > > > > >> > to
> >> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of
> doing
> >> > > > stateful
> >> > > > > > >> stream
> >> > > > > > >> > > > > >> > >> transformations, that would probably have
> >> much
> >> > the
> >> > > > > same
> >> > > > > > >> > effect
> >> > > > > > >> > > as
> >> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream
> >> Processors"
> >> > or
> >> > > > > > suchlike.
> >> > > > > > >> > > Close
> >> > > > > > >> > > > > >> > >> collaboration between the two projects will
> >> be
> >> > > > needed
> >> > > > > in
> >> > > > > > >> any
> >> > > > > > >> > > > case.
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> From a project management perspective, I
> >> guess
> >> > the
> >> > > > > "new
> >> > > > > > >> > Samza"
> >> > > > > > >> > > > > would
> >> > > > > > >> > > > > >> > have
> >> > > > > > >> > > > > >> > >> to be developed on a branch alongside
> ongoing
> >> > > > > > maintenance
> >> > > > > > >> of
> >> > > > > > >> > > the
> >> > > > > > >> > > > > >> current
> >> > > > > > >> > > > > >> > >> line of development? I think it would be
> >> > important
> >> > > > to
> >> > > > > > >> > continue
> >> > > > > > >> > > > > >> > supporting
> >> > > > > > >> > > > > >> > >> existing users, and provide a graceful
> >> migration
> >> > > > path
> >> > > > > to
> >> > > > > > >> the
> >> > > > > > >> > > new
> >> > > > > > >> > > > > >> > version.
> >> > > > > > >> > > > > >> > >> Leaving the current versions unsupported
> and
> >> > > forcing
> >> > > > > > people
> >> > > > > > >> > to
> >> > > > > > >> > > > > >> rewrite
> >> > > > > > >> > > > > >> > >> their jobs would send a bad signal.
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> Best,
> >> > > > > > >> > > > > >> > >> Martin
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <
> >> > > > [email protected]>
> >> > > > > > >> wrote:
> >> > > > > > >> > > > > >> > >>
> >> > > > > > >> > > > > >> > >>> Hey Garry,
> >> > > > > > >> > > > > >> > >>>
> >> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be
> happy
> >> to
> >> > > chat
> >> > > > > > more
> >> > > > > > >> > about
> >> > > > > > >> > > > > this
> >> > > > > > >> > > > > >> if
> >> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I
> >> > started
> >> > > > with
> >> > > > > > the
> >> > > > > > >> > idea
> >> > > > > > >> > > > of
> >> > > > > > >> > > > > >> "what
> >> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass
> >> > ingestion
> >> > > > > tool"
> >> > > > > > but
> >> > > > > > >> > > > > >> ultimately
> >> > > > > > >> > > > > >> > we
> >> > > > > > >> > > > > >> > >>> kind of came around to the idea that
> >> ingestion
> >> > > and
> >> > > > > > >> > > > transformation
> >> > > > > > >> > > > > >> had
> >> > > > > > >> > > > > >> > >>> pretty different needs and coupling the
> two
> >> > made
> >> > > > > things
> >> > > > > > >> > hard.
> >> > > > > > >> > > > > >> > >>>
> >> > > > > > >> > > > > >> > >>> For what it's worth I think copycat
> (KIP-26)
> >> > > > actually
> >> > > > > > will
> >> > > > > > >> > do
> >> > > > > > >> > > > what
> >> > > > > > >> > > > > >> you
> >> > > > > > >> > > > > >> > >> are
> >> > > > > > >> > > > > >> > >>> looking for.
> >> > > > > > >> > > > > >> > >>>
> >> > > > > > >> > > > > >> > >>> With regard to your point about Slider, I
> >> don't
> >> > > > > > >> necessarily
> >> > > > > > >> > > > > >> disagree.
> >> > > > > > >> > > > > >> > >> But I
> >> > > > > > >> > > > > >> > >>> think getting good YARN support is quite
> >> doable
> >> > > > and I
> >> > > > > > >> think
> >> > > > > > >> > we
> >> > > > > > >> > > > can
> >> > > > > > >> > > > > >> make
> >> > > > > > >> > > > > >> > >>> that work well. I think the issue this
> >> proposal
> >> > > > > solves
> >> > > > > > is
> >> > > > > > >> > that
> >> > > > > > >> > > > > >> > >> technically
> >> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple
> >> cluster
> >> > > > > > management
> >> > > > > > >> > > systems
> >> > > > > > >> > > > > the
> >> > > > > > >> > > > > >> > way
> >> > > > > > >> > > > > >> > >>> things are now, you need to write an "app
> >> > master"
> >> > > > or
> >> > > > > > >> > > "framework"
> >> > > > > > >> > > > > for
> >> > > > > > >> > > > > >> > each
> >> > > > > > >> > > > > >> > >>> and they are all a little different so
> >> testing
> >> > is
> >> > > > > > really
> >> > > > > > >> > hard.
> >> > > > > > >> > > > In
> >> > > > > > >> > > > > >> the
> >> > > > > > >> > > > > >> > >>> absence of this we have been stuck with
> just
> >> > YARN
> >> > > > > which
> >> > > > > > >> has
> >> > > > > > >> > > > > >> fantastic
> >> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the
> org,
> >> but
> >> > > > zero
> >> > > > > > >> > > penetration
> >> > > > > > >> > > > > >> > >> elsewhere.
> >> > > > > > >> > > > > >> > >>> Given the huge amount of work being put in
> >> to
> >> > > > Slider,
> >> > > > > > >> > > Marathon,
> >> > > > > AWS
> >> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen
> related
> >> > > > packaging
> >> > > > > > >> > > > technologies
> >> > > > > > >> > > > > >> > people
> >> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various
> >> > > > > cloud-specific
> >> > > > > > >> > deploy
> >> > > > > > >> > > > > >> tools,
> >> > > > > > >> > > > > >> > >> etc)
> >> > > > > > >> > > > > >> > >>> I really think it is important to get this
> >> > right.
> >> > > > > > >> > > > > >> > >>>
> >> > > > > > >> > > > > >> > >>> -Jay
> >> > > > > > >> > > > > >> > >>>
> >> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry
> >> > Turkington
> >> > > <
> >> > > > > > >> > > > > >> > >>> [email protected]> wrote:
> >> > > > > > >> > > > > >> > >>>
> >> > > > > > >> > > > > >> > >>>> Hi all,
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> I think the question below re does Samza
> >> > become
> >> > > a
> >> > > > > > >> > sub-project
> >> > > > > > >> > > > of
> >> > > > > > >> > > > > >> Kafka
> >> > > > > > >> > > > > >> > >>>> highlights the broader point around
> >> migration.
> >> > > > Chris
> >> > > > > > >> > mentions
> >> > > > > > >> > > > > >> Samza's
> >> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1 release
> >> but
> >> > I'm
> >> > > > not
> >> > > > > > sure
> >> > > > > > >> > it
> >> > > > > > >> > > > > feels
> >> > > > > > >> > > > > >> > >> right to
> >> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to
> >> deprecate
> >> > > > most
> >> > > > > of
> >> > > > > > >> it.
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some
> guys
> >> > who
> >> > > > have
> >> > > > > > >> > started
> >> > > > > > >> > > > > >> working
> >> > > > > > >> > > > > >> > >> with
> >> > > > > > >> > > > > >> > >>>> Samza and building some new
> >> > consumers/producers
> >> > > > was
> >> > > > > > next
> >> > > > > > >> > up.
> >> > > > > > >> > > > > Sounds
> >> > > > > > >> > > > > >> > like
> >> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to
> >> go. I
> >> > > need
> >> > > > > to
> >> > > > > > >> look
> >> > > > > > >> > > into
> >> > > > > > >> > > > > the
> >> > > > > > >> > > > > >> > KIP
> >> > > > > > >> > > > > >> > >> in
> >> > > > > > >> > > > > >> > >>>> more detail but for me the attractiveness
> >> of
> >> > > > adding
> >> > > > > > new
> >> > > > > > >> > Samza
> >> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all
> they
> >> > were
> >> > > > > doing
> >> > > > > > was
> >> > > > > > >> > > > really
> >> > > > > > >> > > > > >> > getting
> >> > > > > > >> > > > > >> > >>>> data into and out of Kafka --  was to
> avoid
> >> > > > having
> >> > > > > to
> >> > > > > > >> > worry
> >> > > > > > >> > > > > about
> >> > > > > > >> > > > > >> the
> >> > > > > > >> > > > > >> > >>>> lifecycle management of external clients.
> >> If
> >> > > there
> >> > > > > is
> >> > > > > > a
> >> > > > > > >> > > generic
> >> > > > > > >> > > > > >> Kafka
> >> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a
> new
> >> > > > connector
> >> > > > > > into
> >> > > > > > >> > and
> >> > > > > > >> > > > > have
> >> > > > > > >> > > > > >> a
> >> > > > > > >> > > > > >> > >> lot of
> >> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and
> reliability
> >> > done
> >> > > > for
> >> > > > > me
> >> > > > > > >> then
> >> > > > > > >> > > it
> >> > > > > > >> > > > > >> gives
> >> > > > > > >> > > > > >> > me
> >> > > > > > >> > > > > >> > >> all
> >> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers
> would.
> >> If
> >> > > not
> >> > > > > > then it
> >> > > > > > >> > > > > >> complicates
> >> > > > > > >> > > > > >> > my
> >> > > > > > >> > > > > >> > >>>> operational deployments.
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> Which is similar to my other question
> with
> >> the
> >> > > > > > proposal
> >> > > > > > >> --
> >> > > > > > >> > if
> >> > > > > > >> > > > we
> >> > > > > > >> > > > > >> > build a
> >> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus
> the
> >> > > > requisite
> >> > > > > > >> shims
> >> > > > > > >> > to
> >> > > > > > >> > > > > >> > integrate
> >> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former may
> >> be a
> >> > > lot
> >> > > > > more
> >> > > > > > >> work
> >> > > > > > >> > > > than
> >> > > > > > >> > > > > we
> >> > > > > > >> > > > > >> > >> think.
> >> > > > > > >> > > > > >> > >>>> We may make it much easier for a newcomer
> >> to
> >> > get
> >> > > > > > >> something
> >> > > > > > >> > > > > running
> >> > > > > > >> > > > > >> but
> >> > > > > > >> > > > > >> > >>>> having them step up and get a reliable
> >> > > production
> >> > > > > > >> > deployment
> >> > > > > > >> > > > may
> >> > > > > > >> > > > > >> still
> >> > > > > > >> > > > > >> > >>>> dominate mailing list  traffic, if for
> >> > different
> >> > > > > > reasons
> >> > > > > > >> > than
> >> > > > > > >> > > > > >> today.
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable
> with
> >> > > making
> >> > > > > the
> >> > > > > > >> Samza
> >> > > > > > >> > > > > >> dependency
> >> > > > > > >> > > > > >> > >> on
> >> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely
> >> see
> >> > > the
> >> > > > > > >> benefits
> >> > > > > > >> > > in
> >> > > > > > >> > > > > the
> >> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing
> >> > > > > > >> > > > terminologies/abstractions
> >> > > > > > >> > > > > >> that
> >> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library
> >> would
> >> > > > likely
> >> > > > > > be a
> >> > > > > > >> > very
> >> > > > > > >> > > > > nice
> >> > > > > > >> > > > > >> > tool
> >> > > > > > >> > > > > >> > >> to
> >> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have
> the
> >> > > > concerns
> >> > > > > > >> above
> >> > > > > > >> > re
> >> > > > > > >> > > > the
> >> > > > > > >> > > > > >> > >>>> operational side.
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> Garry
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> -----Original Message-----
> >> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales
> >> [mailto:
> >> > > > > > >> > [email protected]
> >> > > > > > >> > > ]
> >> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
> >> > > > > > >> > > > > >> > >>>> To: [email protected]
> >> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on
> >> > Samza
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> Very interesting thoughts.
> >> > > > > > >> > > > > >> > >>>> From outside, I have always perceived
> Samza
> >> > as a
> >> > > > > > >> computing
> >> > > > > > >> > > > layer
> >> > > > > > >> > > > > >> over
> >> > > > > > >> > > > > >> > >>>> Kafka.
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is
> >> > > "should
> >> > > > > > Samza
> >> > > > > > >> be
> >> > > > > > >> > a
> >> > > > > > >> > > > > >> > sub-project
> >> > > > > > >> > > > > >> > >>>> of Kafka then?"
> >> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a
> >> separate
> >> > > > > project
> >> > > > > > >> > with a
> >> > > > > > >> > > > > >> separate
> >> > > > > > >> > > > > >> > >>>> governance?
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> Cheers,
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> --
> >> > > > > > >> > > > > >> > >>>> Gianmarco
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <
> >> > > > > > [email protected]>
> >> > > > > > >> > > > wrote:
> >> > > > > > >> > > > > >> > >>>>
> >> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka
> more
> >> > > > tightly.
> >> > > > > > >> > Because
> >> > > > > > >> > > > > Samza
> >> > > > > > >> > > > > >> de
> >> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should
> >> > leverage
> >> > > > > what
> >> > > > > > >> Kafka
> >> > > > > > >> > > > has.
> >> > > > > > >> > > > > At
> >> > > > > > >> > > > > >> > the
> >> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to
> reinvent
> >> > what
> >> > > > > Samza
> >> > > > > > >> > > already
> >> > > > > > >> > > > > >> has. I
> >> > > > > > >> > > > > >> > >>>>> also like the idea of separating the
> >> > ingestion
> >> > > > and
> >> > > > > > >> > > > > transformation.
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to
> >> imagine
> >> > > how
> >> > > > > the
> >> > > > > > >> Samza
> >> > > > > > >> > > > will
> >> > > > > > >> > > > > >> look
> >> > > > > > >> > > > > >> > >>>> like.
> >> > > > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little
> >> > > difference
> >> > > > > in
> >> > > > > > >> terms
> >> > > > > > >> > > of
> >> > > > > > >> > > > > how
> >> > > > > > >> > > > > >> > >>>>> Samza should look like.
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code
> >> shows
> >> > (A
> >> > > > > > client of
> >> > > > > > >> > > > Kafka)
> >> > > > > > >> > > > > ?
> >> > > > > > >> > > > > >> And
> >> > > > > > >> > > > > >> > >>>>> user's application code calls this
> client?
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of
> Kafka
> >> > (like
> >> > > > > what
> >> > > > > > the
> >> > > > > > >> > > code
> >> > > > > > >> > > > > >> shows),
> >> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and
> >> > > > > fault-tolerance?
> >> > > > > > >> Are
> >> > > > > > >> > > they
> >> > > > > > >> > > > > >> taken
> >> > > > > > >> > > > > >> > >>>>> care of by the Kafka broker or other
> >> mechanism,
> >> > > such
> >> > > > > as
> >> > > > > > >> > "Samza
> >> > > > > > >> > > > > >> worker"
> >> > > > > > >> > > > > >> > >>>>> (just make up the name) ?
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as
> >> > > > auto-scaling,
> >> > > > > > >> shared
> >> > > > > > >> > > > > state,
> >> > > > > > >> > > > > >> > >>>>> monitoring?
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is
> this
> >> > what
> >> > > > > Chris
> >> > > > > > >> > > > suggests?)
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from
> Kafka
> >> > and
> >> > > > > > produce
> >> > > > > > >> to
> >> > > > > > >> > > it.
> >> > > > > > >> > > > > >> Then it
> >> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks
> like
> >> > now,
> >> > > > > > except it
> >> > > > > > >> > > does
> >> > > > > > >> > > > > not
> >> > > > > > >> > > > > >> > rely
> >> > > > > > >> > > > > >> > >>>>> on Yarn anymore.
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it
> >> leverage
> >> > > > Kafka's
> >> > > > > > >> > metrics,
> >> > > > > > >> > > > > logs,
> >> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> Thanks,
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> Fang, Yan
> >> > > > > > >> > > > > >> > >>>>> [email protected]
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang
> >> > Wang <
> >> > > > > > >> > > > > [email protected]
> >> > > > > > >> > > > > >> >
> >> > > > > > >> > > > > >> > >>>> wrote:
> >> > > > > > >> > > > > >> > >>>>>
> >> > > > > > >> > > > > >> > >>>>>> Read through the code example and it
> >> looks
> >> > > good
> >> > > > to
> >> > > > > > me.
> >> > > > > > >> A
> >> > > > > > >> > > few
> >> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment:
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable
> >> runnable
> >> > > like:
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=... --config-path=file://...
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> And this proposal advocates for
> deploying
> >> > Samza
> >> > > > > more
> >> > > > > > as
> >> > > > > > >> > > > embedded
> >> > > > > > >> > > > > >> > >>>>>> libraries in user application code
> >> (ignoring
> >> > > the
> >> > > > > > >> > > terminology
> >> > > > > > >> > > > > >> since
> >> > > > > > >> > > > > >> > >>>>>> it is not the
> >> > > > > > >> > > > > >> > >>>>> same
> >> > > > > > >> > > > > >> > >>>>>> as the prototype code):
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> StreamTask task = new MyStreamTask(configs);
> >> > > > > > >> > > > > >> > >>>>>> Thread thread = new Thread(task);
> >> > > > > > >> > > > > >> > >>>>>> thread.start();
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> I think both of these deployment modes
> >> are
> >> > > > > important
> >> > > > > > >> for
> >> > > > > > >> > > > > >> different
> >> > > > > > >> > > > > >> > >>>>>> types
> >> > > > > > >> > > > > >> > >>>>> of
> >> > > > > > >> > > > > >> > >>>>>> users. That said, I think making Samza
> >> > purely
> >> > > > > > >> standalone
> >> > > > > > >> > is
> >> > > > > > >> > > > > still
> >> > > > > > >> > > > > >> > >>>>>> sufficient for either runnable or
> library
> >> > > modes.
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> Guozhang
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay
> >> Kreps
> >> > <
> >> > > > > > >> > > > [email protected]>
> >> > > > > > >> > > > > >> > wrote:
> >> > > > > > >> > > > > >> > >>>>>>
> >> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code
> >> example,
> >> > it
> >> > > > was
> >> > > > > > >> > supposed
> >> > > > > > >> > > > to
> >> > > > > > >> > > > > >> look
> >> > > > > > >> > > > > >> > >>>>>>> like
> >> > > > > > >> > > > > >> > >>>>>>> this:
> >> > > > > > >> > > > > >> > >>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties();
> >> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers", "localhost:4242");
> >> > > > > > >> > > > > >> > >>>>>>> StreamingConfig config = new StreamingConfig(props);
> >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> >> > > > > > >> > > > > >> > >>>>>>> config.processor(ExampleStreamProcessor.class);
> >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> >> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> >> > > > > > >> > > > > >> > >>>>>>> container.run();
> >> > > > > > >> > > > > >> > >>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>> -Jay
> >> > > > > > >> > > > > >> > >>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay
> >> > Kreps <
> >> > > > > > >> > > > [email protected]
> >> > > > > > >> > > > > >
> >> > > > > > >> > > > > >> > >>>> wrote:
> >> > > > > > >> > > > > >> > >>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> Hey guys,
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations
> >> Chris
> >> > > and
> >> > > > I
> >> > > > > > were
> >> > > > > > >> > > having
> >> > > > > > >> > > > > >> > >>>>>>>> around
> >> > > > > > >> > > > > >> > >>>>>>> whether
> >> > > > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a
> >> kind
> >> > > of
> >> > > > > data
> >> > > > > > >> > > > ingestion
> >> > > > > > >> > > > > >> > >>>>> framework
> >> > > > > > >> > > > > >> > >>>>>>> for
> >> > > > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately led to
> KIP-26
> >> > > > > "copycat").
> >> > > > > > >> This
> >> > > > > > >> > > > kind
> >> > > > > > >> > > > > of
> >> > > > > > >> > > > > >> > >>>>>> combined
> >> > > > > > >> > > > > >> > >>>>>>>> with complaints around config and
> YARN
> >> and
> >> > > the
> >> > > > > > >> > discussion
> >> > > > > > >> > > > > >> around
> >> > > > > > >> > > > > >> > >>>>>>>> how
> >> > > > > > >> > > > > >> > >>>>> to
> >> > > > > > >> > > > > >> > >>>>>>>> best do a standalone mode.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given
> >> that
> >> > > > Samza
> >> > > > > > was
> >> > > > > > >> > > > basically
> >> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what
> if
> >> > you
> >> > > > just
> >> > > > > > >> > embraced
> >> > > > > > >> > > > > that
> >> > > > > > >> > > > > >> > >>>>>>>> and turned it
> >> > > > > > >> > > > > >> > >>>>>> into
> >> > > > > > >> > > > > >> > >>>>>>>> something less like a heavyweight
> >> > framework
> >> > > > and
> >> > > > > > more
> >> > > > > > >> > > like a
> >> > > > > > >> > > > > >> > >>>>>>>> third
> >> > > > > > >> > > > > >> > >>>>> Kafka
> >> > > > > > >> > > > > >> > >>>>>>>> client--a kind of "producing
> consumer"
> >> > with
> >> > > > > state
> >> > > > > > >> > > > management
> >> > > > > > >> > > > > >> > >>>>>> facilities.
> >> > > > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a
> >> complex
> >> > > > stream
> >> > > > > > >> > > processing
> >> > > > > > >> > > > > >> > >>>>>>>> framework
> >> > > > > > >> > > > > >> > >>>>>>> this
> >> > > > > > >> > > > > >> > >>>>>>>> would actually be a very simple
> thing,
> >> not
> >> > > > much
> >> > > > > > more
> >> > > > > > >> > > > > >> complicated
> >> > > > > > >> > > > > >> > >>>>>>>> to
> >> > > > > > >> > > > > >> > >>>>> use
> >> > > > > > >> > > > > >> > >>>>>>> or
> >> > > > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As
> Chris
> >> > said
> >> > > > we
> >> > > > > > >> thought
> >> > > > > > >> > > > about
> >> > > > > > >> > > > > >> it
> >> > > > > > >> > > > > >> > >>>>>>>> a
> >> > > > > > >> > > > > >> > >>>>> lot
> >> > > > > > >> > > > > >> > >>>>>> of
> >> > > > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream
> >> > processing
> >> > > > > > systems
> >> > > > > > >> > were
> >> > > > > > >> > > > > doing)
> >> > > > > > >> > > > > >> > >>>>> seemed
> >> > > > > > >> > > > > >> > >>>>>>> like
> >> > > > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output
> >> data
> >> > to
> >> > > > and
> >> > > > > > from
> >> > > > > > >> > the
> >> > > > > > >> > > > > stream
> >> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually
> looked
> >> > into
> >> > > > how
> >> > > > > > that
> >> > > > > > >> > > would
> >> > > > > > >> > > > > >> > >>>>>>>> work,
> >> > > > > > >> > > > > >> > >>>>> Samza
> >> > > > > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion
> >> > > framework
> >> > > > > > for a
> >> > > > > > >> > > bunch
> >> > > > > > >> > > > of
> >> > > > > > >> > > > > >> > >>>>> reasons.
> >> > > > > > >> > > > > >> > >>>>>> To
> >> > > > > > >> > > > > >> > >>>>>>>> really do that right you need a
> pretty
> >> > > > different
> >> > > > > > >> > internal
> >> > > > > > >> > > > > data
> >> > > > > > >> > > > > >> > >>>>>>>> model
> >> > > > > > >> > > > > >> > >>>>>> and
> >> > > > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split
> them
> >> and
> >> > > had
> >> > > > > an
> >> > > > > > api
> >> > > > > > >> > for
> >> > > > > > >> > > > > Kafka
> >> > > > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26)
> >> and a
> >> > > > > separate
> >> > > > > > >> api
> >> > > > > > >> > > for
> >> > > > > > >> > > > > >> Kafka
> >> > > > > > >> > > > > >> > >>>>>>>> transformation (Samza).
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> This would also allow really
> embracing
> >> the
> >> > > > same
> >> > > > > > >> > > terminology
> >> > > > > > >> > > > > and
> >> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about the
> >> > current
> >> > > > > > state is
> >> > > > > > >> > > that
> >> > > > > > >> > > > > the
> >> > > > > > >> > > > > >> > >>>>>>>> two
> >> > > > > > >> > > > > >> > >>>>>>> systems
> >> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology
> >> like
> >> > > > > "stream"
> >> > > > > > vs
> >> > > > > > >> > > > "topic"
> >> > > > > > >> > > > > >> and
> >> > > > > > >> > > > > >> > >>>>>>> different
> >> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems means
> you
> >> > kind
> >> > > > of
> >> > > > > > have
> >> > > > > > >> to
> >> > > > > > >> > > > learn
> >> > > > > > >> > > > > >> > >>>>>>>> Kafka's
> >> > > > > > >> > > > > >> > >>>>>>> way,
> >> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different
> >> way,
> >> > > > then
> >> > > > > > kind
> >> > > > > > >> of
> >> > > > > > >> > > > > >> > >>>>>>>> understand
> >> > > > > > >> > > > > >> > >>>>> how
> >> > > > > > >> > > > > >> > >>>>>>> they
> >> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having
> walked
> >> a
> >> > few
> >> > > > > > people
> >> > > > > > >> > > through
> >> > > > > > >> > > > > >> this
> >> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to
> >> get.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of
> >> time
> >> > on
> >> > > > > > >> airplanes I
> >> > > > > > >> > > > > hacked
> >> > > > > > >> > > > > >> > >>>>>>>> up an earnest but still somewhat
> >> incomplete
> >> > > > > > prototype
> >> > > > > > >> of
> >> > > > > > >> > > > what
> >> > > > > > >> > > > > >> > >>>>>>>> this would
> >> > > > > > >> > > > > >> > >>>>> look
> >> > > > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously
> >> dumped
> >> > > into
> >> > > > > > Kafka
> >> > > > > > >> as
> >> > > > > > >> > > it
> >> > > > > > >> > > > > >> > >>>>>>>> required a
> >> > > > > > >> > > > > >> > >>>>>> few
> >> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is
> >> the
> >> > > code:
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I
> just
> >> > > > > liberally
> >> > > > > > >> > renamed
> >> > > > > > >> > > > > >> > >>>>>>>> everything
> >> > > > > > >> > > > > >> > >>>>> to
> >> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no
> >> regard
> >> > > for
> >> > > > > > >> > > > compatibility.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> To use this would be something like
> >> this:
> >> > > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
> >> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers",
> >> > > > > "localhost:4242");
> >> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new
> >> > > > > > >> > > > > >> > >>>>> StreamingConfig(props);
> >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
> >> > > > > > >> > > > > >> > >>>>>>>> "test-topic-2");
> >> > > > > > >> > > > > >> config.processor(ExampleStreamProcessor.class);
> >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
> >> > > > > > >> > > > > >> > >>>>>>>> StringSerializer(), new
> >> > > StringDeserializer());
> >> > > > > > >> > > > KafkaStreaming
> >> > > > > > >> > > > > >> > >>>>>> container =
> >> > > > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config);
> >> > container.run();
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
> >> > > > SamzaContainer;
> >> > > > > > >> > > > > StreamProcessor
> >> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the class
> >> names
> >> > > in
> >> > > > a
> >> > > > > > file
> >> > > > > > >> > and
> >> > > > > > >> > > > then
> >> > > > > > >> > > > > >> > >>>>>>>> having
> >> > > > > > >> > > > > >> > >>>>>> the
> >> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just
> >> > > > > instantiate
> >> > > > > > the
> >> > > > > > >> > > > > container
> >> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced
> over
> >> > > > however
> >> > > > > > many
> >> > > > > > >> > > > > instances
> >> > > > > > >> > > > > >> > >>>>>>>> of
> >> > > > > > >> > > > > >> > >>>>> this
> >> > > > > > >> > > > > >> > >>>>>>> are
> >> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an
> instance
> >> > dies,
> >> > > > new
> >> > > > > > >> tasks
> >> > > > > > >> > > are
> >> > > > > > >> > > > > >> added
> >> > > > > > >> > > > > >> > >>>>>>>> to
> >> > > > > > >> > > > > >> > >>>>> the
> >> > > > > > >> > > > > >> > >>>>>>>> existing containers without shutting
> >> them
> >> > > > down).
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for
> running
> >> > this
> >> > > > > stuff
> >> > > > > > in
> >> > > > > > >> > YARN
> >> > > > > > >> > > > via
> >> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS
> >> using
> >> > > some
> >> > > > > of
> >> > > > > > >> their
> >> > > > > > >> > > > tools
> >> > > > > > >> > > > > >> > >>>>>>>> but from the
> >> > > > > > >> > > > > >> > >>>>>> point
> >> > > > > > >> > > > > >> > >>>>>>> of
> >> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these stream
> >> > > > processing
> >> > > > > > jobs
> >> > > > > > >> > are
> >> > > > > > >> > > > > just
> >> > > > > > >> > > > > >> > >>>>>> stateless
> >> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and
> >> expand
> >> > and
> >> > > > > > contract
> >> > > > > > >> > at
> >> > > > > > >> > > > > will.
> >> > > > > > >> > > > > >> > >>>>>>>> There
> >> > > > > > >> > > > > >> > >>>>> is
> >> > > > > > >> > > > > >> > >>>>>>> no
> >> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code,
> it
> >> > would
> >> > > > get
> >> > > > > > >> larger
> >> > > > > > >> > > if
> >> > > > > > >> > > > we
> >> > > > > > >> > > > > >> > >>>>>>>>  productionized but not vastly
> larger.
> >> We
> >> > > > really
> >> > > > > > do
> >> > > > > > >> > get a
> >> > > > > > >> > > > ton
> >> > > > > > >> > > > > >> > >>>>>>>> of
> >> > > > > > >> > > > > >> > >>>>>>> leverage
> >> > > > > > >> > > > > >> > >>>>>>>>  out of Kafka.
> >> > > > > > >> > > > > >> > >>>>>>>>  2. Partition management is fully
> >> > delegated
> >> > > to
> >> > > > > the
> >> > > > > > >> new
> >> > > > > > >> > > > > >> consumer.
> >> > > > > > >> > > > > >> > >>>>> This
> >> > > > > > >> > > > > >> > >>>>>>>>  is nice since now any partition
> >> > management
> >> > > > > > strategy
> >> > > > > > >> > > > > available
> >> > > > > > >> > > > > >> > >>>>>>>> to
> >> > > > > > >> > > > > >> > >>>>>> Kafka
> >> > > > > > >> > > > > >> > >>>>>>>>  consumer is also available to Samza
> >> (and
> >> > > vice
> >> > > > > > versa)
> >> > > > > > >> > and
> >> > > > > > >> > > > > with
> >> > > > > > >> > > > > >> > >>>>>>>> the
> >> > > > > > >> > > > > >> > >>>>>>> exact
> >> > > > > > >> > > > > >> > >>>>>>>>  same configs.
> >> > > > > > >> > > > > >> > >>>>>>>>  3. It supports state as well as
> state
> >> > reuse
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is
> >> > thought
> >> > > > > > >> provoking.
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> -Jay
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM,
> Chris
> >> > > > > Riccomini <
> >> > > > > > >> > > > > >> > >>>>>> [email protected]>
> >> > > > > > >> > > > > >> > >>>>>>>> wrote:
> >> > > > > > >> > > > > >> > >>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Hey all,
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with
> Samza
> >> > > > > engineers
> >> > > > > > at
> >> > > > > > >> > > > LinkedIn
> >> > > > > > >> > > > > >> > >>>>>>>>> and
> >> > > > > > >> > > > > >> > >>>>>>> Confluent
> >> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few
> observations
> >> > and
> >> > > > > would
> >> > > > > > >> like
> >> > > > > > >> > to
> >> > > > > > >> > > > > >> > >>>>>>>>> propose
> >> > > > > > >> > > > > >> > >>>>> some
> >> > > > > > >> > > > > >> > >>>>>>>>> changes.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I
> >> want to
> >> > > > call
> >> > > > > > out
> >> > > > > > >> > about
> >> > > > > > >> > > > > >> > >>>>>>>>> Samza's
> >> > > > > > >> > > > > >> > >>>>>> design,
> >> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some
> changes.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic
> >> > > > deployment
> >> > > > > > >> system.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Samza's
> >> SystemConsumer/SystemProducer
> >> > and
> >> > > > > > Kafka's
> >> > > > > > >> > > > consumer
> >> > > > > > >> > > > > >> > >>>>>>>>> APIs
> >> > > > > > >> > > > > >> > >>>>> are
> >> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same
> >> > problems.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are
> related,
> >> > but
> >> > > > I'll
> >> > > > > > >> > address
> >> > > > > > >> > > > them
> >> > > > > > >> > > > > >> in
> >> > > > > > >> > > > > >> > >>>>> order.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Deployment
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use
> of a
> >> > > > dynamic
> >> > > > > > >> > > deployment
> >> > > > > > >> > > > > >> > >>>>>>>>> scheduler
> >> > > > > > >> > > > > >> > >>>>>> such
> >> > > > > > >> > > > > >> > >>>>>>>>> as
> >> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially
> >> built
> >> > > > > Samza,
> >> > > > > > we
> >> > > > > > >> > bet
> >> > > > > > >> > > > that
> >> > > > > > >> > > > > >> > >>>>>>>>> there
> >> > > > > > >> > > > > >> > >>>>>> would
> >> > > > > > >> > > > > >> > >>>>>>>>> be
> >> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and
> >> we
> >> > > could
> >> > > > > > >> support
> >> > > > > > >> > > > them,
> >> > > > > > >> > > > > >> and
> >> > > > > > >> > > > > >> > >>>>>>>>> the
> >> > > > > > >> > > > > >> > >>>>>> rest
> >> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are
> >> many
> >> > > > > > >> variations.
> >> > > > > > >> > > > > >> > >>>>>>>>> Furthermore,
> >> > > > > > >> > > > > >> > >>>>>> many
> >> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just start
> >> their
> >> > > > > > processors
> >> > > > > > >> > like
> >> > > > > > >> > > > > normal
> >> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional
> >> > > > deployment
> >> > > > > > >> scripts
> >> > > > > > >> > > > such
> >> > > > > > >> > > > > as
> >> > > > > > >> > > > > >> > >>>>>>>>> Fabric,
> >> > > > > > >> > > > > >> > >>>>>> Chef,
> >> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment
> >> system
> >> > > on
> >> > > > > > users
> >> > > > > > >> > makes
> >> > > > > > >> > > > the
> >> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really
> painful
> >> for
> >> > > > first
> >> > > > > > time
> >> > > > > > >> > > > users.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement
> >> was
> >> > > also
> >> > > > a
> >> > > > > > bit
> >> > > > > > >> of
> >> > > > > > >> > a
> >> > > > > > >> > > > > >> > >>>>>>>>> misfire
> >> > > > > > >> > > > > >> > >>>>>> because
> >> > > > > > >> > > > > >> > >>>>>>>>> of
> >> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding
> between
> >> > the
> >> > > > > > nature of
> >> > > > > > >> > > batch
> >> > > > > > >> > > > > >> jobs
> >> > > > > > >> > > > > >> > >>>>>>>>> and
> >> > > > > > >> > > > > >> > >>>>>>> stream
> >> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made a
> >> > > conscious
> >> > > > > > effort
> >> > > > > > >> to
> >> > > > > > >> > > > favor
> >> > > > > > >> > > > > >> > >>>>>>>>> the
> >> > > > > > >> > > > > >> > >>>>>> Hadoop
> >> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things,
> >> since
> >> > it
> >> > > > > worked
> >> > > > > > >> and
> >> > > > > > >> > > was
> >> > > > > > >> > > > > well
> >> > > > > > >> > > > > >> > >>>>>>> understood.
> >> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that
> >> batch
> >> > > jobs
> >> > > > > > have a
> >> > > > > > >> > > > definite
> >> > > > > > >> > > > > >> > >>>>>> beginning,
> >> > > > > > >> > > > > >> > >>>>>>>>> and
> >> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs
> don't
> >> > > > > (usually).
> >> > > > > > >> This
> >> > > > > > >> > > > leads
> >> > > > > > >> > > > > to
> >> > > > > > >> > > > > >> > >>>>>>>>> a
> >> > > > > > >> > > > > >> > >>>>> much
> >> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for
> stream
> >> > > > > processors.
> >> > > > > > >> You
> >> > > > > > >> > > > > >> basically
> >> > > > > > >> > > > > >> > >>>>>>>>> just
> >> > > > > > >> > > > > >> > >>>>>>> need
> >> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the
> >> processor,
> >> > and
> >> > > > > start
> >> > > > > > >> it.
> >> > > > > > >> > > The
> >> > > > > > >> > > > > way
> >> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's
> no
> >> > > concept
> >> > > > > of
> >> > > > > > a
> >> > > > > > >> > > cluster
> >> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always
> >> > > > > > >> > > > > >> > >>>>>> add
> >> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with
> >> coupling
> >> > > > Samza
> >> > > > > > with
> >> > > > > > >> a
> >> > > > > > >> > > > > >> scheduler
> >> > > > > > >> > > > > >> > >>>>>>>>> is
> >> > > > > > >> > > > > >> > >>>>>> that
> >> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to
> >> handle
> >> > > > > > deployment.
> >> > > > > > >> > > This
> >> > > > > > >> > > > > >> pulls
> >> > > > > > >> > > > > >> > >>>>>>>>> in a
> >> > > > > > >> > > > > >> > >>>>>>> bunch
> >> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration
> >> > > distribution
> >> > > > > > (config
> >> > > > > > >> > > > > stream),
> >> > > > > > >> > > > > >> > >>>>>>>>> shell
> >> > > > > > >> > > > > >> > >>>>>>> scrips
> >> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner),
> packaging
> >> > (all
> >> > > > the
> >> > > > > > .tgz
> >> > > > > > >> > > > stuff),
> >> > > > > > >> > > > > >> etc.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic deployment was to support data locality.
> >> > > > > > >> > > > > >> > >>>>>>>>> If you want to have locality, you need to put your processors close to the
> >> > > > > > >> > > > > >> > >>>>>>>>> data they're processing. Upon further investigation, though, this feature is
> >> > > > > > >> > > > > >> > >>>>>>>>> not that beneficial. There is some good discussion about some problems with
> >> > > > > > >> > > > > >> > >>>>>>>>> it on SAMZA-335. Again, we took the Map/Reduce path, but there are some
> >> > > > > > >> > > > > >> > >>>>>>>>> fundamental differences between HDFS and Kafka. HDFS has blocks, while Kafka
> >> > > > > > >> > > > > >> > >>>>>>>>> has partitions. This leads to less optimization potential with stream
> >> > > > > > >> > > > > >> > >>>>>>>>> processors on top of Kafka.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch. Samza doesn't have any built-in
> >> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it depends on the dynamic deployment
> >> > > > > > >> > > > > >> > >>>>>>>>> scheduling system to handle restarts when a processor dies. This has made it
> >> > > > > > >> > > > > >> > >>>>>>>>> very difficult to write a standalone Samza container (SAMZA-516).
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Pluggability
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, but I think that we've gone too far with
> >> > > > > > >> > > > > >> > >>>>>>>>> it. Currently, Samza has:
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every component (MessageChooser,
> >> > > > > > >> > > > > >> > >>>>>>>>>   SystemStreamPartitionGrouper, ConfigRewriter, etc).
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've forgotten, as well. Some of these are useful,
> >> > > > > > >> > > > > >> > >>>>>>>>> but some have proven not to be. This all comes at a cost: complexity. This
> >> > > > > > >> > > > > >> > >>>>>>>>> complexity is making it harder for our users to pick up and use Samza out of
> >> > > > > > >> > > > > >> > >>>>>>>>> the box. It also makes it difficult for Samza developers to reason about the
> >> > > > > > >> > > > > >> > >>>>>>>>> characteristics of the container (since the characteristics change depending
> >> > > > > > >> > > > > >> > >>>>>>>>> on which plugins are used).
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most visible in the System APIs. What Samza
> >> > > > > > >> > > > > >> > >>>>>>>>> really requires to be functional is Kafka as its transport layer. But we've
> >> > > > > > >> > > > > >> > >>>>>>>>> conflated two unrelated use cases into one API:
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
> >> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports both of these use cases. The problem is, we
> >> > > > > > >> > > > > >> > >>>>>>>>> actually want different features for each use case. By papering over these
> >> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a single API, we've introduced a ton of leaky
> >> > > > > > >> > > > > >> > >>>>>>>>> abstractions.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in (2) is to have monotonically increasing
> >> > > > > > >> > > > > >> > >>>>>>>>> longs for offsets (like Kafka). This would be at odds with (1), though, since
> >> > > > > > >> > > > > >> > >>>>>>>>> different systems have different SCNs/Offsets/UUIDs/vectors. There was
> >> > > > > > >> > > > > >> > >>>>>>>>> discussion both on the mailing list and the SQL JIRAs about the need for this.
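
To make the offset mismatch above concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the function names and
the "scn:<n>" offset format are made up, and nothing here is Samza or Kafka
API. It shows why long offsets are easy to order while opaque offsets need
per-system logic and may be missing entirely.

```python
from typing import Callable, Optional


def newer_long_offset(a: int, b: int) -> int:
    """With monotonically increasing long offsets, ordering is integer comparison."""
    return max(a, b)


def newer_opaque_offset(
    a: Optional[str],
    b: Optional[str],
    compare: Callable[[str, str], int],
) -> Optional[str]:
    """With opaque offsets, ordering needs a system-specific comparator.

    A system with no offsets at all may hand back None, so that case has to
    be handled explicitly as well.
    """
    if a is None:
        return b
    if b is None:
        return a
    return a if compare(a, b) >= 0 else b


def compare_scn(a: str, b: str) -> int:
    """Hypothetical comparator for a made-up 'scn:<n>' offset format."""
    x, y = int(a.split(":")[1]), int(b.split(":")[1])
    return (x > y) - (x < y)
```

Note that plain string comparison would order "scn:9" after "scn:10", which
is the kind of per-system detail a single generic API ends up leaking.
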
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for replayability. Kafka allows us to rewind when
> >> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other systems don't. In some cases, systems return
> >> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g. WikipediaSystemConsumer) because they have no
> >> > > > > > >> > > > > >> > >>>>>>>>> offsets.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka supports partitioning, but many
> >> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by having a single partition for those systems.
> >> > > > > > >> > > > > >> > >>>>>>>>> Still, other systems model partitioning differently (e.g. Kinesis).
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess. Creating streams in a
> >> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost impossible. As is modeling metadata for the
> >> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor, partitions, location, etc). The list goes on.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Duplicate work
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing Samza, Kafka's consumer and producer APIs
> >> > > > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On the consumer side, you had two options:
> >> > > > > > >> > > > > >> > >>>>>>>>> use the high-level consumer, or the simple consumer. The problem with the
> >> > > > > > >> > > > > >> > >>>>>>>>> high-level consumer was that it controlled your offsets, partition
> >> > > > > > >> > > > > >> > >>>>>>>>> assignments, and the order in which you received messages. The problem with
> >> > > > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not simple. It's basic. You end up having to
> >> > > > > > >> > > > > >> > >>>>>>>>> handle a lot of really low-level stuff that you shouldn't. We spent a lot of
> >> > > > > > >> > > > > >> > >>>>>>>>> time making Samza's KafkaSystemConsumer very robust. It also allows us to
> >> > > > > > >> > > > > >> > >>>>>>>>> support some cool features:
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and prioritization.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition assignment to support joins, global state
> >> > > > > > >> > > > > >> > >>>>>>>>>   (if we want to implement it :)), etc.
> >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset checkpointing.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is that these features should actually be
> >> > > > > > >> > > > > >> > >>>>>>>>> in Kafka. A lot of Kafka consumers (not just Samza stream processors) end up
> >> > > > > > >> > > > > >> > >>>>>>>>> wanting to do things like joins and partition assignment. The Kafka community
> >> > > > > > >> > > > > >> > >>>>>>>>> has come to the same conclusion. They're adding a ton of upgrades into their
> >> > > > > > >> > > > > >> > >>>>>>>>> new Kafka consumer implementation. To a large extent, it's duplicate work to
> >> > > > > > >> > > > > >> > >>>>>>>>> what we've already done in Samza.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a very similar approach to Samza's
> >> > > > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation for handling offset checkpointing. Like
> >> > > > > > >> > > > > >> > >>>>>>>>> Samza, Kafka's new offset management feature stores offset checkpoints in a
> >> > > > > > >> > > > > >> > >>>>>>>>> topic, and allows you to fetch them from the broker.
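
As a rough picture of "offset checkpoints in a topic", here is a small
in-memory Python sketch. It is not how KafkaCheckpointManager or Kafka's
offset management is actually implemented; the class name, the key layout
and the list standing in for a topic are all assumptions made for
illustration. The idea it shows is an append-only log where the last record
written for a key wins.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key layout: (consumer group, topic, partition).
Key = Tuple[str, str, int]


class CheckpointLog:
    """In-memory stand-in for a checkpoint topic: an append-only list of records."""

    def __init__(self) -> None:
        self._records: List[Tuple[Key, int]] = []

    def commit(self, key: Key, offset: int) -> None:
        # Committing never overwrites; it appends a new record to the log.
        self._records.append((key, offset))

    def fetch(self, key: Key) -> Optional[int]:
        # Replay the whole log; the last record written for a key wins.
        latest: Dict[Key, int] = {}
        for record_key, offset in self._records:
            latest[record_key] = offset
        return latest.get(key)
```

A processor that restarts would call fetch() for each of its partitions and
resume from the returned offset, or from a default position when fetch()
returns None.
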
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we could have shared the work if it
> >> > > > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the get-go.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Vision
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather radical proposal. Samza is relatively stable
> >> > > > > > >> > > > > >> > >>>>>>>>> at this point. I'd venture to say that we're near a 1.0 release. I'd like to
> >> > > > > > >> > > > > >> > >>>>>>>>> propose that we take what we've learned, and begin thinking about Samza
> >> > > > > > >> > > > > >> > >>>>>>>>> beyond 1.0. What would we change if we were starting from scratch? My
> >> > > > > > >> > > > > >> > >>>>>>>>> proposal is to:
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run Samza processors, and
> >> > > > > > >> > > > > >> > >>>>>>>>>    eliminate all direct dependencies on YARN, Mesos, etc.
> >> > > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support only Kafka as the stream processing
> >> > > > > > >> > > > > >> > >>>>>>>>>    layer.
> >> > > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, serialization, and config systems,
> >> > > > > > >> > > > > >> > >>>>>>>>>    and simply use Kafka's instead.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that I outlined above. It should also shrink
> >> > > > > > >> > > > > >> > >>>>>>>>> the Samza code base pretty dramatically. Supporting only a standalone
> >> > > > > > >> > > > > >> > >>>>>>>>> container will allow Samza to be executed on YARN (using Slider), Mesos
> >> > > > > > >> > > > > >> > >>>>>>>>> (using Marathon/Aurora), or most other in-house deployment systems. This
> >> > > > > > >> > > > > >> > >>>>>>>>> should make life a lot easier for new users. Imagine having the hello-samza
> >> > > > > > >> > > > > >> > >>>>>>>>> tutorial without YARN. The drop in mailing list traffic will be pretty
> >> > > > > > >> > > > > >> > >>>>>>>>> dramatic.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long overdue to me. The reality is, everyone that
> >> > > > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We basically require it already in
> >> > > > > > >> > > > > >> > >>>>>>>>> order for most features to work. Those that are using other systems are
> >> > > > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into Kafka (1), and then they do the processing
> >> > > > > > >> > > > > >> > >>>>>>>>> on top. There is already discussion (
> >> > > > > > >> > > > > >> > >>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> >> > > > > > >> > > > > >> > >>>>>>>>> ) in Kafka to make ingesting into Kafka extremely easy.
> >> > > > > > >> > > > > >> > >>>>>>>>>
> >> > > > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with Kafka, we can leverage a ton of their
> >> > > > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to maintain our own config, metrics, etc. We can
> >> > > > > > >> > > > > >> > >>>>>>>>> all share the same libraries, and make them better. This will
> >> > > > > > >> > > > > >> >
> >> ...
> >>
> >> [Message clipped]
> >
> >
> >
>



-- 
Jordan Shaw
Full Stack Software Engineer
PubNub Inc
1045 17th St
San Francisco, CA 94107
