Re: Thoughts and obesrvations on Samza

Jay Kreps Sun, 12 Jul 2015 20:58:14 -0700

Hey guys,

There seems to be some confusion in the last few emails: there is no plan
whatsoever to remove YARN support. The change suggested was to move the
partition management out of the YARN app master and rely on Kafka's
partition management. The advantage of this would be to make the vast
majority of Samza totally cluster manager agnostic, and make it possible to
implement a high quality "stand-alone" mode and support other frameworks.
It is possible this will let us run in YARN without *any* samza-specific
code using something generic like Slider, or maybe not. I don't think
anyone has tried this so it is probably premature to say how much
YARN-specific code could be killed. If that works out, then a bunch of the
YARN-specific code will disappear, but this doesn't mean YARN support will
disappear, just that we would retain less code to implement the same
thing.  Either way Samza in YARN isn't going away.


-Jay

On Sun, Jul 12, 2015 at 7:48 PM, Garrett Barton <[email protected]>
wrote:

> Yi,
>
>  What you just summarized makes a whole lot more sense to me.  Shamelessly
> I am looking at this shift as a customer with a production workflow riding
> on it so I am looking for some kind of consistency into the future of
> Samza.  This makes me feel a lot better about it.
>
>
> Thank you!
>
> On Sun, Jul 12, 2015 at 10:44 PM, Yi Pan <[email protected]> wrote:
>
> > Just to make it explicitly clear what I am proposing, here is a version
> of
> > more detailed description:
> >
> > The fourth option (in addition to what Jakob summarized) we are proposing
> > is:
> >
> > - Recharter Samza to “stream processing as a service”
> >
> > - The current Samza core (the basic transformation API w/ basic partition
> > and offset management build-in) will be moved to Kafka Streams (i.e. part
> > of Kafka) and supports “run-as-a-library”
> >
> > - Deprecate the SystemConsumers and SystemProducers APIs and move them to
> > Copycat
> >
> > - The current SQL development:
> >
> >    * physical operators and a Trident-like stream API should stay in
> Kafka
> > Streams as libraries, enabling any standalone deployment to use the core
> > window/join functions
> >
> >    * query parser/planner and execution on top of a distributed service
> > should stay in new Samza (i.e. “stream processing as a service”)
> >
> > - Advanced features related to job scheduling/state management stays in
> new
> > Samza (i.e. “streaming processing as a service”)
> >
> >    * Any advanced PartitionManager implementation that can be plugged
> into
> > Kafka Streams
> >
> >    * Any auto-scaling, dynamic configuration via coordinator stream
> >
> >    * Any advanced state management s.t. host-affinity etc.
> >
> >
> > Pros:
> >
> > - W/ the current Samza core as Kafka Streams and move the ingestion to
> > Copycat, we achieved most of the goals in the initial proposal:
> >
> >    * Tighter coupling w/ Kafka
> >
> >    * Reuse Kafka’s build-in functionalities, such as offset manager,
> basic
> > partition distribution
> >
> >    * Separation of ingestion vs transformation APIs
> >
> >    * offload a lot of system-specific configuration to Kafka Streams and
> > Copycat (i.e. SystemFactory configure, serde configure, etc.)
> >
> >    * remove YARN dependency and make standalone deployment easy. As
> > Guozhang mentioned, it would be really easy to start a process that
> > internally run Kafka Streams as library.
> >
> > - By re-chartering Samza as “stream processing as a service”, we address
> > the concern regarding to
> >
> >    * Pluggable partition management
> >
> >    * Running in a distributed cluster to manage process lifecycle,
> > fault-tolerance, resource-allocation, etc.
> >
> >    * More advanced features s.t. host-affinity, auto-scaling, and dynamic
> > configure changes, etc.
> >
> >
> > Regarding to the code and community organization, I think the following
> may
> > be the best:
> >
> > Code:
> >
> > - A Kafka sub-project Kafka Streams to hold samza-core, samza-kv-store,
> and
> > the physical operator layer as library in SQL: this would allow better
> > alignment w/ Kafka, in code, doc, and branding
> >
> > - Retain the current Samza project just to keep
> >
> >    * A pluggable explicit partition management in Kafka Streams client
> >
> >    * Integration w/ cluster-management systems for advanced features:
> >
> >       * host-affinity, auto-scaling,, dynamic configuration, etc.
> >
> >    * It will fully depend on the Kafka Streams API and remove all support
> > for SystemConsumers/SystemProducers in the future
> >
> > Community: (this is almost the same as what Chris proposed)
> >
> > - Kafka Streams: the current Samza community should be supporting this
> > effort together with some Kafka members, since most of the code here will
> > be from samza-core, samza-kv-store, and samza-sql.
> >
> > - new Samza: the current Samza community should continue serve the course
> > to support more advanced features to run Kafka Streams as a service.
> > Arguably, the new Samza framework may be used to run Copycat workers as
> > well, at least to manage Copycat worker’s lifecycle in a clustered
> > environment. Hence, it would stay as a general stream processing
> framework
> > that takes in any source and output to any destination, just the
> transport
> > system is fixed to Kafka.
> >
> > On Sun, Jul 12, 2015 at 7:29 PM, Yi Pan <[email protected]> wrote:
> >
> > > Hi, Chris,
> > >
> > > Thanks for sending out this concrete set of points here. I agree w/ all
> > > but have a slight different point view on 8).
> > >
> > > My view on this is: instead of sunset Samza as TLP, can we re-charter
> the
> > > scope of Samza to be the home for "running streaming process as a
> > service"?
> > >
> > > My main motivation is from the following points from a long internal
> > > discussion in LinkedIn:
> > >
> > > - There is a clear ask for pluggable partition management, like we do
> in
> > > LinkedIn, and as Ben Kirwin has mentioned in
> > >
> >
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3ccacux-d-yjx++2gnf_1laf10kyuvyamg7up_dt19v0znmmhb...@mail.gmail.com%3E
> > > - There are concerns on lack of support for running stream processing
> in
> > a
> > > cluster: lifecycle management, resource allocation, fault tolerance,
> etc.
> > > - There is a question to how to support more advanced features s.t.
> > > host-affinity, auto-scaling, and dynamic configuration in Samza jobs,
> as
> > > raised by Martin here:
> > >
> >
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%[email protected]%3E
> > >
> > > We have use cases that need to address all the above three cases and
> most
> > > of the functions are all in the current Samza project, in some flavor.
> We
> > > are all supporting to merge the samza-core functionalities into Kafka
> > > Streams, but there is a question where we keep these functions in the
> > > future. One option is to start a new project that includes these
> > functions
> > > that are closely related w/ "run stream-processing as-a-service", while
> > > another personally more attractive option is to re-charter Samza
> project
> > > just do "run stream processing as-a-service". We can avoid the overhead
> > of
> > > re-starting another community for this project. Personally, I felt that
> > > here are the benefits we should be getting:
> > >
> > > 1. We have already agreed mostly that Kafka Streams API would allow
> some
> > > pluggable partition management functions. Hence, the advanced partition
> > > management can live out-side the new Kafka Streams core w/o affecting
> the
> > > run-as-a-library model in Kafka Streams.
> > > 2. The integration w/ cluster management system and advanced features
> > > listed above stays in the same project and allow existing users enjoy
> > > no-impact migration to Kafka Stream as the core. That also addresses
> > Tim's
> > > question on "removing the support for YARN".
> > > 3. A separate project for stream-processing-as-a-service also allow the
> > > new Kafka Streams being independent to any cluster management and just
> > > focusing on stream process core functions, while leaving the functions
> > that
> > > requires cluster-resource and state management to a separate layer.
> > >
> > > Please feel free to comment. Thanks!
> > >
> > > On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <
> [email protected]>
> > > wrote:
> > >
> > >> Hey all,
> > >>
> > >> I want to start by saying that I'm absolutely thrilled to be a part of
> > >> this
> > >> community. The amount of level-headed, thoughtful, educated discussion
> > >> that's gone on over the past ~10 days is overwhelming. Wonderful.
> > >>
> > >> It seems like discussion is waning a bit, and we've reached some
> > >> conclusions. There are several key emails in this threat, which I want
> > to
> > >> call out:
> > >>
> > >> 1. Jakob's summary of the three potential ways forward.
> > >>
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
> > >> 2. Julian's call out that we should be focusing on community over
> code.
> > >>
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
> > >> 3. Martin's summary about the benefits of merging communities.
> > >>
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
> > >> 4. Jakob's comments about the distinction between community and code
> > >> paths.
> > >>
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E
> > >>
> > >> I agree with the comments on all of these emails. I think Martin's
> > summary
> > >> of his position aligns very closely with my own. To that end, I think
> we
> > >> should get concrete about what the proposal is, and call a vote on it.
> > >> Given that Jay, Martin, and I seem to be aligning fairly closely, I
> > think
> > >> we should start with:
> > >>
> > >> 1. [community] Make Samza a subproject of Kafka.
> > >> 2. [community] Make all Samza PMC/committers committers of the
> > subproject.
> > >> 3. [community] Migrate Samza's website/documentation into Kafka's.
> > >> 4. [code] Have the Samza community and the Kafka community start a
> > >> from-scratch reboot together in the new Kafka subproject. We can
> > >> borrow/copy &  paste significant chunks of code from Samza's code
> base.
> > >> 5. [code] The subproject would intentionally eliminate support for
> both
> > >> other streaming systems and all deployment systems.
> > >> 6. [code] Attempt to provide a bridge from our SystemConsumer to
> KIP-26
> > >> (copy cat)
> > >> 7. [code] Attempt to provide a bridge from the new subproject's
> > processor
> > >> interface to our legacy StreamTask interface.
> > >> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka
> > >> subproject that has a fault-tolerant container with state management.
> > >>
> > >> It's likely that (6) and (7) won't be fully drop-in. Still, the closer
> > we
> > >> can get, the better it's going to be for our existing community.
> > >>
> > >> One thing that I didn't touch on with (2) is whether any Samza PMC
> > members
> > >> should be rolled into Kafka PMC membership as well (though, Jay and
> > Jakob
> > >> are already PMC members on both). I think that Samza's community
> > deserves
> > >> a
> > >> voice on the PMC, so I'd propose that we roll at least a few PMC
> members
> > >> into the Kafka PMC, but I don't have a strong framework for which
> people
> > >> to
> > >> pick.
> > >>
> > >> Before (8), I think that Samza's TLP can continue to commit bug fixes
> > and
> > >> patches as it sees fit, provided that we openly communicate that we
> > won't
> > >> necessarily migrate new features to the new subproject, and that the
> TLP
> > >> will be shut down after the migration to the Kafka subproject occurs.
> > >>
> > >> Jakob, I could use your guidance here about about how to achieve this
> > from
> > >> an Apache process perspective (sorry).
> > >>
> > >> * Should I just call a vote on this proposal?
> > >> * Should it happen on dev or private?
> > >> * Do committers have binding votes, or just PMC?
> > >>
> > >> Having trouble finding much detail on the Apache wikis. :(
> > >>
> > >> Cheers,
> > >> Chris
> > >>
> > >> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <[email protected]>
> wrote:
> > >>
> > >> > Thanks, Jay. This argument persuaded me actually. :)
> > >> >
> > >> > Fang, Yan
> > >> > [email protected]
> > >> >
> > >> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <[email protected]>
> wrote:
> > >> >
> > >> > > Hey Yan,
> > >> > >
> > >> > > Yeah philosophically I think the argument is that you should
> capture
> > >> the
> > >> > > stream in Kafka independent of the transformation. This is
> > obviously a
> > >> > > Kafka-centric view point.
> > >> > >
> > >> > > Advantages of this:
> > >> > > - In practice I think this is what e.g. Storm people often end up
> > >> doing
> > >> > > anyway. You usually need to throttle any access to a live serving
> > >> > database.
> > >> > > - Can have multiple subscribers and they get the same thing
> without
> > >> > > additional load on the source system.
> > >> > > - Applications can tap into the stream if need be by subscribing.
> > >> > > - You can debug your transformation by tailing the Kafka topic
> with
> > >> the
> > >> > > console consumer
> > >> > > - Can tee off the same data stream for batch analysis or Lambda
> arch
> > >> > style
> > >> > > re-processing
> > >> > >
> > >> > > The disadvantage is that it will use Kafka resources. But the idea
> > is
> > >> > > eventually you will have multiple subscribers to any data source
> (at
> > >> > least
> > >> > > for monitoring) so you will end up there soon enough anyway.
> > >> > >
> > >> > > Down the road the technical benefit is that I think it gives us a
> > good
> > >> > path
> > >> > > towards end-to-end exactly once semantics from source to
> > destination.
> > >> > > Basically the connectors need to support idempotence when talking
> to
> > >> > Kafka
> > >> > > and we need the transactional write feature in Kafka to make the
> > >> > > transformation atomic. This is actually pretty doable if you
> > separate
> > >> > > connector=>kafka problem from the generic transformations which
> are
> > >> > always
> > >> > > kafka=>kafka. However I think it is quite impossible to do in a
> > >> > all_things
> > >> > > => all_things environment. Today you can say "well the semantics
> of
> > >> the
> > >> > > Samza APIs depend on the connectors you use" but it is actually
> > worse
> > >> > then
> > >> > > that because the semantics actually depend on the pairing of
> > >> > connectors--so
> > >> > > not only can you probably not get a usable "exactly once"
> guarantee
> > >> > > end-to-end it can actually be quite hard to reverse engineer what
> > >> > property
> > >> > > (if any) your end-to-end flow has if you have heterogenous
> systems.
> > >> > >
> > >> > > -Jay
> > >> > >
> > >> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <[email protected]>
> > >> wrote:
> > >> > >
> > >> > > > {quote}
> > >> > > > maintained in a separate repository and retaining the existing
> > >> > > > committership but sharing as much else as possible (website,
> etc)
> > >> > > > {quote}
> > >> > > >
> > >> > > > Overall, I agree on this idea. Now the question is more about
> "how
> > >> to
> > >> > do
> > >> > > > it".
> > >> > > >
> > >> > > > On the other hand, one thing I want to point out is that, if we
> > >> decide
> > >> > to
> > >> > > > go this way, how do we want to support
> > >> > > > otherSystem-transformation-otherSystem use case?
> > >> > > >
> > >> > > > Basically, there are four user groups here:
> > >> > > >
> > >> > > > 1. Kafka-transformation-Kafka
> > >> > > > 2. Kafka-transformation-otherSystem
> > >> > > > 3. otherSystem-transformation-Kafka
> > >> > > > 4. otherSystem-transformation-otherSystem
> > >> > > >
> > >> > > > For group 1, they can easily use the new Samza library to
> achieve.
> > >> For
> > >> > > > group 2 and 3, they can use copyCat -> transformation -> Kafka
> or
> > >> > Kafka->
> > >> > > > transformation -> copyCat.
> > >> > > >
> > >> > > > The problem is for group 4. Do we want to abandon this or still
> > >> support
> > >> > > it?
> > >> > > > Of course, this use case can be achieved by using copyCat ->
> > >> > > transformation
> > >> > > > -> Kafka -> transformation -> copyCat, the thing is how we
> > persuade
> > >> > them
> > >> > > to
> > >> > > > do this long chain. If yes, it will also be a win for Kafka too.
> > Or
> > >> if
> > >> > > > there is no one in this community actually doing this so far,
> > maybe
> > >> ok
> > >> > to
> > >> > > > not support the group 4 directly.
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Fang, Yan
> > >> > > > [email protected]
> > >> > > >
> > >> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <[email protected]>
> > >> wrote:
> > >> > > >
> > >> > > > > Yeah I agree with this summary. I think there are kind of two
> > >> > questions
> > >> > > > > here:
> > >> > > > > 1. Technically does alignment/reliance on Kafka make sense
> > >> > > > > 2. Branding wise (naming, website, concepts, etc) does
> alignment
> > >> with
> > >> > > > Kafka
> > >> > > > > make sense
> > >> > > > >
> > >> > > > > Personally I do think both of these things would be really
> > >> valuable,
> > >> > > and
> > >> > > > > would dramatically alter the trajectory of the project.
> > >> > > > >
> > >> > > > > My preference would be to see if people can mostly agree on a
> > >> > direction
> > >> > > > > rather than splintering things off. From my point of view the
> > >> ideal
> > >> > > > outcome
> > >> > > > > of all the options discussed would be to make Samza a closely
> > >> aligned
> > >> > > > > subproject, maintained in a separate repository and retaining
> > the
> > >> > > > existing
> > >> > > > > committership but sharing as much else as possible (website,
> > >> etc). No
> > >> > > > idea
> > >> > > > > about how these things work, Jacob, you probably know more.
> > >> > > > >
> > >> > > > > No discussion amongst the Kafka folks has happened on this,
> but
> > >> > likely
> > >> > > we
> > >> > > > > should figure out what the Samza community actually wants
> first.
> > >> > > > >
> > >> > > > > I admit that this is a fairly radical departure from how
> things
> > >> are.
> > >> > > > >
> > >> > > > > If that doesn't fly, I think, yeah we could leave Samza as it
> is
> > >> and
> > >> > do
> > >> > > > the
> > >> > > > > more radical reboot inside Kafka. From my point of view that
> > does
> > >> > leave
> > >> > > > > things in a somewhat confusing state since now there are two
> > >> stream
> > >> > > > > processing systems more or less coupled to Kafka in large part
> > >> made
> > >> > by
> > >> > > > the
> > >> > > > > same people. But, arguably that might be a cleaner way to make
> > the
> > >> > > > cut-over
> > >> > > > > and perhaps less risky for Samza community since if it works
> > >> people
> > >> > can
> > >> > > > > switch and if it doesn't nothing will have changed. Dunno, how
> > do
> > >> > > people
> > >> > > > > feel about this?
> > >> > > > >
> > >> > > > > -Jay
> > >> > > > >
> > >> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <
> > [email protected]>
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > >  This leads me to thinking that merging projects and
> > >> communities
> > >> > > > might
> > >> > > > > > be a good idea: with the union of experience from both
> > >> communities,
> > >> > > we
> > >> > > > > will
> > >> > > > > > probably build a better system that is better for users.
> > >> > > > > > Is this what's being proposed though? Merging the projects
> > seems
> > >> > like
> > >> > > > > > a consequence of at most one of the three directions under
> > >> > > discussion:
> > >> > > > > > 1) Samza 2.0: The Samza community relies more heavily on
> Kafka
> > >> for
> > >> > > > > > configuration, etc. (to a greater or lesser extent to be
> > >> > determined)
> > >> > > > > > but the Samza community would not automatically merge withe
> > >> Kafka
> > >> > > > > > community (the Phoenix/HBase example is a good one here).
> > >> > > > > > 2) Samza Reboot: The Samza community continues to exist
> with a
> > >> > > limited
> > >> > > > > > project scope, but similarly would not need to be part of
> the
> > >> Kafka
> > >> > > > > > community (ie given committership) to progress.  Here, maybe
> > the
> > >> > > Samza
> > >> > > > > > team would become a subproject of Kafka (the Board frowns on
> > >> > > > > > subprojects at the moment, so I'm not sure if that's even
> > >> > feasible),
> > >> > > > > > but that would not be required.
> > >> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option
> > the
> > >> > Kafka
> > >> > > > > > team builds its own streaming library, possibly off of Jay's
> > >> > > > > > prototype, which has not direct lineage to the Samza team.
> > >> There's
> > >> > > no
> > >> > > > > > reason for the Kafka team to bring in the Samza team.
> > >> > > > > >
> > >> > > > > > Is the Kafka community on board with this?
> > >> > > > > >
> > >> > > > > > To be clear, all three options under discussion are
> > interesting,
> > >> > > > > > technically valid and likely healthy directions for the
> > project.
> > >> > > > > > Also, they are not mutually exclusive.  The Samza community
> > >> could
> > >> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka
> community
> > >> went
> > >> > > > > > forward with 'Hey Samza!'  My points above are directed
> > >> entirely at
> > >> > > > > > the community aspect of these choices.
> > >> > > > > > -Jakob
> > >> > > > > >
> > >> > > > > > On 10 July 2015 at 09:10, Roger Hoover <
> > [email protected]>
> > >> > > wrote:
> > >> > > > > > > That's great.  Thanks, Jay.
> > >> > > > > > >
> > >> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <
> > [email protected]>
> > >> > > wrote:
> > >> > > > > > >
> > >> > > > > > >> Yeah totally agree. I think you have this issue even
> today,
> > >> > right?
> > >> > > > > I.e.
> > >> > > > > > if
> > >> > > > > > >> you need to make a simple config change and you're
> running
> > in
> > >> > YARN
> > >> > > > > today
> > >> > > > > > >> you end up bouncing the job which then rebuilds state. I
> > >> think
> > >> > the
> > >> > > > fix
> > >> > > > > > is
> > >> > > > > > >> exactly what you described which is to have a long
> timeout
> > on
> > >> > > > > partition
> > >> > > > > > >> movement for stateful jobs so that if a job is just
> getting
> > >> > > bounced,
> > >> > > > > and
> > >> > > > > > >> the cluster manager (or admin) is smart enough to restart
> > it
> > >> on
> > >> > > the
> > >> > > > > same
> > >> > > > > > >> host when possible, it can optimistically reuse any
> > existing
> > >> > state
> > >> > > > it
> > >> > > > > > finds
> > >> > > > > > >> on disk (if it is valid).
> > >> > > > > > >>
> > >> > > > > > >> So in this model the charter of the CM is to place
> > processes
> > >> as
> > >> > > > > > stickily as
> > >> > > > > > >> possible and to restart or re-place failed processes. The
> > >> > charter
> > >> > > of
> > >> > > > > the
> > >> > > > > > >> partition management system is to control the assignment
> of
> > >> work
> > >> > > to
> > >> > > > > > these
> > >> > > > > > >> processes. The nice thing about this is that the work
> > >> > assignment,
> > >> > > > > > timeouts,
> > >> > > > > > >> behavior, configs, and code will all be the same across
> all
> > >> > > cluster
> > >> > > > > > >> managers.
> > >> > > > > > >>
> > >> > > > > > >> So I think that prototype would actually give you exactly
> > >> what
> > >> > you
> > >> > > > > want
> > >> > > > > > >> today for any cluster manager (or manual placement +
> > restart
> > >> > > script)
> > >> > > > > > that
> > >> > > > > > >> was sticky in terms of host placement since there is
> > already
> > >> a
> > >> > > > > > configurable
> > >> > > > > > >> partition movement timeout and task-by-task state reuse
> > with
> > >> a
> > >> > > check
> > >> > > > > on
> > >> > > > > > >> state validity.
> > >> > > > > > >>
> > >> > > > > > >> -Jay
> > >> > > > > > >>
> > >> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <
> > >> > > > [email protected]
> > >> > > > > >
> > >> > > > > > >> wrote:
> > >> > > > > > >>
> > >> > > > > > >> > That would be great to let Kafka do as much heavy
> lifting
> > >> as
> > >> > > > > possible
> > >> > > > > > and
> > >> > > > > > >> > make it easier for other languages to implement Samza
> > apis.
> > >> > > > > > >> >
> > >> > > > > > >> > One thing to watch out for is the interplay between
> > Kafka's
> > >> > > group
> > >> > > > > > >> > management and the external scheduler/process manager's
> > >> fault
> > >> > > > > > tolerance.
> > >> > > > > > >> > If a container dies, the Kafka group membership
> protocol
> > >> will
> > >> > > try
> > >> > > > to
> > >> > > > > > >> assign
> > >> > > > > > >> > it's tasks to other containers while at the same time
> the
> > >> > > process
> > >> > > > > > manager
> > >> > > > > > >> > is trying to relaunch the container.  Without some
> > >> > consideration
> > >> > > > for
> > >> > > > > > this
> > >> > > > > > >> > (like a configurable amount of time to wait before
> Kafka
> > >> > alters
> > >> > > > the
> > >> > > > > > group
> > >> > > > > > >> > membership), there may be thrashing going on which is
> > >> > especially
> > >> > > > bad
> > >> > > > > > for
> > >> > > > > > >> > containers with large amounts of local state.
> > >> > > > > > >> >
> > >> > > > > > >> > Someone else pointed this out already but I thought it
> > >> might
> > >> > be
> > >> > > > > worth
> > >> > > > > > >> > calling out again.
> > >> > > > > > >> >
> > >> > > > > > >> > Cheers,
> > >> > > > > > >> >
> > >> > > > > > >> > Roger
> > >> > > > > > >> >
> > >> > > > > > >> >
> > >> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <
> > >> [email protected]>
> > >> > > > > wrote:
> > >> > > > > > >> >
> > >> > > > > > >> > > Hey Roger,
> > >> > > > > > >> > >
> > >> > > > > > >> > > I couldn't agree more. We spent a bunch of time
> talking
> > >> to
> > >> > > > people
> > >> > > > > > and
> > >> > > > > > >> > that
> > >> > > > > > >> > > is exactly the stuff we heard time and again. What
> > makes
> > >> it
> > >> > > > hard,
> > >> > > > > of
> > >> > > > > > >> > > course, is that there is some tension between
> > >> compatibility
> > >> > > with
> > >> > > > > > what's
> > >> > > > > > >> > > there now and making things better for new users.
> > >> > > > > > >> > >
> > >> > > > > > >> > > I also strongly agree with the importance of
> > >> multi-language
> > >> > > > > > support. We
> > >> > > > > > >> > are
> > >> > > > > > >> > > talking now about Java, but for application
> development
> > >> use
> > >> > > > cases
> > >> > > > > > >> people
> > >> > > > > > >> > > want to work in whatever language they are using
> > >> elsewhere.
> > >> > I
> > >> > > > > think
> > >> > > > > > >> > moving
> > >> > > > > > >> > > to a model where Kafka itself does the group
> > membership,
> > >> > > > lifecycle
> > >> > > > > > >> > control,
> > >> > > > > > >> > > and partition assignment has the advantage of putting
> > all
> > >> > that
> > >> > > > > > complex
> > >> > > > > > >> > > stuff behind a clean api that the clients are already
> > >> going
> > >> > to
> > >> > > > be
> > >> > > > > > >> > > implementing for their consumer, so the added
> > >> functionality
> > >> > > for
> > >> > > > > > stream
> > >> > > > > > >> > > processing beyond a consumer becomes very minor.
> > >> > > > > > >> > >
> > >> > > > > > >> > > -Jay
> > >> > > > > > >> > >
> > >> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <
> > >> > > > > > [email protected]>
> > >> > > > > > >> > > wrote:
> > >> > > > > > >> > >
> > >> > > > > > >> > > > Metamorphosis...nice. :)
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > This has been a great discussion.  As a user of
> Samza
> > >> > who's
> > >> > > > > > recently
> > >> > > > > > >> > > > integrated it into a relatively large
> organization, I
> > >> just
> > >> > > > want
> > >> > > > > to
> > >> > > > > > >> add
> > >> > > > > > >> > > > support to a few points already made.
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > The biggest hurdles to adoption of Samza as it
> > >> currently
> > >> > > > exists
> > >> > > > > > that
> > >> > > > > > >> > I've
> > >> > > > > > >> > > > experienced are:
> > >> > > > > > >> > > > 1) YARN - YARN is overly complex in many
> environments
> > >> > where
> > >> > > > > Puppet
> > >> > > > > > >> > would
> > >> > > > > > >> > > do
> > >> > > > > > >> > > > just fine but it was the only mechanism to get
> fault
> > >> > > > tolerance.
> > >> > > > > > >> > > > 2) Configuration - I think I like the idea of
> > >> configuring
> > >> > > most
> > >> > > > > of
> > >> > > > > > the
> > >> > > > > > >> > job
> > >> > > > > > >> > > > in code rather than config files.  In general, I
> > think
> > >> the
> > >> > > > goal
> > >> > > > > > >> should
> > >> > > > > > >> > be
> > >> > > > > > >> > > > to make it harder to make mistakes, especially of
> the
> > >> kind
> > >> > > > where
> > >> > > > > > the
> > >> > > > > > >> > code
> > >> > > > > > >> > > > expects something and the config doesn't match.
> The
> > >> > current
> > >> > > > > > config
> > >> > > > > > >> is
> > >> > > > > > >> > > > quite intricate and error-prone.  For example, the
> > >> > > application
> > >> > > > > > logic
> > >> > > > > > >> > may
> > >> > > > > > >> > > > depend on bootstrapping a topic but rather than
> > >> asserting
> > >> > > that
> > >> > > > > in
> > >> > > > > > the
> > >> > > > > > >> > > code,
> > >> > > > > > >> > > > you have to rely on getting the config right.
> > Likewise
> > >> > with
> > >> > > > > > serdes,
> > >> > > > > > >> > the
> > >> > > > > > >> > > > Java representations produced by various serdes
> > (JSON,
> > >> > Avro,
> > >> > > > > etc.)
> > >> > > > > > >> are
> > >> > > > > > >> > > not
> > >> > > > > > >> > > > equivalent so you cannot just reconfigure a serde
> > >> without
> > >> > > > > changing
> > >> > > > > > >> the
> > >> > > > > > >> > > > code.   It would be nice for jobs to be able to
> > assert
> > >> > what
> > >> > > > they
> > >> > > > > > >> expect
> > >> > > > > > >> > > > from their input topics in terms of partitioning.
> > >> This is
> > >> > > > > > getting a
> > >> > > > > > >> > > little
> > >> > > > > > >> > > > off topic but I was even thinking about creating a
> > >> "Samza
> > >> > > > config
> > >> > > > > > >> > linter"
> > >> > > > > > >> > > > that would sanity check a set of configs.
> Especially
> > >> in
> > >> > > > > > >> organizations
> > >> > > > > > >> > > > where config is managed by a different team than
> the
> > >> > > > application
> > >> > > > > > >> > > developer,
> > >> > > > > > >> > > > it's very hard to get avoid config mistakes.
> > >> > > > > > >> > > > 3) Java/Scala centric - for many teams (especially
> > >> > > DevOps-type
> > >> > > > > > >> folks),
> > >> > > > > > >> > > the
> > >> > > > > > >> > > > pain of the Java toolchain (maven, slow builds,
> weak
> > >> > command
> > >> > > > > line
> > >> > > > > > >> > > support,
> > >> > > > > > >> > > > configuration over convention) really inhibits
> > >> > productivity.
> > >> > > > As
> > >> > > > > > more
> > >> > > > > > >> > and
> > >> > > > > > >> > > > more high-quality clients become available for
> > Kafka, I
> > >> > hope
> > >> > > > > > they'll
> > >> > > > > > >> > > follow
> > >> > > > > > >> > > > Samza's model.  Not sure how much it affects the
> > >> proposals
> > >> > > in
> > >> > > > > this
> > >> > > > > > >> > thread
> > >> > > > > > >> > > > but please consider other languages in the
> ecosystem
> > as
> > >> > > well.
> > >> > > > > > From
> > >> > > > > > >> > what
> > >> > > > > > >> > > > I've heard, Spark has more Python users than
> > >> Java/Scala.
> > >> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
> > >> > > > > > >> > > >
> > >> > > > > > >> > > >
> > >> > > > > > >> > >
> > >> > > > > > >> >
> > >> > > > > > >>
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> > >> > > > > > >> > > > and are working on a Yeoman generator
> > >> > > > > > >> > > > https://github.com/Quantiply/generator-rico for
> > >> > > Jython/Samza
> > >> > > > > > >> projects
> > >> > > > > > >> > to
> > >> > > > > > >> > > > alleviate some of the pain)
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > I also want to underscore Jay's point about
> improving
> > >> the
> > >> > > user
> > >> > > > > > >> > > experience.
> > >> > > > > > >> > > > That's a very important factor for adoption.  I
> think
> > >> the
> > >> > > goal
> > >> > > > > > should
> > >> > > > > > >> > be
> > >> > > > > > >> > > to
> > >> > > > > > >> > > > make Samza as easy to get started with as something
> > >> like
> > >> > > > > Logstash.
> > >> > > > > > >> > > > Logstash is vastly inferior in terms of
> capabilities
> > to
> > >> > > Samza
> > >> > > > > but
> > >> > > > > > >> it's
> > >> > > > > > >> > > easy
> > >> > > > > > >> > > > to get started and that makes a big difference.
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > Cheers,
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > Roger
> > >> > > > > > >> > > >
> > >> > > > > > >> > > >
> > >> > > > > > >> > > >
> > >> > > > > > >> > > >
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De
> > Francisci
> > >> > > > Morales <
> > >> > > > > > >> > > > [email protected]> wrote:
> > >> > > > > > >> > > >
> > >> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka
> > >> Metamorphosis
> > >> > > is
> > >> > > > a
> > >> > > > > > clear
> > >> > > > > > >> > > > winner
> > >> > > > > > >> > > > > :)
> > >> > > > > > >> > > > >
> > >> > > > > > >> > > > > --
> > >> > > > > > >> > > > > Gianmarco
> > >> > > > > > >> > > > >
> > >> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci
> > >> Morales
> > >> > <
> > >> > > > > > >> > > [email protected]
> > >> > > > > > >> > > > >
> > >> > > > > > >> > > > > wrote:
> > >> > > > > > >> > > > >
> > >> > > > > > >> > > > > > Hi,
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > > @Martin, thanks for you comments.
> > >> > > > > > >> > > > > > Maybe I'm missing some important point, but I
> > think
> > >> > > > coupling
> > >> > > > > > the
> > >> > > > > > >> > > > releases
> > >> > > > > > >> > > > > > is actually a *good* thing.
> > >> > > > > > >> > > > > > To make an example, would it be better if the
> MR
> > >> and
> > >> > > HDFS
> > >> > > > > > >> > components
> > >> > > > > > >> > > of
> > >> > > > > > >> > > > > > Hadoop had different release schedules?
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > > Actually, keeping the discussion in a single
> > place
> > >> > would
> > >> > > > > make
> > >> > > > > > >> > > agreeing
> > >> > > > > > >> > > > on
> > >> > > > > > >> > > > > > releases (and backwards compatibility) much
> > >> easier, as
> > >> > > > > > everybody
> > >> > > > > > >> > > would
> > >> > > > > > >> > > > be
> > >> > > > > > >> > > > > > responsible for the whole codebase.
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > > That said, I like the idea of absorbing
> > samza-core
> > >> as
> > >> > a
> > >> > > > > > >> > sub-project,
> > >> > > > > > >> > > > and
> > >> > > > > > >> > > > > > leave the fancy stuff separate.
> > >> > > > > > >> > > > > > It probably gives 90% of the benefits we have
> > been
> > >> > > > > discussing
> > >> > > > > > >> here.
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > > Cheers,
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > > --
> > >> > > > > > >> > > > > > Gianmarco
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <
> > >> > [email protected]
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > >> Hey Martin,
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> I agree coupling release schedules is a
> > downside.
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> Definitely we can try to solve some of the
> > >> > integration
> > >> > > > > > problems
> > >> > > > > > >> in
> > >> > > > > > >> > > > > >> Confluent Platform or in other distributions.
> > But
> > >> I
> > >> > > think
> > >> > > > > > this
> > >> > > > > > >> > ends
> > >> > > > > > >> > > up
> > >> > > > > > >> > > > > >> being really shallow. I guess I feel to really
> > >> get a
> > >> > > good
> > >> > > > > > user
> > >> > > > > > >> > > > > experience
> > >> > > > > > >> > > > > >> the two systems have to kind of feel like part
> > of
> > >> the
> > >> > > > same
> > >> > > > > > thing
> > >> > > > > > >> > and
> > >> > > > > > >> > > > you
> > >> > > > > > >> > > > > >> can't really add that in later--you can put
> both
> > >> in
> > >> > the
> > >> > > > > same
> > >> > > > > > >> > > > > downloadable
> > >> > > > > > >> > > > > >> tar file but it doesn't really give a very
> > >> cohesive
> > >> > > > > feeling.
> > >> > > > > > I
> > >> > > > > > >> > agree
> > >> > > > > > >> > > > > that
> > >> > > > > > >> > > > > >> ultimately any of the project stuff is as much
> > >> social
> > >> > > and
> > >> > > > > > naming
> > >> > > > > > >> > as
> > >> > > > > > >> > > > > >> anything else--theoretically two totally
> > >> independent
> > >> > > > > projects
> > >> > > > > > >> > could
> > >> > > > > > >> > > > work
> > >> > > > > > >> > > > > >> to
> > >> > > > > > >> > > > > >> tightly align. In practice this seems to be
> > quite
> > >> > > > difficult
> > >> > > > > > >> > though.
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> For the frameworks--totally agree it would be
> > >> good to
> > >> > > > > > maintain
> > >> > > > > > >> the
> > >> > > > > > >> > > > > >> framework support with the project. In some
> > cases
> > >> > there
> > >> > > > may
> > >> > > > > > not
> > >> > > > > > >> be
> > >> > > > > > >> > > too
> > >> > > > > > >> > > > > >> much
> > >> > > > > > >> > > > > >> there since the integration gets lighter but I
> > >> think
> > >> > > > > whatever
> > >> > > > > > >> > stubs
> > >> > > > > > >> > > > you
> > >> > > > > > >> > > > > >> need should be included. So no I definitely
> > wasn't
> > >> > > trying
> > >> > > > > to
> > >> > > > > > >> imply
> > >> > > > > > >> > > > > >> dropping
> > >> > > > > > >> > > > > >> support for these frameworks, just making the
> > >> > > integration
> > >> > > > > > >> lighter
> > >> > > > > > >> > by
> > >> > > > > > >> > > > > >> separating process management from partition
> > >> > > management.
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> You raise two good points we would have to
> > figure
> > >> out
> > >> > > if
> > >> > > > we
> > >> > > > > > went
> > >> > > > > > >> > > down
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> alignment path:
> > >> > > > > > >> > > > > >> 1. With respect to the name, yeah I think the
> > >> first
> > >> > > > > question
> > >> > > > > > is
> > >> > > > > > >> > > > whether
> > >> > > > > > >> > > > > >> some "re-branding" would be worth it. If so
> > then I
> > >> > > think
> > >> > > > we
> > >> > > > > > can
> > >> > > > > > >> > > have a
> > >> > > > > > >> > > > > big
> > >> > > > > > >> > > > > >> thread on the name. I'm definitely not set on
> > >> Kafka
> > >> > > > > > Streaming or
> > >> > > > > > >> > > Kafka
> > >> > > > > > >> > > > > >> Streams I was just using them to be kind of
> > >> > > > illustrative. I
> > >> > > > > > >> agree
> > >> > > > > > >> > > with
> > >> > > > > > >> > > > > >> your
> > >> > > > > > >> > > > > >> critique of these names, though I think people
> > >> would
> > >> > > get
> > >> > > > > the
> > >> > > > > > >> idea.
> > >> > > > > > >> > > > > >> 2. Yeah you also raise a good point about how
> to
> > >> > > "factor"
> > >> > > > > it.
> > >> > > > > > >> Here
> > >> > > > > > >> > > are
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> options I see (I could get enthusiastic about
> > any
> > >> of
> > >> > > > them):
> > >> > > > > > >> > > > > >>    a. One repo for both Kafka and Samza
> > >> > > > > > >> > > > > >>    b. Two repos, retaining the current
> > seperation
> > >> > > > > > >> > > > > >>    c. Two repos, the equivalent of samza-api
> and
> > >> > > > samza-core
> > >> > > > > > is
> > >> > > > > > >> > > > absorbed
> > >> > > > > > >> > > > > >> almost like a third client
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> Cheers,
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> -Jay
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin
> > Kleppmann <
> > >> > > > > > >> > > > [email protected]>
> > >> > > > > > >> > > > > >> wrote:
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a
> few
> > >> > > follow-up
> > >> > > > > > >> > comments.
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka or
> > >> > becoming
> > >> > > a
> > >> > > > > > >> > subproject:
> > >> > > > > > >> > > > the
> > >> > > > > > >> > > > > >> > reasons you mention are good. The risk I see
> > is
> > >> > that
> > >> > > > > > release
> > >> > > > > > >> > > > schedules
> > >> > > > > > >> > > > > >> > become coupled to each other, which can slow
> > >> > everyone
> > >> > > > > down,
> > >> > > > > > >> and
> > >> > > > > > >> > > > large
> > >> > > > > > >> > > > > >> > projects with many contributors are harder
> to
> > >> > manage.
> > >> > > > > > (Jakob,
> > >> > > > > > >> > can
> > >> > > > > > >> > > > you
> > >> > > > > > >> > > > > >> speak
> > >> > > > > > >> > > > > >> > from experience, having seen a wider range
> of
> > >> > Hadoop
> > >> > > > > > ecosystem
> > >> > > > > > >> > > > > >> projects?)
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > Some of the goals of a better unified
> > developer
> > >> > > > > experience
> > >> > > > > > >> could
> > >> > > > > > >> > > > also
> > >> > > > > > >> > > > > be
> > >> > > > > > >> > > > > >> > solved by integrating Samza nicely into a
> > Kafka
> > >> > > > > > distribution
> > >> > > > > > >> > (such
> > >> > > > > > >> > > > as
> > >> > > > > > >> > > > > >> > Confluent's). I'm not against merging
> projects
> > >> if
> > >> > we
> > >> > > > > decide
> > >> > > > > > >> > that's
> > >> > > > > > >> > > > the
> > >> > > > > > >> > > > > >> way
> > >> > > > > > >> > > > > >> > to go, just pointing out the same goals can
> > >> perhaps
> > >> > > > also
> > >> > > > > be
> > >> > > > > > >> > > achieved
> > >> > > > > > >> > > > > in
> > >> > > > > > >> > > > > >> > other ways.
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > - With regard to dropping the YARN
> dependency:
> > >> are
> > >> > > you
> > >> > > > > > >> proposing
> > >> > > > > > >> > > > that
> > >> > > > > > >> > > > > >> > Samza doesn't give any help to people
> wanting
> > to
> > >> > run
> > >> > > on
> > >> > > > > > >> > > > > >> YARN/Mesos/AWS/etc?
> > >> > > > > > >> > > > > >> > So the docs would basically have a link to
> > >> Slider
> > >> > and
> > >> > > > > > nothing
> > >> > > > > > >> > > else?
> > >> > > > > > >> > > > Or
> > >> > > > > > >> > > > > >> > would we maintain integrations with a bunch
> of
> > >> > > popular
> > >> > > > > > >> > deployment
> > >> > > > > > >> > > > > >> methods
> > >> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts
> to
> > >> make
> > >> > > > Samza
> > >> > > > > > work
> > >> > > > > > >> > with
> > >> > > > > > >> > > > > >> Slider)?
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > I absolutely think it's a good idea to have
> > the
> > >> > "as a
> > >> > > > > > library"
> > >> > > > > > >> > and
> > >> > > > > > >> > > > > "as a
> > >> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for
> > >> people
> > >> > who
> > >> > > > > want
> > >> > > > > > >> them,
> > >> > > > > > >> > > > but I
> > >> > > > > > >> > > > > >> > think there should also be a low-friction
> path
> > >> for
> > >> > > > common
> > >> > > > > > "as
> > >> > > > > > >> a
> > >> > > > > > >> > > > > service"
> > >> > > > > > >> > > > > >> > deployment methods, for which we probably
> need
> > >> to
> > >> > > > > maintain
> > >> > > > > > >> > > > > integrations.
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd
> to
> > >> me,
> > >> > > > > because
> > >> > > > > > >> Kafka
> > >> > > > > > >> > > is
> > >> > > > > > >> > > > > all
> > >> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka
> > >> Transformers"
> > >> > > or
> > >> > > > > > "Kafka
> > >> > > > > > >> > > > Filters"
> > >> > > > > > >> > > > > >> > would be more apt?
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza
> > >> (stream
> > >> > > > > > >> transformation
> > >> > > > > > >> > > > with
> > >> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a
> > >> library"
> > >> > > bit)
> > >> > > > > > could
> > >> > > > > > >> > > become
> > >> > > > > > >> > > > > >> part of
> > >> > > > > > >> > > > > >> > Kafka, while higher-level tools such as
> > >> streaming
> > >> > SQL
> > >> > > > and
> > >> > > > > > >> > > > integrations
> > >> > > > > > >> > > > > >> with
> > >> > > > > > >> > > > > >> > deployment frameworks remain in a separate
> > >> project?
> > >> > > In
> > >> > > > > > other
> > >> > > > > > >> > > words,
> > >> > > > > > >> > > > > >> Kafka
> > >> > > > > > >> > > > > >> > would absorb the proven, stable core of
> Samza,
> > >> > which
> > >> > > > > would
> > >> > > > > > >> > become
> > >> > > > > > >> > > > the
> > >> > > > > > >> > > > > >> > "third Kafka client" mentioned early in this
> > >> > thread.
> > >> > > > The
> > >> > > > > > Samza
> > >> > > > > > >> > > > project
> > >> > > > > > >> > > > > >> > would then target that third Kafka client as
> > its
> > >> > base
> > >> > > > > API,
> > >> > > > > > and
> > >> > > > > > >> > the
> > >> > > > > > >> > > > > >> project
> > >> > > > > > >> > > > > >> > would be freed up to explore more
> experimental
> > >> new
> > >> > > > > > horizons.
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > Martin
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
> > >> > > > [email protected]>
> > >> > > > > > >> wrote:
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > > Hey Martin,
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I
> actually
> > >> > don't
> > >> > > > > think
> > >> > > > > > it
> > >> > > > > > >> > ties
> > >> > > > > > >> > > > our
> > >> > > > > > >> > > > > >> > hands
> > >> > > > > > >> > > > > >> > > at all, all it does is refactor things.
> The
> > >> > > division
> > >> > > > of
> > >> > > > > > >> > > > > >> responsibility is
> > >> > > > > > >> > > > > >> > > that Samza core is responsible for task
> > >> > lifecycle,
> > >> > > > > state,
> > >> > > > > > >> and
> > >> > > > > > >> > > > > >> partition
> > >> > > > > > >> > > > > >> > > management (using the Kafka co-ordinator)
> > but
> > >> it
> > >> > is
> > >> > > > NOT
> > >> > > > > > >> > > > responsible
> > >> > > > > > >> > > > > >> for
> > >> > > > > > >> > > > > >> > > packaging, configuration deployment or
> > >> execution
> > >> > of
> > >> > > > > > >> processes.
> > >> > > > > > >> > > The
> > >> > > > > > >> > > > > >> > problem
> > >> > > > > > >> > > > > >> > > of packaging and starting these processes
> is
> > >> > > > > > >> > > > > >> > > framework/environment-specific. This
> leaves
> > >> > > > individual
> > >> > > > > > >> > > frameworks
> > >> > > > > > >> > > > to
> > >> > > > > > >> > > > > >> be
> > >> > > > > > >> > > > > >> > as
> > >> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you can
> > get
> > >> > > simple
> > >> > > > > > >> stateless
> > >> > > > > > >> > > > > >> support in
> > >> > > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf
> > app
> > >> > > > > framework
> > >> > > > > > >> > > (Slider,
> > >> > > > > > >> > > > > >> > Marathon,
> > >> > > > > > >> > > > > >> > > etc). These are well known by people and
> > have
> > >> > nice
> > >> > > > UIs
> > >> > > > > > and a
> > >> > > > > > >> > lot
> > >> > > > > > >> > > > of
> > >> > > > > > >> > > > > >> > > flexibility. I don't think they have node
> > >> > affinity
> > >> > > > as a
> > >> > > > > > >> built
> > >> > > > > > >> > in
> > >> > > > > > >> > > > > >> option
> > >> > > > > > >> > > > > >> > > (though I could be wrong). So if we want
> > that
> > >> we
> > >> > > can
> > >> > > > > > either
> > >> > > > > > >> > wait
> > >> > > > > > >> > > > for
> > >> > > > > > >> > > > > >> them
> > >> > > > > > >> > > > > >> > > to add it or do a custom framework to add
> > that
> > >> > > > feature
> > >> > > > > > (as
> > >> > > > > > >> > now).
> > >> > > > > > >> > > > > >> > Obviously
> > >> > > > > > >> > > > > >> > > if you manage things with old-school ops
> > tools
> > >> > > > > > >> > (puppet/chef/etc)
> > >> > > > > > >> > > > you
> > >> > > > > > >> > > > > >> get
> > >> > > > > > >> > > > > >> > > locality easily. The nice thing, though,
> is
> > >> that
> > >> > > all
> > >> > > > > the
> > >> > > > > > >> samza
> > >> > > > > > >> > > > > >> "business
> > >> > > > > > >> > > > > >> > > logic" around partition management and
> fault
> > >> > > > tolerance
> > >> > > > > > is in
> > >> > > > > > >> > > Samza
> > >> > > > > > >> > > > > >> core
> > >> > > > > > >> > > > > >> > so
> > >> > > > > > >> > > > > >> > > it is shared across frameworks and the
> > >> framework
> > >> > > > > specific
> > >> > > > > > >> bit
> > >> > > > > > >> > is
> > >> > > > > > >> > > > > just
> > >> > > > > > >> > > > > >> > > whether it is smart enough to try to get
> the
> > >> same
> > >> > > > host
> > >> > > > > > when
> > >> > > > > > >> a
> > >> > > > > > >> > > job
> > >> > > > > > >> > > > is
> > >> > > > > > >> > > > > >> > > restarted.
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah
> I
> > >> think
> > >> > > the
> > >> > > > > > goal
> > >> > > > > > >> > would
> > >> > > > > > >> > > > be
> > >> > > > > > >> > > > > >> (a)
> > >> > > > > > >> > > > > >> > > actually get better alignment in user
> > >> experience,
> > >> > > and
> > >> > > > > (b)
> > >> > > > > > >> > > express
> > >> > > > > > >> > > > > >> this in
> > >> > > > > > >> > > > > >> > > the naming and project branding.
> > Specifically:
> > >> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the
> > >> > > > > > "transformation"
> > >> > > > > > >> api
> > >> > > > > > >> > > to
> > >> > > > > > >> > > > be
> > >> > > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e.
> be
> > >> able
> > >> > > to
> > >> > > > > > explain
> > >> > > > > > >> > > when
> > >> > > > > > >> > > > to
> > >> > > > > > >> > > > > >> use
> > >> > > > > > >> > > > > >> > > the consumer and when to use the stream
> > >> > processing
> > >> > > > > > >> > functionality
> > >> > > > > > >> > > > and
> > >> > > > > > >> > > > > >> lead
> > >> > > > > > >> > > > > >> > > people into that experience.
> > >> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza
> 1.4.2
> > >> (or
> > >> > > > > > whatever)
> > >> > > > > > >> > that
> > >> > > > > > >> > > > has
> > >> > > > > > >> > > > > >> both
> > >> > > > > > >> > > > > >> > > Kafka and the stream processing part and
> > they
> > >> > > > actually
> > >> > > > > > work
> > >> > > > > > >> > > > > together.
> > >> > > > > > >> > > > > >> > > 3. Unify the programming experience so the
> > >> client
> > >> > > and
> > >> > > > > > Samza
> > >> > > > > > >> > api
> > >> > > > > > >> > > > > share
> > >> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > I think sub-projects keep separate
> > committers
> > >> and
> > >> > > can
> > >> > > > > > have a
> > >> > > > > > >> > > > > separate
> > >> > > > > > >> > > > > >> > repo,
> > >> > > > > > >> > > > > >> > > but I'm actually not really sure (I can't
> > >> find a
> > >> > > > > > definition
> > >> > > > > > >> > of a
> > >> > > > > > >> > > > > >> > subproject
> > >> > > > > > >> > > > > >> > > in Apache).
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > Basically at a high-level you want the
> > >> experience
> > >> > > to
> > >> > > > > > "feel"
> > >> > > > > > >> > > like a
> > >> > > > > > >> > > > > >> single
> > >> > > > > > >> > > > > >> > > system, not to relatively independent
> things
> > >> that
> > >> > > are
> > >> > > > > > kind
> > >> > > > > > >> of
> > >> > > > > > >> > > > > >> awkwardly
> > >> > > > > > >> > > > > >> > > glued together.
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > I think if we did that they having naming
> or
> > >> > > branding
> > >> > > > > > like
> > >> > > > > > >> > > "kafka
> > >> > > > > > >> > > > > >> > > streaming" or "kafka streams" or something
> > >> like
> > >> > > that
> > >> > > > > > would
> > >> > > > > > >> > > > actually
> > >> > > > > > >> > > > > >> do a
> > >> > > > > > >> > > > > >> > > good job of conveying what it is. I do
> that
> > >> this
> > >> > > > would
> > >> > > > > > help
> > >> > > > > > >> > > > adoption
> > >> > > > > > >> > > > > >> > quite
> > >> > > > > > >> > > > > >> > > a lot as it would correctly convey that
> > using
> > >> > Kafka
> > >> > > > > > >> Streaming
> > >> > > > > > >> > > with
> > >> > > > > > >> > > > > >> Kafka
> > >> > > > > > >> > > > > >> > is
> > >> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka is
> > >> pretty
> > >> > > > > heavily
> > >> > > > > > >> > adopted
> > >> > > > > > >> > > > at
> > >> > > > > > >> > > > > >> this
> > >> > > > > > >> > > > > >> > > point.
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > Fwiw we actually considered this model
> > >> originally
> > >> > > > when
> > >> > > > > > open
> > >> > > > > > >> > > > sourcing
> > >> > > > > > >> > > > > >> > Samza,
> > >> > > > > > >> > > > > >> > > however at that time Kafka was relatively
> > >> unknown
> > >> > > and
> > >> > > > > we
> > >> > > > > > >> > decided
> > >> > > > > > >> > > > not
> > >> > > > > > >> > > > > >> to
> > >> > > > > > >> > > > > >> > do
> > >> > > > > > >> > > > > >> > > it since we felt it would be limiting.
> From
> > my
> > >> > > point
> > >> > > > of
> > >> > > > > > view
> > >> > > > > > >> > the
> > >> > > > > > >> > > > > three
> > >> > > > > > >> > > > > >> > > things have changed (1) Kafka is now
> really
> > >> > heavily
> > >> > > > > used
> > >> > > > > > for
> > >> > > > > > >> > > > stream
> > >> > > > > > >> > > > > >> > > processing, (2) we learned that
> abstracting
> > >> out
> > >> > the
> > >> > > > > > stream
> > >> > > > > > >> > well
> > >> > > > > > >> > > is
> > >> > > > > > >> > > > > >> > > basically impossible, (3) we learned it is
> > >> really
> > >> > > > hard
> > >> > > > > to
> > >> > > > > > >> keep
> > >> > > > > > >> > > the
> > >> > > > > > >> > > > > two
> > >> > > > > > >> > > > > >> > > things feeling like a single product.
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > -Jay
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin
> > >> Kleppmann
> > >> > <
> > >> > > > > > >> > > > > >> [email protected]>
> > >> > > > > > >> > > > > >> > > wrote:
> > >> > > > > > >> > > > > >> > >
> > >> > > > > > >> > > > > >> > >> Hi all,
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> Lots of good thoughts here.
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> I agree with the general philosophy of
> > tying
> > >> > Samza
> > >> > > > > more
> > >> > > > > > >> > firmly
> > >> > > > > > >> > > to
> > >> > > > > > >> > > > > >> Kafka.
> > >> > > > > > >> > > > > >> > >> After I spent a while looking at
> > integrating
> > >> > other
> > >> > > > > > message
> > >> > > > > > >> > > > brokers
> > >> > > > > > >> > > > > >> (e.g.
> > >> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to
> the
> > >> > > > conclusion
> > >> > > > > > that
> > >> > > > > > >> > > > > >> > SystemConsumer
> > >> > > > > > >> > > > > >> > >> tacitly assumes a model so much like
> > Kafka's
> > >> > that
> > >> > > > > pretty
> > >> > > > > > >> much
> > >> > > > > > >> > > > > nobody
> > >> > > > > > >> > > > > >> but
> > >> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus is
> > >> > perhaps
> > >> > > an
> > >> > > > > > >> > exception,
> > >> > > > > > >> > > > but
> > >> > > > > > >> > > > > >> it
> > >> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.)
> > Thus,
> > >> > > making
> > >> > > > > > Samza
> > >> > > > > > >> > > fully
> > >> > > > > > >> > > > > >> > dependent
> > >> > > > > > >> > > > > >> > >> on Kafka acknowledges that the
> > >> > system-independence
> > >> > > > was
> > >> > > > > > >> never
> > >> > > > > > >> > as
> > >> > > > > > >> > > > > real
> > >> > > > > > >> > > > > >> as
> > >> > > > > > >> > > > > >> > we
> > >> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of
> > code
> > >> > reuse
> > >> > > > are
> > >> > > > > > >> real.
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN
> has
> > >> also
> > >> > > > always
> > >> > > > > > been
> > >> > > > > > >> > > > > >> appealing to
> > >> > > > > > >> > > > > >> > >> me, for various reasons already mentioned
> > in
> > >> > this
> > >> > > > > > thread.
> > >> > > > > > >> > > > Although
> > >> > > > > > >> > > > > >> > making
> > >> > > > > > >> > > > > >> > >> Samza jobs deployable on anything
> > >> > > > (YARN/Mesos/AWS/etc)
> > >> > > > > > >> seems
> > >> > > > > > >> > > > > >> laudable,
> > >> > > > > > >> > > > > >> > I am
> > >> > > > > > >> > > > > >> > >> a little concerned that it will restrict
> us
> > >> to a
> > >> > > > > lowest
> > >> > > > > > >> > common
> > >> > > > > > >> > > > > >> > denominator.
> > >> > > > > > >> > > > > >> > >> For example, would host affinity
> > (SAMZA-617)
> > >> > still
> > >> > > > be
> > >> > > > > > >> > possible?
> > >> > > > > > >> > > > For
> > >> > > > > > >> > > > > >> jobs
> > >> > > > > > >> > > > > >> > >> with large amounts of state, I think
> > >> SAMZA-617
> > >> > > would
> > >> > > > > be
> > >> > > > > > a
> > >> > > > > > >> big
> > >> > > > > > >> > > > boon,
> > >> > > > > > >> > > > > >> > since
> > >> > > > > > >> > > > > >> > >> restoring state off the changelog on
> every
> > >> > single
> > >> > > > > > restart
> > >> > > > > > >> is
> > >> > > > > > >> > > > > painful,
> > >> > > > > > >> > > > > >> > due
> > >> > > > > > >> > > > > >> > >> to long recovery times. It would be a
> shame
> > >> if
> > >> > the
> > >> > > > > > >> decoupling
> > >> > > > > > >> > > > from
> > >> > > > > > >> > > > > >> YARN
> > >> > > > > > >> > > > > >> > >> made host affinity impossible.
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> Jay, a question about the proposed API
> for
> > >> > > > > > instantiating a
> > >> > > > > > >> > job
> > >> > > > > > >> > > in
> > >> > > > > > >> > > > > >> code
> > >> > > > > > >> > > > > >> > >> (rather than a properties file): when
> > >> > submitting a
> > >> > > > job
> > >> > > > > > to a
> > >> > > > > > >> > > > > cluster,
> > >> > > > > > >> > > > > >> is
> > >> > > > > > >> > > > > >> > the
> > >> > > > > > >> > > > > >> > >> idea that the instantiation code runs on
> a
> > >> > client
> > >> > > > > > >> somewhere,
> > >> > > > > > >> > > > which
> > >> > > > > > >> > > > > >> then
> > >> > > > > > >> > > > > >> > >> pokes the necessary endpoints on
> > >> > > YARN/Mesos/AWS/etc?
> > >> > > > > Or
> > >> > > > > > >> does
> > >> > > > > > >> > > that
> > >> > > > > > >> > > > > >> code
> > >> > > > > > >> > > > > >> > run
> > >> > > > > > >> > > > > >> > >> on each container that is part of the job
> > (in
> > >> > > which
> > >> > > > > > case,
> > >> > > > > > >> how
> > >> > > > > > >> > > > does
> > >> > > > > > >> > > > > >> the
> > >> > > > > > >> > > > > >> > job
> > >> > > > > > >> > > > > >> > >> submission to the cluster work)?
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel
> > >> right to
> > >> > > > make
> > >> > > > > a
> > >> > > > > > 1.0
> > >> > > > > > >> > > > release
> > >> > > > > > >> > > > > >> > with a
> > >> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete.
> So
> > if
> > >> > this
> > >> > > > is
> > >> > > > > > going
> > >> > > > > > >> > to
> > >> > > > > > >> > > > > >> happen, I
> > >> > > > > > >> > > > > >> > >> think it would be more honest to stick
> with
> > >> 0.*
> > >> > > > > version
> > >> > > > > > >> > numbers
> > >> > > > > > >> > > > > until
> > >> > > > > > >> > > > > >> > the
> > >> > > > > > >> > > > > >> > >> library-ified Samza has been implemented,
> > is
> > >> > > stable
> > >> > > > > and
> > >> > > > > > >> > widely
> > >> > > > > > >> > > > > used.
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of
> > >> Kafka?
> > >> > > There
> > >> > > > > is
> > >> > > > > > >> > > precedent
> > >> > > > > > >> > > > > for
> > >> > > > > > >> > > > > >> > >> tight coupling between different Apache
> > >> projects
> > >> > > > (e.g.
> > >> > > > > > >> > Curator
> > >> > > > > > >> > > > and
> > >> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I
> think
> > >> > > remaining
> > >> > > > > > >> separate
> > >> > > > > > >> > > > would
> > >> > > > > > >> > > > > >> be
> > >> > > > > > >> > > > > >> > ok.
> > >> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on
> Kafka,
> > >> there
> > >> > > is
> > >> > > > > > enough
> > >> > > > > > >> > > > > substance
> > >> > > > > > >> > > > > >> in
> > >> > > > > > >> > > > > >> > >> Samza that it warrants being a separate
> > >> project.
> > >> > > An
> > >> > > > > > >> argument
> > >> > > > > > >> > in
> > >> > > > > > >> > > > > >> favour
> > >> > > > > > >> > > > > >> > of
> > >> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a
> > much
> > >> > > > stronger
> > >> > > > > > >> "brand
> > >> > > > > > >> > > > > >> presence"
> > >> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one.
> If
> > >> the
> > >> > > Kafka
> > >> > > > > > >> project
> > >> > > > > > >> > is
> > >> > > > > > >> > > > > >> willing
> > >> > > > > > >> > > > > >> > to
> > >> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of
> > doing
> > >> > > > stateful
> > >> > > > > > >> stream
> > >> > > > > > >> > > > > >> > >> transformations, that would probably have
> > >> much
> > >> > the
> > >> > > > > same
> > >> > > > > > >> > effect
> > >> > > > > > >> > > as
> > >> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream
> > >> Processors"
> > >> > or
> > >> > > > > > suchlike.
> > >> > > > > > >> > > Close
> > >> > > > > > >> > > > > >> > >> collaboration between the two projects
> will
> > >> be
> > >> > > > needed
> > >> > > > > in
> > >> > > > > > >> any
> > >> > > > > > >> > > > case.
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> From a project management perspective, I
> > >> guess
> > >> > the
> > >> > > > > "new
> > >> > > > > > >> > Samza"
> > >> > > > > > >> > > > > would
> > >> > > > > > >> > > > > >> > have
> > >> > > > > > >> > > > > >> > >> to be developed on a branch alongside
> > ongoing
> > >> > > > > > maintenance
> > >> > > > > > >> of
> > >> > > > > > >> > > the
> > >> > > > > > >> > > > > >> current
> > >> > > > > > >> > > > > >> > >> line of development? I think it would be
> > >> > important
> > >> > > > to
> > >> > > > > > >> > continue
> > >> > > > > > >> > > > > >> > supporting
> > >> > > > > > >> > > > > >> > >> existing users, and provide a graceful
> > >> migration
> > >> > > > path
> > >> > > > > to
> > >> > > > > > >> the
> > >> > > > > > >> > > new
> > >> > > > > > >> > > > > >> > version.
> > >> > > > > > >> > > > > >> > >> Leaving the current versions unsupported
> > and
> > >> > > forcing
> > >> > > > > > people
> > >> > > > > > >> > to
> > >> > > > > > >> > > > > >> rewrite
> > >> > > > > > >> > > > > >> > >> their jobs would send a bad signal.
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> Best,
> > >> > > > > > >> > > > > >> > >> Martin
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <
> > >> > > > [email protected]>
> > >> > > > > > >> wrote:
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >>> Hey Garry,
> > >> > > > > > >> > > > > >> > >>>
> > >> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be
> > happy
> > >> to
> > >> > > chat
> > >> > > > > > more
> > >> > > > > > >> > about
> > >> > > > > > >> > > > > this
> > >> > > > > > >> > > > > >> if
> > >> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I
> > >> > started
> > >> > > > with
> > >> > > > > > the
> > >> > > > > > >> > idea
> > >> > > > > > >> > > > of
> > >> > > > > > >> > > > > >> "what
> > >> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass
> > >> > ingestion
> > >> > > > > tool"
> > >> > > > > > but
> > >> > > > > > >> > > > > >> ultimately
> > >> > > > > > >> > > > > >> > we
> > >> > > > > > >> > > > > >> > >>> kind of came around to the idea that
> > >> ingestion
> > >> > > and
> > >> > > > > > >> > > > transformation
> > >> > > > > > >> > > > > >> had
> > >> > > > > > >> > > > > >> > >>> pretty different needs and coupling the
> > two
> > >> > made
> > >> > > > > things
> > >> > > > > > >> > hard.
> > >> > > > > > >> > > > > >> > >>>
> > >> > > > > > >> > > > > >> > >>> For what it's worth I think copycat
> > (KIP-26)
> > >> > > > actually
> > >> > > > > > will
> > >> > > > > > >> > do
> > >> > > > > > >> > > > what
> > >> > > > > > >> > > > > >> you
> > >> > > > > > >> > > > > >> > >> are
> > >> > > > > > >> > > > > >> > >>> looking for.
> > >> > > > > > >> > > > > >> > >>>
> > >> > > > > > >> > > > > >> > >>> With regard to your point about slider,
> I
> > >> don't
> > >> > > > > > >> necessarily
> > >> > > > > > >> > > > > >> disagree.
> > >> > > > > > >> > > > > >> > >> But I
> > >> > > > > > >> > > > > >> > >>> think getting good YARN support is quite
> > >> doable
> > >> > > > and I
> > >> > > > > > >> think
> > >> > > > > > >> > we
> > >> > > > > > >> > > > can
> > >> > > > > > >> > > > > >> make
> > >> > > > > > >> > > > > >> > >>> that work well. I think the issue this
> > >> proposal
> > >> > > > > solves
> > >> > > > > > is
> > >> > > > > > >> > that
> > >> > > > > > >> > > > > >> > >> technically
> > >> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple
> > >> cluster
> > >> > > > > > management
> > >> > > > > > >> > > systems
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> > way
> > >> > > > > > >> > > > > >> > >>> things are now, you need to write an
> "app
> > >> > master"
> > >> > > > or
> > >> > > > > > >> > > "framework"
> > >> > > > > > >> > > > > for
> > >> > > > > > >> > > > > >> > each
> > >> > > > > > >> > > > > >> > >>> and they are all a little different so
> > >> testing
> > >> > is
> > >> > > > > > really
> > >> > > > > > >> > hard.
> > >> > > > > > >> > > > In
> > >> > > > > > >> > > > > >> the
> > >> > > > > > >> > > > > >> > >>> absence of this we have been stuck with
> > just
> > >> > YARN
> > >> > > > > which
> > >> > > > > > >> has
> > >> > > > > > >> > > > > >> fantastic
> > >> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the
> > org,
> > >> but
> > >> > > > zero
> > >> > > > > > >> > > penetration
> > >> > > > > > >> > > > > >> > >> elsewhere.
> > >> > > > > > >> > > > > >> > >>> Given the huge amount of work being put
> in
> > >> to
> > >> > > > slider,
> > >> > > > > > >> > > marathon,
> > >> > > > > > >> > > > > aws
> > >> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen
> > related
> > >> > > > packaging
> > >> > > > > > >> > > > technologies
> > >> > > > > > >> > > > > >> > people
> > >> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various
> > >> > > > > cloud-specific
> > >> > > > > > >> > deploy
> > >> > > > > > >> > > > > >> tools,
> > >> > > > > > >> > > > > >> > >> etc)
> > >> > > > > > >> > > > > >> > >>> I really think it is important to get
> this
> > >> > right.
> > >> > > > > > >> > > > > >> > >>>
> > >> > > > > > >> > > > > >> > >>> -Jay
> > >> > > > > > >> > > > > >> > >>>
> > >> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry
> > >> > Turkington
> > >> > > <
> > >> > > > > > >> > > > > >> > >>> [email protected]> wrote:
> > >> > > > > > >> > > > > >> > >>>
> > >> > > > > > >> > > > > >> > >>>> Hi all,
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> I think the question below re does
> Samza
> > >> > become
> > >> > > a
> > >> > > > > > >> > sub-project
> > >> > > > > > >> > > > of
> > >> > > > > > >> > > > > >> Kafka
> > >> > > > > > >> > > > > >> > >>>> highlights the broader point around
> > >> migration.
> > >> > > > Chris
> > >> > > > > > >> > mentions
> > >> > > > > > >> > > > > >> Samza's
> > >> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1
> release
> > >> but
> > >> > I'm
> > >> > > > not
> > >> > > > > > sure
> > >> > > > > > >> > it
> > >> > > > > > >> > > > > feels
> > >> > > > > > >> > > > > >> > >> right to
> > >> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to
> > >> deprecate
> > >> > > > most
> > >> > > > > of
> > >> > > > > > >> it.
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some
> > guys
> > >> > who
> > >> > > > have
> > >> > > > > > >> > started
> > >> > > > > > >> > > > > >> working
> > >> > > > > > >> > > > > >> > >> with
> > >> > > > > > >> > > > > >> > >>>> Samza and building some new
> > >> > consumers/producers
> > >> > > > was
> > >> > > > > > next
> > >> > > > > > >> > up.
> > >> > > > > > >> > > > > Sounds
> > >> > > > > > >> > > > > >> > like
> > >> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to
> > >> go. I
> > >> > > need
> > >> > > > > to
> > >> > > > > > >> look
> > >> > > > > > >> > > into
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> > KIP
> > >> > > > > > >> > > > > >> > >> in
> > >> > > > > > >> > > > > >> > >>>> more detail but for me the
> attractiveness
> > >> of
> > >> > > > adding
> > >> > > > > > new
> > >> > > > > > >> > Samza
> > >> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all
> > they
> > >> > were
> > >> > > > > doing
> > >> > > > > > was
> > >> > > > > > >> > > > really
> > >> > > > > > >> > > > > >> > getting
> > >> > > > > > >> > > > > >> > >>>> data into and out of Kafka --  was to
> > avoid
> > >> > > > having
> > >> > > > > to
> > >> > > > > > >> > worry
> > >> > > > > > >> > > > > about
> > >> > > > > > >> > > > > >> the
> > >> > > > > > >> > > > > >> > >>>> lifecycle management of external
> clients.
> > >> If
> > >> > > there
> > >> > > > > is
> > >> > > > > > a
> > >> > > > > > >> > > generic
> > >> > > > > > >> > > > > >> Kafka
> > >> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a
> > new
> > >> > > > connector
> > >> > > > > > into
> > >> > > > > > >> > and
> > >> > > > > > >> > > > > have
> > >> > > > > > >> > > > > >> a
> > >> > > > > > >> > > > > >> > >> lot of
> > >> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and
> > reliability
> > >> > done
> > >> > > > for
> > >> > > > > me
> > >> > > > > > >> then
> > >> > > > > > >> > > it
> > >> > > > > > >> > > > > >> gives
> > >> > > > > > >> > > > > >> > me
> > >> > > > > > >> > > > > >> > >> all
> > >> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers
> > would.
> > >> If
> > >> > > not
> > >> > > > > > then it
> > >> > > > > > >> > > > > >> complicates
> > >> > > > > > >> > > > > >> > my
> > >> > > > > > >> > > > > >> > >>>> operational deployments.
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> Which is similar to my other question
> > with
> > >> the
> > >> > > > > > proposal
> > >> > > > > > >> --
> > >> > > > > > >> > if
> > >> > > > > > >> > > > we
> > >> > > > > > >> > > > > >> > build a
> > >> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus
> > the
> > >> > > > requisite
> > >> > > > > > >> shims
> > >> > > > > > >> > to
> > >> > > > > > >> > > > > >> > integrate
> > >> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former
> may
> > >> be a
> > >> > > lot
> > >> > > > > more
> > >> > > > > > >> work
> > >> > > > > > >> > > > than
> > >> > > > > > >> > > > > we
> > >> > > > > > >> > > > > >> > >> think.
> > >> > > > > > >> > > > > >> > >>>> We may make it much easier for a
> newcomer
> > >> to
> > >> > get
> > >> > > > > > >> something
> > >> > > > > > >> > > > > running
> > >> > > > > > >> > > > > >> but
> > >> > > > > > >> > > > > >> > >>>> having them step up and get a reliable
> > >> > > production
> > >> > > > > > >> > deployment
> > >> > > > > > >> > > > may
> > >> > > > > > >> > > > > >> still
> > >> > > > > > >> > > > > >> > >>>> dominate mailing list  traffic, if for
> > >> > different
> > >> > > > > > reasons
> > >> > > > > > >> > than
> > >> > > > > > >> > > > > >> today.
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable
> > with
> > >> > > making
> > >> > > > > the
> > >> > > > > > >> Samza
> > >> > > > > > >> > > > > >> dependency
> > >> > > > > > >> > > > > >> > >> on
> > >> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I
> absolutely
> > >> see
> > >> > > the
> > >> > > > > > >> benefits
> > >> > > > > > >> > > in
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing
> > >> > > > > > >> > > > terminologies/abstractions
> > >> > > > > > >> > > > > >> that
> > >> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library
> > >> would
> > >> > > > likely
> > >> > > > > > be a
> > >> > > > > > >> > very
> > >> > > > > > >> > > > > nice
> > >> > > > > > >> > > > > >> > tool
> > >> > > > > > >> > > > > >> > >> to
> > >> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have
> > the
> > >> > > > concerns
> > >> > > > > > >> above
> > >> > > > > > >> > re
> > >> > > > > > >> > > > the
> > >> > > > > > >> > > > > >> > >>>> operational side.
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> Garry
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> -----Original Message-----
> > >> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales
> > >> [mailto:
> > >> > > > > > >> > [email protected]
> > >> > > > > > >> > > ]
> > >> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
> > >> > > > > > >> > > > > >> > >>>> To: [email protected]
> > >> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations
> on
> > >> > Samza
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> Very interesting thoughts.
> > >> > > > > > >> > > > > >> > >>>> From outside, I have always perceived
> > Samza
> > >> > as a
> > >> > > > > > >> computing
> > >> > > > > > >> > > > layer
> > >> > > > > > >> > > > > >> over
> > >> > > > > > >> > > > > >> > >>>> Kafka.
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative,
> is
> > >> > > "should
> > >> > > > > > Samza
> > >> > > > > > >> be
> > >> > > > > > >> > a
> > >> > > > > > >> > > > > >> > sub-project
> > >> > > > > > >> > > > > >> > >>>> of Kafka then?"
> > >> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a
> > >> separate
> > >> > > > > project
> > >> > > > > > >> > with a
> > >> > > > > > >> > > > > >> separate
> > >> > > > > > >> > > > > >> > >>>> governance?
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> Cheers,
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> --
> > >> > > > > > >> > > > > >> > >>>> Gianmarco
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <
> > >> > > > > > [email protected]>
> > >> > > > > > >> > > > wrote:
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka
> > more
> > >> > > > tightly.
> > >> > > > > > >> > Because
> > >> > > > > > >> > > > > Samza
> > >> > > > > > >> > > > > >> de
> > >> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should
> > >> > leverage
> > >> > > > > what
> > >> > > > > > >> Kafka
> > >> > > > > > >> > > > has.
> > >> > > > > > >> > > > > At
> > >> > > > > > >> > > > > >> > the
> > >> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to
> > reinvent
> > >> > what
> > >> > > > > Samza
> > >> > > > > > >> > > already
> > >> > > > > > >> > > > > >> has. I
> > >> > > > > > >> > > > > >> > >>>>> also like the idea of separating the
> > >> > ingestion
> > >> > > > and
> > >> > > > > > >> > > > > transformation.
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to
> > >> image
> > >> > > how
> > >> > > > > the
> > >> > > > > > >> Samza
> > >> > > > > > >> > > > will
> > >> > > > > > >> > > > > >> look
> > >> > > > > > >> > > > > >> > >>>> like.
> > >> > > > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little
> > >> > > difference
> > >> > > > > in
> > >> > > > > > >> terms
> > >> > > > > > >> > > of
> > >> > > > > > >> > > > > how
> > >> > > > > > >> > > > > >> > >>>>> Samza should look like.
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code
> > >> shows
> > >> > (A
> > >> > > > > > client of
> > >> > > > > > >> > > > Kakfa)
> > >> > > > > > >> > > > > ?
> > >> > > > > > >> > > > > >> And
> > >> > > > > > >> > > > > >> > >>>>> user's application code calls this
> > client?
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of
> > Kafka
> > >> > (like
> > >> > > > > what
> > >> > > > > > the
> > >> > > > > > >> > > code
> > >> > > > > > >> > > > > >> shows),
> > >> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and
> > >> > > > > fault-tolerance?
> > >> > > > > > >> Are
> > >> > > > > > >> > > they
> > >> > > > > > >> > > > > >> taken
> > >> > > > > > >> > > > > >> > >>>>> care by the Kafka broker or other
> > >> mechanism,
> > >> > > such
> > >> > > > > as
> > >> > > > > > >> > "Samza
> > >> > > > > > >> > > > > >> worker"
> > >> > > > > > >> > > > > >> > >>>>> (just make up the name) ?
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as
> > >> > > > auto-scaling,
> > >> > > > > > >> shared
> > >> > > > > > >> > > > > state,
> > >> > > > > > >> > > > > >> > >>>>> monitoring?
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is
> > this
> > >> > what
> > >> > > > > Chris
> > >> > > > > > >> > > > suggests?)
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from
> > Kakfa
> > >> > and
> > >> > > > > > produce
> > >> > > > > > >> to
> > >> > > > > > >> > > it.
> > >> > > > > > >> > > > > >> Then it
> > >> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks
> > like
> > >> > now,
> > >> > > > > > except it
> > >> > > > > > >> > > does
> > >> > > > > > >> > > > > not
> > >> > > > > > >> > > > > >> > rely
> > >> > > > > > >> > > > > >> > >>>>> on Yarn anymore.
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it
> > >> leverage
> > >> > > > Kafka's
> > >> > > > > > >> > metrics,
> > >> > > > > > >> > > > > logs,
> > >> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> Thanks,
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> Fang, Yan
> > >> > > > > > >> > > > > >> > >>>>> [email protected]
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM,
> Guozhang
> > >> > Wang <
> > >> > > > > > >> > > > > [email protected]
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> > >>>> wrote:
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>>> Read through the code example and it
> > >> looks
> > >> > > good
> > >> > > > to
> > >> > > > > > me.
> > >> > > > > > >> A
> > >> > > > > > >> > > few
> > >> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment:
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable
> > >> runnable
> > >> > > like:
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh
> > >> > > --config-factory=...
> > >> > > > > > >> > > > > >> > >>>> --config-path=file://...
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> And this proposal advocate for
> > deploying
> > >> > Samza
> > >> > > > > more
> > >> > > > > > as
> > >> > > > > > >> > > > embedded
> > >> > > > > > >> > > > > >> > >>>>>> libraries in user application code
> > >> (ignoring
> > >> > > the
> > >> > > > > > >> > > terminology
> > >> > > > > > >> > > > > >> since
> > >> > > > > > >> > > > > >> > >>>>>> it is not the
> > >> > > > > > >> > > > > >> > >>>>> same
> > >> > > > > > >> > > > > >> > >>>>>> as the prototype code):
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> StreamTask task = new
> > >> MyStreamTask(configs);
> > >> > > > > Thread
> > >> > > > > > >> > thread
> > >> > > > > > >> > > =
> > >> > > > > > >> > > > > new
> > >> > > > > > >> > > > > >> > >>>>>> Thread(task); thread.start();
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> I think both of these deployment
> modes
> > >> are
> > >> > > > > important
> > >> > > > > > >> for
> > >> > > > > > >> > > > > >> different
> > >> > > > > > >> > > > > >> > >>>>>> types
> > >> > > > > > >> > > > > >> > >>>>> of
> > >> > > > > > >> > > > > >> > >>>>>> users. That said, I think making
> Samza
> > >> > purely
> > >> > > > > > >> standalone
> > >> > > > > > >> > is
> > >> > > > > > >> > > > > still
> > >> > > > > > >> > > > > >> > >>>>>> sufficient for either runnable or
> > library
> > >> > > modes.
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> Guozhang
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay
> > >> Kreps
> > >> > <
> > >> > > > > > >> > > > [email protected]>
> > >> > > > > > >> > > > > >> > wrote:
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code
> > >> example,
> > >> > it
> > >> > > > was
> > >> > > > > > >> > supposed
> > >> > > > > > >> > > > to
> > >> > > > > > >> > > > > >> look
> > >> > > > > > >> > > > > >> > >>>>>>> like
> > >> > > > > > >> > > > > >> > >>>>>>> this:
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties();
> > >> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers",
> > >> > > > "localhost:4242");
> > >> > > > > > >> > > > > >> StreamingConfig
> > >> > > > > > >> > > > > >> > >>>>>>> config = new StreamingConfig(props);
> > >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
> > >> > > > "test-topic-2");
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > config.processor(ExampleStreamProcessor.class);
> > >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
> > >> > StringSerializer(),
> > >> > > > new
> > >> > > > > > >> > > > > >> > >>>>>>> StringDeserializer());
> KafkaStreaming
> > >> > > > container =
> > >> > > > > > new
> > >> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming(config);
> > container.run();
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>> -Jay
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM,
> Jay
> > >> > Kreps <
> > >> > > > > > >> > > > [email protected]
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > >> > >>>> wrote:
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> Hey guys,
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations
> > >> Chris
> > >> > > and
> > >> > > > I
> > >> > > > > > were
> > >> > > > > > >> > > having
> > >> > > > > > >> > > > > >> > >>>>>>>> around
> > >> > > > > > >> > > > > >> > >>>>>>> whether
> > >> > > > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza
> as a
> > >> kind
> > >> > > of
> > >> > > > > data
> > >> > > > > > >> > > > ingestion
> > >> > > > > > >> > > > > >> > >>>>> framework
> > >> > > > > > >> > > > > >> > >>>>>>> for
> > >> > > > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to
> > KIP-26
> > >> > > > > "copycat").
> > >> > > > > > >> This
> > >> > > > > > >> > > > kind
> > >> > > > > > >> > > > > of
> > >> > > > > > >> > > > > >> > >>>>>> combined
> > >> > > > > > >> > > > > >> > >>>>>>>> with complaints around config and
> > YARN
> > >> and
> > >> > > the
> > >> > > > > > >> > discussion
> > >> > > > > > >> > > > > >> around
> > >> > > > > > >> > > > > >> > >>>>>>>> how
> > >> > > > > > >> > > > > >> > >>>>> to
> > >> > > > > > >> > > > > >> > >>>>>>>> best do a standalone mode.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was,
> given
> > >> that
> > >> > > > Samza
> > >> > > > > > was
> > >> > > > > > >> > > > basically
> > >> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific,
> what
> > if
> > >> > you
> > >> > > > just
> > >> > > > > > >> > embraced
> > >> > > > > > >> > > > > that
> > >> > > > > > >> > > > > >> > >>>>>>>> and turned it
> > >> > > > > > >> > > > > >> > >>>>>> into
> > >> > > > > > >> > > > > >> > >>>>>>>> something less like a heavyweight
> > >> > framework
> > >> > > > and
> > >> > > > > > more
> > >> > > > > > >> > > like a
> > >> > > > > > >> > > > > >> > >>>>>>>> third
> > >> > > > > > >> > > > > >> > >>>>> Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>> client--a kind of "producing
> > consumer"
> > >> > with
> > >> > > > > state
> > >> > > > > > >> > > > management
> > >> > > > > > >> > > > > >> > >>>>>> facilities.
> > >> > > > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a
> > >> complex
> > >> > > > stream
> > >> > > > > > >> > > processing
> > >> > > > > > >> > > > > >> > >>>>>>>> framework
> > >> > > > > > >> > > > > >> > >>>>>>> this
> > >> > > > > > >> > > > > >> > >>>>>>>> would actually be a very simple
> > thing,
> > >> not
> > >> > > > much
> > >> > > > > > more
> > >> > > > > > >> > > > > >> complicated
> > >> > > > > > >> > > > > >> > >>>>>>>> to
> > >> > > > > > >> > > > > >> > >>>>> use
> > >> > > > > > >> > > > > >> > >>>>>>> or
> > >> > > > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As
> > Chris
> > >> > said
> > >> > > > we
> > >> > > > > > >> thought
> > >> > > > > > >> > > > about
> > >> > > > > > >> > > > > >> it
> > >> > > > > > >> > > > > >> > >>>>>>>> a
> > >> > > > > > >> > > > > >> > >>>>> lot
> > >> > > > > > >> > > > > >> > >>>>>> of
> > >> > > > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream
> > >> > processing
> > >> > > > > > systems
> > >> > > > > > >> > were
> > >> > > > > > >> > > > > doing)
> > >> > > > > > >> > > > > >> > >>>>> seemed
> > >> > > > > > >> > > > > >> > >>>>>>> like
> > >> > > > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output
> > >> data
> > >> > to
> > >> > > > and
> > >> > > > > > from
> > >> > > > > > >> > the
> > >> > > > > > >> > > > > stream
> > >> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually
> > looked
> > >> > into
> > >> > > > how
> > >> > > > > > that
> > >> > > > > > >> > > would
> > >> > > > > > >> > > > > >> > >>>>>>>> work,
> > >> > > > > > >> > > > > >> > >>>>> Samza
> > >> > > > > > >> > > > > >> > >>>>>>>> isn't really an ideal data
> ingestion
> > >> > > framework
> > >> > > > > > for a
> > >> > > > > > >> > > bunch
> > >> > > > > > >> > > > of
> > >> > > > > > >> > > > > >> > >>>>> reasons.
> > >> > > > > > >> > > > > >> > >>>>>> To
> > >> > > > > > >> > > > > >> > >>>>>>>> really do that right you need a
> > pretty
> > >> > > > different
> > >> > > > > > >> > internal
> > >> > > > > > >> > > > > data
> > >> > > > > > >> > > > > >> > >>>>>>>> model
> > >> > > > > > >> > > > > >> > >>>>>> and
> > >> > > > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split
> > them
> > >> and
> > >> > > had
> > >> > > > > an
> > >> > > > > > api
> > >> > > > > > >> > for
> > >> > > > > > >> > > > > Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26)
> > >> and a
> > >> > > > > separate
> > >> > > > > > >> api
> > >> > > > > > >> > > for
> > >> > > > > > >> > > > > >> Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>> transformation (Samza).
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> This would also allow really
> > embracing
> > >> the
> > >> > > > same
> > >> > > > > > >> > > terminology
> > >> > > > > > >> > > > > and
> > >> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about
> the
> > >> > current
> > >> > > > > > state is
> > >> > > > > > >> > > that
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> > >>>>>>>> two
> > >> > > > > > >> > > > > >> > >>>>>>> systems
> > >> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology
> > >> like
> > >> > > > > "stream"
> > >> > > > > > vs
> > >> > > > > > >> > > > "topic"
> > >> > > > > > >> > > > > >> and
> > >> > > > > > >> > > > > >> > >>>>>>> different
> > >> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems means
> > you
> > >> > kind
> > >> > > > of
> > >> > > > > > have
> > >> > > > > > >> to
> > >> > > > > > >> > > > learn
> > >> > > > > > >> > > > > >> > >>>>>>>> Kafka's
> > >> > > > > > >> > > > > >> > >>>>>>> way,
> > >> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly
> different
> > >> way,
> > >> > > > then
> > >> > > > > > kind
> > >> > > > > > >> of
> > >> > > > > > >> > > > > >> > >>>>>>>> understand
> > >> > > > > > >> > > > > >> > >>>>> how
> > >> > > > > > >> > > > > >> > >>>>>>> they
> > >> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having
> > walked
> > >> a
> > >> > few
> > >> > > > > > people
> > >> > > > > > >> > > through
> > >> > > > > > >> > > > > >> this
> > >> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to
> > >> get.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of
> > >> time
> > >> > on
> > >> > > > > > >> airplanes I
> > >> > > > > > >> > > > > hacked
> > >> > > > > > >> > > > > >> > >>>>>>>> up an ernest but still somewhat
> > >> incomplete
> > >> > > > > > prototype
> > >> > > > > > >> of
> > >> > > > > > >> > > > what
> > >> > > > > > >> > > > > >> > >>>>>>>> this would
> > >> > > > > > >> > > > > >> > >>>>> look
> > >> > > > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously
> > >> dumped
> > >> > > into
> > >> > > > > > Kafka
> > >> > > > > > >> as
> > >> > > > > > >> > > it
> > >> > > > > > >> > > > > >> > >>>>>>>> required a
> > >> > > > > > >> > > > > >> > >>>>>> few
> > >> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here
> is
> > >> the
> > >> > > code:
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > >
> > >> > > > > > >> >
> > >> > > > > >
> > >> > >
> > >>
> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
> > >> > > > > > >> > > > > >> > >>>>> /apache/kafka/clients/streaming
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I
> > just
> > >> > > > > liberally
> > >> > > > > > >> > renamed
> > >> > > > > > >> > > > > >> > >>>>>>>> everything
> > >> > > > > > >> > > > > >> > >>>>> to
> > >> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no
> > >> regard
> > >> > > for
> > >> > > > > > >> > > > compatibility.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> To use this would be something like
> > >> this:
> > >> > > > > > >> > > > > >> > >>>>>>>> Properties props = new
> Properties();
> > >> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers",
> > >> > > > > "localhost:4242");
> > >> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new
> > >> > > > > > >> > > > > >> > >>>>> StreamingConfig(props);
> > >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
> > >> > > > > > >> > > > > >> > >>>>>>>> "test-topic-2");
> > >> > > > > > >> > > > > >>
> config.processor(ExampleStreamProcessor.class);
> > >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
> > >> > > > > > >> > > > > >> > >>>>>>>> StringSerializer(), new
> > >> > > StringDeserializer());
> > >> > > > > > >> > > > KafkaStreaming
> > >> > > > > > >> > > > > >> > >>>>>> container =
> > >> > > > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config);
> > >> > container.run();
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
> > >> > > > SamzaContainer;
> > >> > > > > > >> > > > > StreamProcessor
> > >> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the
> class
> > >> names
> > >> > > in
> > >> > > > a
> > >> > > > > > file
> > >> > > > > > >> > and
> > >> > > > > > >> > > > then
> > >> > > > > > >> > > > > >> > >>>>>>>> having
> > >> > > > > > >> > > > > >> > >>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you
> just
> > >> > > > > instantiate
> > >> > > > > > the
> > >> > > > > > >> > > > > container
> > >> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced
> > over
> > >> > > > however
> > >> > > > > > many
> > >> > > > > > >> > > > > instances
> > >> > > > > > >> > > > > >> > >>>>>>>> of
> > >> > > > > > >> > > > > >> > >>>>> this
> > >> > > > > > >> > > > > >> > >>>>>>> are
> > >> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an
> > instance
> > >> > dies,
> > >> > > > new
> > >> > > > > > >> tasks
> > >> > > > > > >> > > are
> > >> > > > > > >> > > > > >> added
> > >> > > > > > >> > > > > >> > >>>>>>>> to
> > >> > > > > > >> > > > > >> > >>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>>> existing containers without
> shutting
> > >> them
> > >> > > > down).
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for
> > running
> > >> > this
> > >> > > > > stuff
> > >> > > > > > in
> > >> > > > > > >> > YARN
> > >> > > > > > >> > > > via
> > >> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS
> > >> using
> > >> > > some
> > >> > > > > of
> > >> > > > > > >> their
> > >> > > > > > >> > > > tools
> > >> > > > > > >> > > > > >> > >>>>>>>> but from the
> > >> > > > > > >> > > > > >> > >>>>>> point
> > >> > > > > > >> > > > > >> > >>>>>>> of
> > >> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these
> stream
> > >> > > > processing
> > >> > > > > > jobs
> > >> > > > > > >> > are
> > >> > > > > > >> > > > > just
> > >> > > > > > >> > > > > >> > >>>>>> stateless
> > >> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and
> > >> expand
> > >> > and
> > >> > > > > > contract
> > >> > > > > > >> > at
> > >> > > > > > >> > > > > will.
> > >> > > > > > >> > > > > >> > >>>>>>>> There
> > >> > > > > > >> > > > > >> > >>>>> is
> > >> > > > > > >> > > > > >> > >>>>>>> no
> > >> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code,
> > it
> > >> > would
> > >> > > > get
> > >> > > > > > >> larger
> > >> > > > > > >> > > if
> > >> > > > > > >> > > > we
> > >> > > > > > >> > > > > >> > >>>>>>>>  productionized but not vastly
> > larger.
> > >> We
> > >> > > > really
> > >> > > > > > do
> > >> > > > > > >> > get a
> > >> > > > > > >> > > > ton
> > >> > > > > > >> > > > > >> > >>>>>>>> of
> > >> > > > > > >> > > > > >> > >>>>>>> leverage
> > >> > > > > > >> > > > > >> > >>>>>>>>  out of Kafka.
> > >> > > > > > >> > > > > >> > >>>>>>>>  2. Partition management is fully
> > >> > delegated
> > >> > > to
> > >> > > > > the
> > >> > > > > > >> new
> > >> > > > > > >> > > > > >> consumer.
> > >> > > > > > >> > > > > >> > >>>>> This
> > >> > > > > > >> > > > > >> > >>>>>>>>  is nice since now any partition
> > >> > management
> > >> > > > > > strategy
> > >> > > > > > >> > > > > available
> > >> > > > > > >> > > > > >> > >>>>>>>> to
> > >> > > > > > >> > > > > >> > >>>>>> Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>>  consumer is also available to
> Samza
> > >> (and
> > >> > > vice
> > >> > > > > > versa)
> > >> > > > > > >> > and
> > >> > > > > > >> > > > > with
> > >> > > > > > >> > > > > >> > >>>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>> exact
> > >> > > > > > >> > > > > >> > >>>>>>>>  same configs.
> > >> > > > > > >> > > > > >> > >>>>>>>>  3. It supports state as well as
> > state
> > >> > reuse
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is
> > >> > thought
> > >> > > > > > >> provoking.
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> -Jay
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM,
> > Chris
> > >> > > > > Riccomini <
> > >> > > > > > >> > > > > >> > >>>>>> [email protected]>
> > >> > > > > > >> > > > > >> > >>>>>>>> wrote:
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Hey all,
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with
> > Samza
> > >> > > > > engineers
> > >> > > > > > at
> > >> > > > > > >> > > > LinkedIn
> > >> > > > > > >> > > > > >> > >>>>>>>>> and
> > >> > > > > > >> > > > > >> > >>>>>>> Confluent
> > >> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few
> > observations
> > >> > and
> > >> > > > > would
> > >> > > > > > >> like
> > >> > > > > > >> > to
> > >> > > > > > >> > > > > >> > >>>>>>>>> propose
> > >> > > > > > >> > > > > >> > >>>>> some
> > >> > > > > > >> > > > > >> > >>>>>>>>> changes.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I
> > >> want to
> > >> > > > call
> > >> > > > > > out
> > >> > > > > > >> > about
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza's
> > >> > > > > > >> > > > > >> > >>>>>> design,
> > >> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some
> > changes.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a
> dynamic
> > >> > > > deployment
> > >> > > > > > >> system.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza's
> > >> SystemConsumer/SystemProducer
> > >> > and
> > >> > > > > > Kafka's
> > >> > > > > > >> > > > consumer
> > >> > > > > > >> > > > > >> > >>>>>>>>> APIs
> > >> > > > > > >> > > > > >> > >>>>> are
> > >> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same
> > >> > problems.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are
> > related,
> > >> > but
> > >> > > > I'll
> > >> > > > > > >> > address
> > >> > > > > > >> > > > them
> > >> > > > > > >> > > > > >> in
> > >> > > > > > >> > > > > >> > >>>>> order.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Deployment
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use
> > of a
> > >> > > > dynamic
> > >> > > > > > >> > > deployment
> > >> > > > > > >> > > > > >> > >>>>>>>>> scheduler
> > >> > > > > > >> > > > > >> > >>>>>> such
> > >> > > > > > >> > > > > >> > >>>>>>>>> as
> > >> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we
> initially
> > >> built
> > >> > > > > Samza,
> > >> > > > > > we
> > >> > > > > > >> > bet
> > >> > > > > > >> > > > that
> > >> > > > > > >> > > > > >> > >>>>>>>>> there
> > >> > > > > > >> > > > > >> > >>>>>> would
> > >> > > > > > >> > > > > >> > >>>>>>>>> be
> > >> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area,
> and
> > >> we
> > >> > > could
> > >> > > > > > >> support
> > >> > > > > > >> > > > them,
> > >> > > > > > >> > > > > >> and
> > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>> rest
> > >> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there
> are
> > >> many
> > >> > > > > > >> variations.
> > >> > > > > > >> > > > > >> > >>>>>>>>> Furthermore,
> > >> > > > > > >> > > > > >> > >>>>>> many
> > >> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just start
> > >> their
> > >> > > > > > processors
> > >> > > > > > >> > like
> > >> > > > > > >> > > > > normal
> > >> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use
> traditional
> > >> > > > deployment
> > >> > > > > > >> scripts
> > >> > > > > > >> > > > such
> > >> > > > > > >> > > > > as
> > >> > > > > > >> > > > > >> > >>>>>>>>> Fabric,
> > >> > > > > > >> > > > > >> > >>>>>> Chef,
> > >> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment
> > >> system
> > >> > > on
> > >> > > > > > users
> > >> > > > > > >> > makes
> > >> > > > > > >> > > > the
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really
> > painful
> > >> for
> > >> > > > first
> > >> > > > > > time
> > >> > > > > > >> > > > users.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a
> requirement
> > >> was
> > >> > > also
> > >> > > > a
> > >> > > > > > bit
> > >> > > > > > >> of
> > >> > > > > > >> > a
> > >> > > > > > >> > > > > >> > >>>>>>>>> mis-fire
> > >> > > > > > >> > > > > >> > >>>>>> because
> > >> > > > > > >> > > > > >> > >>>>>>>>> of
> > >> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding
> > between
> > >> > the
> > >> > > > > > nature of
> > >> > > > > > >> > > batch
> > >> > > > > > >> > > > > >> jobs
> > >> > > > > > >> > > > > >> > >>>>>>>>> and
> > >> > > > > > >> > > > > >> > >>>>>>> stream
> > >> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made
> > >> > > conscious
> > >> > > > > > effort
> > >> > > > > > >> to
> > >> > > > > > >> > > > favor
> > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>> Hadoop
> > >> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things,
> > >> since
> > >> > it
> > >> > > > > worked
> > >> > > > > > >> and
> > >> > > > > > >> > > was
> > >> > > > > > >> > > > > well
> > >> > > > > > >> > > > > >> > >>>>>>> understood.
> > >> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that
> > >> batch
> > >> > > jobs
> > >> > > > > > have a
> > >> > > > > > >> > > > definite
> > >> > > > > > >> > > > > >> > >>>>>> beginning,
> > >> > > > > > >> > > > > >> > >>>>>>>>> and
> > >> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs
> > don't
> > >> > > > > (usually).
> > >> > > > > > >> This
> > >> > > > > > >> > > > leads
> > >> > > > > > >> > > > > to
> > >> > > > > > >> > > > > >> > >>>>>>>>> a
> > >> > > > > > >> > > > > >> > >>>>> much
> > >> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for
> > stream
> > >> > > > > processors.
> > >> > > > > > >> You
> > >> > > > > > >> > > > > >> basically
> > >> > > > > > >> > > > > >> > >>>>>>>>> just
> > >> > > > > > >> > > > > >> > >>>>>>> need
> > >> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the
> > >> processor,
> > >> > and
> > >> > > > > start
> > >> > > > > > >> it.
> > >> > > > > > >> > > The
> > >> > > > > > >> > > > > way
> > >> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's
> > no
> > >> > > concept
> > >> > > > > of
> > >> > > > > > a
> > >> > > > > > >> > > cluster
> > >> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always
> > >> > > > > > >> > > > > >> > >>>>>> add
> > >> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with
> > >> coupling
> > >> > > > Samza
> > >> > > > > > with
> > >> > > > > > >> a
> > >> > > > > > >> > > > > >> scheduler
> > >> > > > > > >> > > > > >> > >>>>>>>>> is
> > >> > > > > > >> > > > > >> > >>>>>> that
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to
> > >> handle
> > >> > > > > > deployment.
> > >> > > > > > >> > > This
> > >> > > > > > >> > > > > >> pulls
> > >> > > > > > >> > > > > >> > >>>>>>>>> in a
> > >> > > > > > >> > > > > >> > >>>>>>> bunch
> > >> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration
> > >> > > distribution
> > >> > > > > > (config
> > >> > > > > > >> > > > > stream),
> > >> > > > > > >> > > > > >> > >>>>>>>>> shell
> > >> > > > > > >> > > > > >> > >>>>>>> scrips
> > >> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner),
> > packaging
> > >> > (all
> > >> > > > the
> > >> > > > > > .tgz
> > >> > > > > > >> > > > stuff),
> > >> > > > > > >> > > > > >> etc.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring
> dynamic
> > >> > > > deployment
> > >> > > > > > was
> > >> > > > > > >> to
> > >> > > > > > >> > > > > support
> > >> > > > > > >> > > > > >> > >>>>>>>>> data locality. If you want to have
> > >> > > locality,
> > >> > > > > you
> > >> > > > > > >> need
> > >> > > > > > >> > to
> > >> > > > > > >> > > > put
> > >> > > > > > >> > > > > >> > >>>>>>>>> your
> > >> > > > > > >> > > > > >> > >>>>>> processors
> > >> > > > > > >> > > > > >> > >>>>>>>>> close to the data they're
> > processing.
> > >> > Upon
> > >> > > > > > further
> > >> > > > > > >> > > > > >> > >>>>>>>>> investigation,
> > >> > > > > > >> > > > > >> > >>>>>>> though,
> > >> > > > > > >> > > > > >> > >>>>>>>>> this feature is not that
> beneficial.
> > >> > There
> > >> > > is
> > >> > > > > > some
> > >> > > > > > >> > good
> > >> > > > > > >> > > > > >> > >>>>>>>>> discussion
> > >> > > > > > >> > > > > >> > >>>>>> about
> > >> > > > > > >> > > > > >> > >>>>>>>>> some problems with it on
> SAMZA-335.
> > >> > Again,
> > >> > > we
> > >> > > > > > took
> > >> > > > > > >> the
> > >> > > > > > >> > > > > >> > >>>>>>>>> Map/Reduce
> > >> > > > > > >> > > > > >> > >>>>>> path,
> > >> > > > > > >> > > > > >> > >>>>>>>>> but
> > >> > > > > > >> > > > > >> > >>>>>>>>> there are some fundamental
> > differences
> > >> > > > between
> > >> > > > > > HDFS
> > >> > > > > > >> > and
> > >> > > > > > >> > > > > Kafka.
> > >> > > > > > >> > > > > >> > >>>>>>>>> HDFS
> > >> > > > > > >> > > > > >> > >>>>>> has
> > >> > > > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has
> partitions.
> > >> This
> > >> > > > leads
> > >> > > > > to
> > >> > > > > > >> less
> > >> > > > > > >> > > > > >> > >>>>>>>>> optimization potential with stream
> > >> > > processors
> > >> > > > > on
> > >> > > > > > top
> > >> > > > > > >> > of
> > >> > > > > > >> > > > > Kafka.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a
> > crutch.
> > >> > > Samza
> > >> > > > > > doesn't
> > >> > > > > > >> > > have
> > >> > > > > > >> > > > > any
> > >> > > > > > >> > > > > >> > >>>>>>>>> built
> > >> > > > > > >> > > > > >> > >>>>> in
> > >> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it
> > >> > depends
> > >> > > on
> > >> > > > > the
> > >> > > > > > >> > > dynamic
> > >> > > > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to
> > handle
> > >> > > > restarts
> > >> > > > > > >> when a
> > >> > > > > > >> > > > > >> > >>>>>>>>> processor dies. This has
> > >> > > > > > >> > > > > >> > >>>>>>> made
> > >> > > > > > >> > > > > >> > >>>>>>>>> it very difficult to write a
> > >> standalone
> > >> > > Samza
> > >> > > > > > >> > container
> > >> > > > > > >> > > > > >> > >>>> (SAMZA-516).
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Pluggability
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is
> good,
> > >> but I
> > >> > > > think
> > >> > > > > > that
> > >> > > > > > >> > > we've
> > >> > > > > > >> > > > > >> gone
> > >> > > > > > >> > > > > >> > >>>>>>>>> too
> > >> > > > > > >> > > > > >> > >>>>>> far
> > >> > > > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has:
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems
> > >> > > > (SystemConsumer,
> > >> > > > > > >> > > > > SystemProducer,
> > >> > > > > > >> > > > > >> > >>>> etc).
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just
> > about
> > >> > every
> > >> > > > > > >> component
> > >> > > > > > >> > > > > >> > >>>>> (MessageChooser,
> > >> > > > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper,
> > >> > > ConfigRewriter,
> > >> > > > > > etc).
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've
> > >> > forgotten,
> > >> > > as
> > >> > > > > > well.
> > >> > > > > > >> > Some
> > >> > > > > > >> > > > of
> > >> > > > > > >> > > > > >> > >>>>>>>>> these
> > >> > > > > > >> > > > > >> > >>>>> are
> > >> > > > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not
> to
> > >> be.
> > >> > > This
> > >> > > > > all
> > >> > > > > > >> comes
> > >> > > > > > >> > > at
> > >> > > > > > >> > > > a
> > >> > > > > > >> > > > > >> cost:
> > >> > > > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is
> > making
> > >> it
> > >> > > > harder
> > >> > > > > > for
> > >> > > > > > >> > our
> > >> > > > > > >> > > > > users
> > >> > > > > > >> > > > > >> > >>>>>>>>> to
> > >> > > > > > >> > > > > >> > >>>>> pick
> > >> > > > > > >> > > > > >> > >>>>>> up
> > >> > > > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It
> > also
> > >> > makes
> > >> > > > it
> > >> > > > > > >> > difficult
> > >> > > > > > >> > > > for
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about
> > what
> > >> the
> > >> > > > > > >> > > characteristics
> > >> > > > > > >> > > > of
> > >> > > > > > >> > > > > >> > >>>>>>>>> the container (since the
> > >> characteristics
> > >> > > > change
> > >> > > > > > >> > > depending
> > >> > > > > > >> > > > on
> > >> > > > > > >> > > > > >> > >>>>>>>>> which plugins are use).
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are
> > most
> > >> > > visible
> > >> > > > > in
> > >> > > > > > the
> > >> > > > > > >> > > > System
> > >> > > > > > >> > > > > >> APIs.
> > >> > > > > > >> > > > > >> > >>>>> What
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza really requires to be
> > >> functional is
> > >> > > > Kafka
> > >> > > > > > as
> > >> > > > > > >> its
> > >> > > > > > >> > > > > >> > >>>>>>>>> transport
> > >> > > > > > >> > > > > >> > >>>>>> layer.
> > >> > > > > > >> > > > > >> > >>>>>>>>> But
> > >> > > > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use
> > >> cases
> > >> > > into
> > >> > > > > one
> > >> > > > > > >> API:
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
> > >> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports
> both
> > >> of
> > >> > > these
> > >> > > > > use
> > >> > > > > > >> > cases.
> > >> > > > > > >> > > > The
> > >> > > > > > >> > > > > >> > >>>>>>>>> problem
> > >> > > > > > >> > > > > >> > >>>>>> is,
> > >> > > > > > >> > > > > >> > >>>>>>>>> we
> > >> > > > > > >> > > > > >> > >>>>>>>>> actually want different features
> for
> > >> each
> > >> > > use
> > >> > > > > > case.
> > >> > > > > > >> By
> > >> > > > > > >> > > > > >> papering
> > >> > > > > > >> > > > > >> > >>>>>>>>> over
> > >> > > > > > >> > > > > >> > >>>>>>> these
> > >> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a
> > single
> > >> > API,
> > >> > > > > we've
> > >> > > > > > >> > > > introduced
> > >> > > > > > >> > > > > a
> > >> > > > > > >> > > > > >> > >>>>>>>>> ton of
> > >> > > > > > >> > > > > >> > >>>>>>> leaky
> > >> > > > > > >> > > > > >> > >>>>>>>>> abstractions.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like
> > in
> > >> (2)
> > >> > > is
> > >> > > > to
> > >> > > > > > have
> > >> > > > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for
> > >> > offsets
> > >> > > > > (like
> > >> > > > > > >> > Kafka).
> > >> > > > > > >> > > > > This
> > >> > > > > > >> > > > > >> > >>>>>>>>> would be at odds
> > >> > > > > > >> > > > > >> > >>>>> with
> > >> > > > > > >> > > > > >> > >>>>>>> (1),
> > >> > > > > > >> > > > > >> > >>>>>>>>> though, since different systems
> have
> > >> > > > different
> > >> > > > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
> > >> > > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the
> > >> mailing
> > >> > > list
> > >> > > > > and
> > >> > > > > > >> the
> > >> > > > > > >> > > SQL
> > >> > > > > > >> > > > > >> JIRAs
> > >> > > > > > >> > > > > >> > >>>>> about
> > >> > > > > > >> > > > > >> > >>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>>>> need for this.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for
> > >> > > replayability.
> > >> > > > > > Kafka
> > >> > > > > > >> > > allows
> > >> > > > > > >> > > > us
> > >> > > > > > >> > > > > >> to
> > >> > > > > > >> > > > > >> > >>>>> rewind
> > >> > > > > > >> > > > > >> > >>>>>>>>> when
> > >> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other
> > systems
> > >> > > don't.
> > >> > > > In
> > >> > > > > > some
> > >> > > > > > >> > > > cases,
> > >> > > > > > >> > > > > >> > >>>>>>>>> systems
> > >> > > > > > >> > > > > >> > >>>>>>> return
> > >> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g.
> > >> > > > > > >> WikipediaSystemConsumer)
> > >> > > > > > >> > > > > because
> > >> > > > > > >> > > > > >> > >>>>>>>>> they
> > >> > > > > > >> > > > > >> > >>>>>> have
> > >> > > > > > >> > > > > >> > >>>>>>> no
> > >> > > > > > >> > > > > >> > >>>>>>>>> offsets.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example.
> > Kafka
> > >> > > > supports
> > >> > > > > > >> > > > > partitioning,
> > >> > > > > > >> > > > > >> > >>>>>>>>> but
> > >> > > > > > >> > > > > >> > >>>>> many
> > >> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by
> > >> having a
> > >> > > > single
> > >> > > > > > >> > > partition
> > >> > > > > > >> > > > > for
> > >> > > > > > >> > > > > >> > >>>>>>>>> those systems. Still, other
> systems
> > >> model
> > >> > > > > > >> partitioning
> > >> > > > > > >> > > > > >> > >>>> differently (e.g.
> > >> > > > > > >> > > > > >> > >>>>>>>>> Kinesis).
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also
> a
> > >> mess.
> > >> > > > > > Creating
> > >> > > > > > >> > > streams
> > >> > > > > > >> > > > > in
> > >> > > > > > >> > > > > >> a
> > >> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost
> > >> impossible.
> > >> > > As
> > >> > > > is
> > >> > > > > > >> > modeling
> > >> > > > > > >> > > > > >> > >>>>>>>>> metadata
> > >> > > > > > >> > > > > >> > >>>>> for
> > >> > > > > > >> > > > > >> > >>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor,
> > >> partitions,
> > >> > > > > location,
> > >> > > > > > >> > etc).
> > >> > > > > > >> > > > The
> > >> > > > > > >> > > > > >> > >>>>>>>>> list
> > >> > > > > > >> > > > > >> > >>>>> goes
> > >> > > > > > >> > > > > >> > >>>>>>> on.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Duplicate work
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing
> > >> Samza,
> > >> > > > > Kafka's
> > >> > > > > > >> > > consumer
> > >> > > > > > >> > > > > and
> > >> > > > > > >> > > > > >> > >>>>> producer
> > >> > > > > > >> > > > > >> > >>>>>>>>> APIs
> > >> > > > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set.
> > On
> > >> the
> > >> > > > > > >> > consumer-side,
> > >> > > > > > >> > > > you
> > >> > > > > > >> > > > > >> > >>>>>>>>> had two
> > >> > > > > > >> > > > > >> > >>>>>>>>> options: use the high level
> > consumer,
> > >> or
> > >> > > the
> > >> > > > > > simple
> > >> > > > > > >> > > > > consumer.
> > >> > > > > > >> > > > > >> > >>>>>>>>> The
> > >> > > > > > >> > > > > >> > >>>>>>> problem
> > >> > > > > > >> > > > > >> > >>>>>>>>> with the high-level consumer was
> > that
> > >> it
> > >> > > > > > controlled
> > >> > > > > > >> > your
> > >> > > > > > >> > > > > >> > >>>>>>>>> offsets, partition assignments,
> and
> > >> the
> > >> > > order
> > >> > > > > in
> > >> > > > > > >> which
> > >> > > > > > >> > > you
> > >> > > > > > >> > > > > >> > >>>>>>>>> received messages. The
> > >> > > > > > >> > > > > >> > >>>>> problem
> > >> > > > > > >> > > > > >> > >>>>>>>>> with
> > >> > > > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's
> not
> > >> > > simple.
> > >> > > > > It's
> > >> > > > > > >> > basic.
> > >> > > > > > >> > > > You
> > >> > > > > > >> > > > > >> > >>>>>>>>> end up
> > >> > > > > > >> > > > > >> > >>>>>>> having
> > >> > > > > > >> > > > > >> > >>>>>>>>> to handle a lot of really
> low-level
> > >> stuff
> > >> > > > that
> > >> > > > > > you
> > >> > > > > > >> > > > > shouldn't.
> > >> > > > > > >> > > > > >> > >>>>>>>>> We
> > >> > > > > > >> > > > > >> > >>>>>> spent a
> > >> > > > > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's
> > >> > > > KafkaSystemConsumer
> > >> > > > > > very
> > >> > > > > > >> > > > robust.
> > >> > > > > > >> > > > > >> It
> > >> > > > > > >> > > > > >> > >>>>>>>>> also allows us to support some
> cool
> > >> > > features:
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering
> and
> > >> > > > > > prioritization.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition
> > >> assignment
> > >> > > to
> > >> > > > > > support
> > >> > > > > > >> > > > joins,
> > >> > > > > > >> > > > > >> > >>>>>>>>> global
> > >> > > > > > >> > > > > >> > >>>>>> state
> > >> > > > > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)),
> > etc.
> > >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset
> > >> > checkpointing.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time
> > is
> > >> > that
> > >> > > > > these
> > >> > > > > > >> > > features
> > >> > > > > > >> > > > > >> > >>>>>>>>> should
> > >> > > > > > >> > > > > >> > >>>>>>> actually
> > >> > > > > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka
> > consumers
> > >> > (not
> > >> > > > just
> > >> > > > > > >> Samza
> > >> > > > > > >> > > > stream
> > >> > > > > > >> > > > > >> > >>>>>> processors)
> > >> > > > > > >> > > > > >> > >>>>>>>>> end up wanting to do things like
> > joins
> > >> > and
> > >> > > > > > partition
> > >> > > > > > >> > > > > >> > >>>>>>>>> assignment. The
> > >> > > > > > >> > > > > >> > >>>>>>> Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>>> community has come to the same
> > >> > conclusion.
> > >> > > > > > They're
> > >> > > > > > >> > > adding
> > >> > > > > > >> > > > a
> > >> > > > > > >> > > > > >> ton
> > >> > > > > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka
> > >> consumer
> > >> > > > > > >> > > implementation.
> > >> > > > > > >> > > > > To a
> > >> > > > > > >> > > > > >> > >>>>>>>>> large extent,
> > >> > > > > > >> > > > > >> > >>>>> it's
> > >> > > > > > >> > > > > >> > >>>>>>>>> duplicate work to what we've
> already
> > >> done
> > >> > > in
> > >> > > > > > Samza.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up
> > taking
> > >> a
> > >> > > very
> > >> > > > > > similar
> > >> > > > > > >> > > > > approach
> > >> > > > > > >> > > > > >> > >>>>>>>>> to
> > >> > > > > > >> > > > > >> > >>>>>> Samza's
> > >> > > > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager
> > implementation
> > >> for
> > >> > > > > > handling
> > >> > > > > > >> > > offset
> > >> > > > > > >> > > > > >> > >>>>>> checkpointing.
> > >> > > > > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset
> > >> management
> > >> > > > > feature
> > >> > > > > > >> > stores
> > >> > > > > > >> > > > > >> offset
> > >> > > > > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows
> > >> you to
> > >> > > > fetch
> > >> > > > > > them
> > >> > > > > > >> > > from
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> > >>>>>>>>> broker.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste,
> > >> since
> > >> > we
> > >> > > > > could
> > >> > > > > > >> have
> > >> > > > > > >> > > > shared
> > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>> work
> > >> > > > > > >> > > > > >> > >>>>>> if
> > >> > > > > > >> > > > > >> > >>>>>>>>> it
> > >> > > > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the
> > >> get-go.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Vision
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather
> > >> radical
> > >> > > > > > proposal.
> > >> > > > > > >> > Samza
> > >> > > > > > >> > > > is
> > >> > > > > > >> > > > > >> > >>>>> relatively
> > >> > > > > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture
> to
> > >> say
> > >> > > that
> > >> > > > > > we're
> > >> > > > > > >> > > near a
> > >> > > > > > >> > > > > 1.0
> > >> > > > > > >> > > > > >> > >>>>>> release.
> > >> > > > > > >> > > > > >> > >>>>>>>>> I'd
> > >> > > > > > >> > > > > >> > >>>>>>>>> like to propose that we take what
> > >> we've
> > >> > > > > learned,
> > >> > > > > > and
> > >> > > > > > >> > > begin
> > >> > > > > > >> > > > > >> > >>>>>>>>> thinking
> > >> > > > > > >> > > > > >> > >>>>>>> about
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we
> > >> change if
> > >> > > we
> > >> > > > > were
> > >> > > > > > >> > > starting
> > >> > > > > > >> > > > > >> from
> > >> > > > > > >> > > > > >> > >>>>>> scratch?
> > >> > > > > > >> > > > > >> > >>>>>>>>> My
> > >> > > > > > >> > > > > >> > >>>>>>>>> proposal is to:
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the
> *only*
> > >> way
> > >> > to
> > >> > > > run
> > >> > > > > > Samza
> > >> > > > > > >> > > > > >> > >>>>>>>>> processors, and eliminate all
> direct
> > >> > > > > dependences
> > >> > > > > > on
> > >> > > > > > >> > > YARN,
> > >> > > > > > >> > > > > >> Mesos,
> > >> > > > > > >> > > > > >> > >>>> etc.
> > >> > > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to
> support
> > >> only
> > >> > > > Kafka
> > >> > > > > > as
> > >> > > > > > >> the
> > >> > > > > > >> > > > > stream
> > >> > > > > > >> > > > > >> > >>>>>> processing
> > >> > > > > > >> > > > > >> > >>>>>>>>> layer.
> > >> > > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics,
> > logging,
> > >> > > > > > >> serialization,
> > >> > > > > > >> > > and
> > >> > > > > > >> > > > > >> > >>>>>>>>> config
> > >> > > > > > >> > > > > >> > >>>>>>> systems,
> > >> > > > > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues
> > that
> > >> I
> > >> > > > > outlined
> > >> > > > > > >> > above.
> > >> > > > > > >> > > It
> > >> > > > > > >> > > > > >> > >>>>>>>>> should
> > >> > > > > > >> > > > > >> > >>>>> also
> > >> > > > > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty
> > >> > > > dramatically.
> > >> > > > > > >> > > Supporting
> > >> > > > > > >> > > > > >> only
> > >> > > > > > >> > > > > >> > >>>>>>>>> a standalone container will allow
> > >> Samza
> > >> > to
> > >> > > be
> > >> > > > > > >> executed
> > >> > > > > > >> > > on
> > >> > > > > > >> > > > > YARN
> > >> > > > > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using
> > >> > > > Marathon/Aurora),
> > >> > > > > or
> > >> > > > > > >> most
> > >> > > > > > >> > > > other
> > >> > > > > > >> > > > > >> > >>>>>>>>> in-house
> > >> > > > > > >> > > > > >> > >>>>>>> deployment
> > >> > > > > > >> > > > > >> > >>>>>>>>> systems. This should make life a
> lot
> > >> > easier
> > >> > > > for
> > >> > > > > > new
> > >> > > > > > >> > > users.
> > >> > > > > > >> > > > > >> > >>>>>>>>> Imagine
> > >> > > > > > >> > > > > >> > >>>>>>> having
> > >> > > > > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without
> > YARN.
> > >> > The
> > >> > > > drop
> > >> > > > > > in
> > >> > > > > > >> > > mailing
> > >> > > > > > >> > > > > >> list
> > >> > > > > > >> > > > > >> > >>>>>> traffic
> > >> > > > > > >> > > > > >> > >>>>>>>>> will be pretty dramatic.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long
> > >> overdue to
> > >> > > me.
> > >> > > > > The
> > >> > > > > > >> > > reality
> > >> > > > > > >> > > > > is,
> > >> > > > > > >> > > > > >> > >>>>> everyone
> > >> > > > > > >> > > > > >> > >>>>>>>>> that
> > >> > > > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with
> > >> Kafka.
> > >> > We
> > >> > > > > > basically
> > >> > > > > > >> > > > require
> > >> > > > > > >> > > > > >> it
> > >> > > > > > >> > > > > >> > >>>>>> already
> > >> > > > > > >> > > > > >> > >>>>>>> in
> > >> > > > > > >> > > > > >> > >>>>>>>>> order for most features to work.
> > Those
> > >> > that
> > >> > > > are
> > >> > > > > > >> using
> > >> > > > > > >> > > > other
> > >> > > > > > >> > > > > >> > >>>>>>>>> systems
> > >> > > > > > >> > > > > >> > >>>>>> are
> > >> > > > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into
> > >> Kafka
> > >> > > (1),
> > >> > > > > and
> > >> > > > > > >> then
> > >> > > > > > >> > > > they
> > >> > > > > > >> > > > > do
> > >> > > > > > >> > > > > >> > >>>>>>>>> the processing on top. There is
> > >> already
> > >> > > > > > discussion (
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > >
> > >> > > > > > >> >
> > >> > > > > >
> > >> > >
> > >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
> > >> > > > > > >> > > > > >> > >>>>> 767
> > >> > > > > > >> > > > > >> > >>>>>>>>> )
> > >> > > > > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into
> > Kafka
> > >> > > > extremely
> > >> > > > > > >> easy.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple
> with
> > >> > Kafka,
> > >> > > > we
> > >> > > > > > can
> > >> > > > > > >> > > > leverage
> > >> > > > > > >> > > > > a
> > >> > > > > > >> > > > > >> > >>>>>>>>> ton of
> > >> > > > > > >> > > > > >> > >>>>>>> their
> > >> > > > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to
> > >> maintain
> > >> > > our
> > >> > > > > own
> > >> > > > > > >> > config,
> > >> > > > > > >> > > > > >> > >>>>>>>>> metrics,
> > >> > > > > > >> > > > > >> > >>>>> etc.
> > >> > > > > > >> > > > > >> > >>>>>>> We
> > >> > > > > > >> > > > > >> > >>>>>>>>> can all share the same libraries,
> > and
> > >> > make
> > >> > > > them
> > >> > > > > > >> > better.
> > >> > > > > > >> > > > This
> > >> > > > > > >> > > > > >> > >>>>>>>>> will
> > >> > > > > > >> > > > > >> > >>>>> also
> > >> > > > > > >> > > > > >> > >>>>>>>>> allow us to share the
> > >> consumer/producer
> > >> > > APIs,
> > >> > > > > and
> > >> > > > > > >> will
> > >> > > > > > >> > > let
> > >> > > > > > >> > > > > us
> > >> > > > > > >> > > > > >> > >>>>> leverage
> > >> > > > > > >> > > > > >> > >>>>>>>>> their offset management and
> > partition
> > >> > > > > management,
> > >> > > > > > >> > rather
> > >> > > > > > >> > > > > than
> > >> > > > > > >> > > > > >> > >>>>>>>>> having
> > >> > > > > > >> > > > > >> > >>>>>> our
> > >> > > > > > >> > > > > >> > >>>>>>>>> own. All of the coordinator stream
> > >> code
> > >> > > would
> > >> > > > > go
> > >> > > > > > >> away,
> > >> > > > > > >> > > as
> > >> > > > > > >> > > > > >> would
> > >> > > > > > >> > > > > >> > >>>>>>>>> most
> > >> > > > > > >> > > > > >> > >>>>>> of
> > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably
> > >> have
> > >> > to
> > >> > > > push
> > >> > > > > > some
> > >> > > > > > >> > > > > partition
> > >> > > > > > >> > > > > >> > >>>>>>> management
> > >> > > > > > >> > > > > >> > >>>>>>>>> features into the Kafka broker,
> but
> > >> > they're
> > >> > > > > > already
> > >> > > > > > >> > > moving
> > >> > > > > > >> > > > > in
> > >> > > > > > >> > > > > >> > >>>>>>>>> that direction with the new
> consumer
> > >> API.
> > >> > > The
> > >> > > > > > >> features
> > >> > > > > > >> > > we
> > >> > > > > > >> > > > > have
> > >> > > > > > >> > > > > >> > >>>>>>>>> for
> > >> > > > > > >> > > > > >> > >>>>>> partition
> > >> > > > > > >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza,
> > and
> > >> > seem
> > >> > > > > like
> > >> > > > > > >> they
> > >> > > > > > >> > > > should
> > >> > > > > > >> > > > > >> be
> > >> > > > > > >> > > > > >> > >>>>>>>>> in
> > >> > > > > > >> > > > > >> > >>>>>> Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>>> anyway. There will always be some
> > >> niche
> > >> > > > usages
> > >> > > > > > which
> > >> > > > > > >> > > will
> > >> > > > > > >> > > > > >> > >>>>>>>>> require
> > >> > > > > > >> > > > > >> > >>>>>> extra
> > >> > > > > > >> > > > > >> > >>>>>>>>> care and hence full control over
> > >> > partition
> > >> > > > > > >> assignments
> > >> > > > > > >> > > > much
> > >> > > > > > >> > > > > >> > >>>>>>>>> like the
> > >> > > > > > >> > > > > >> > >>>>>>> Kafka
> > >> > > > > > >> > > > > >> > >>>>>>>>> low level consumer api. These
> would
> > >> > > continue
> > >> > > > to
> > >> > > > > > be
> > >> > > > > > >> > > > > supported.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> These items will be good for the
> > Samza
> > >> > > > > community.
> > >> > > > > > >> > > They'll
> > >> > > > > > >> > > > > make
> > >> > > > > > >> > > > > >> > >>>>>>>>> Samza easier to use, and make it
> > >> easier
> > >> > for
> > >> > > > > > >> developers
> > >> > > > > > >> > > to
> > >> > > > > > >> > > > > add
> > >> > > > > > >> > > > > >> > >>>>>>>>> new features.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large
> > (and
> > >> > > > somewhat
> > >> > > > > > >> > backwards
> > >> > > > > > >> > > > > >> > >>>>> incompatible
> > >> > > > > > >> > > > > >> > >>>>>>>>> change). If we choose to go this
> > >> route,
> > >> > > it's
> > >> > > > > > >> important
> > >> > > > > > >> > > > that
> > >> > > > > > >> > > > > we
> > >> > > > > > >> > > > > >> > >>>>> openly
> > >> > > > > > >> > > > > >> > >>>>>>>>> communicate how we're going to
> > >> provide a
> > >> > > > > > migration
> > >> > > > > > >> > path
> > >> > > > > > >> > > > from
> > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>> existing
> > >> > > > > > >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make
> > >> > > incompatible
> > >> > > > > > >> > changes).
> > >> > > > > > >> > > I
> > >> > > > > > >> > > > > >> think
> > >> > > > > > >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need
> to
> > >> > > provide a
> > >> > > > > > >> wrapper
> > >> > > > > > >> > to
> > >> > > > > > >> > > > > allow
> > >> > > > > > >> > > > > >> > >>>>>>>>> existing StreamTask
> implementations
> > to
> > >> > > > continue
> > >> > > > > > >> > running
> > >> > > > > > >> > > on
> > >> > > > > > >> > > > > the
> > >> > > > > > >> > > > > >> > >>>> new container.
> > >> > > > > > >> > > > > >> > >>>>>>> It's
> > >> > > > > > >> > > > > >> > >>>>>>>>> also important that we openly
> > >> communicate
> > >> > > > about
> > >> > > > > > >> > timing,
> > >> > > > > > >> > > > and
> > >> > > > > > >> > > > > >> > >>>>>>>>> stages
> > >> > > > > > >> > > > > >> > >>>>> of
> > >> > > > > > >> > > > > >> > >>>>>>> the
> > >> > > > > > >> > > > > >> > >>>>>>>>> migration.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure
> > you
> > >> > have
> > >> > > > > > opinions.
> > >> > > > > > >> > :)
> > >> > > > > > >> > > > > Please
> > >> > > > > > >> > > > > >> > >>>>>>>>> send
> > >> > > > > > >> > > > > >> > >>>>>> your
> > >> > > > > > >> > > > > >> > >>>>>>>>> thoughts and feedback.
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>> Cheers,
> > >> > > > > > >> > > > > >> > >>>>>>>>> Chris
> > >> > > > > > >> > > > > >> > >>>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>> --
> > >> > > > > > >> > > > > >> > >>>>>> -- Guozhang
> > >> > > > > > >> > > > > >> > >>>>>>
> > >> > > > > > >> > > > > >> > >>>>>
> > >> > > > > > >> > > > > >> > >>>>
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> > >>
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >> >
> > >> > > > > > >> > > > > >>
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > > >
> > >> > > > > > >> > > > >
> > >> > > > > > >> > > >
> > >> > > > > > >> > >
> > >> > > > > > >> >
> > >> > > > > > >>
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Thoughts and obesrvations on Samza

Reply via email to