Re: Thoughts and observations on Samza

Jay Kreps Fri, 10 Jul 2015 14:36:11 -0700

Hey Yan,

Yeah, philosophically I think the argument is that you should capture the
stream in Kafka independent of the transformation. This is obviously a
Kafka-centric viewpoint.


Advantages of this:
- In practice I think this is what e.g. Storm people often end up doing
anyway. You usually need to throttle any access to a live serving database.
- Can have multiple subscribers and they get the same thing without
additional load on the source system.
- Applications can tap into the stream if need be by subscribing.
- You can debug your transformation by tailing the Kafka topic with the
console consumer
- Can tee off the same data stream for batch analysis or Lambda arch style
re-processing
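
To make the fan-out point concrete, here is a toy in-memory sketch. It is
Python, not the Kafka API, and ToyLog with its methods is a made-up name for
illustration only. The idea it shows is that the source is read once and
every subscriber then reads the captured log at its own offset:

```python
class ToyLog:
    """In-memory stand-in for one topic partition (hypothetical, not Kafka's API)."""

    def __init__(self):
        self.records = []
        self.source_reads = 0  # how many times the source system was touched

    def capture(self, source_rows):
        # The source is read once; everything downstream reads the log instead.
        self.source_reads += 1
        self.records.extend(source_rows)

    def read(self, offset):
        # Each subscriber tracks its own offset, so reads are independent.
        return self.records[offset:]


log = ToyLog()
log.capture(["a", "b", "c"])

stream_job = log.read(0)  # the transformation
debug_tail = log.read(0)  # someone tailing the topic to debug
batch_job = log.read(1)   # a batch job starting from a later offset
```

All three readers see the same captured records, and the source system is
only touched by the single capture call.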

The disadvantage is that it will use Kafka resources. But the idea is
eventually you will have multiple subscribers to any data source (at least
for monitoring) so you will end up there soon enough anyway.

Down the road the technical benefit is that I think it gives us a good path
towards end-to-end exactly once semantics from source to destination.
Basically the connectors need to support idempotence when talking to Kafka
and we need the transactional write feature in Kafka to make the
transformation atomic. This is actually pretty doable if you separate
connector=>kafka problem from the generic transformations which are always
kafka=>kafka. However I think it is quite impossible to do in an all_things
=> all_things environment. Today you can say "well the semantics of the
Samza APIs depend on the connectors you use" but it is actually worse than
that because the semantics actually depend on the pairing of connectors. So
not only can you probably not get a usable "exactly once" guarantee
end-to-end, it can actually be quite hard to reverse engineer what property
(if any) your end-to-end flow has if you have heterogeneous systems.
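
A rough sketch of the two pieces, under loud assumptions: this is a toy
in-memory model in Python, not Kafka's producer or transaction API, and
ToyTopic, idempotent_append and transform_atomically are hypothetical names.
It only illustrates the split between an idempotent connector=>kafka write
(retries are dropped by sequence number) and a kafka=>kafka transformation
that advances its input offset together with its output:

```python
class ToyTopic:
    """Hypothetical in-memory topic that drops duplicate (producer_id, seq) writes."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence number accepted

    def idempotent_append(self, producer_id, seq, value):
        # A retried write carries the same seq, so it is ignored the second time.
        if seq <= self.last_seq.get(producer_id, -1):
            return False
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True


def transform_atomically(src, dst, state, fn):
    """Apply fn to unread src records, then update output and offset together."""
    start = state["offset"]
    # Staging happens first, so a failure inside fn changes nothing. A real
    # system would need a transactional write to get the same effect.
    staged = [fn(v) for v in src.records[start:]]
    dst.records.extend(staged)
    state["offset"] = len(src.records)


source_topic, output_topic = ToyTopic(), ToyTopic()
state = {"offset": 0}

source_topic.idempotent_append("connector-1", 0, 1)
source_topic.idempotent_append("connector-1", 0, 1)  # retry after a timeout
source_topic.idempotent_append("connector-1", 1, 2)

transform_atomically(source_topic, output_topic, state, lambda v: v * 10)
transform_atomically(source_topic, output_topic, state, lambda v: v * 10)  # re-run is a no-op
```

The retried connector write and the repeated transformation run both leave
the output unchanged, which is the property you cannot reason about once
arbitrary systems sit on both ends.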

-Jay

On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <[email protected]> wrote:

> {quote}
> maintained in a separate repository and retaining the existing
> committership but sharing as much else as possible (website, etc)
> {quote}
>
> Overall, I agree on this idea. Now the question is more about "how to do
> it".
>
> On the other hand, one thing I want to point out is that, if we decide to
> go this way, how do we want to support
> otherSystem-transformation-otherSystem use case?
>
> Basically, there are four user groups here:
>
> 1. Kafka-transformation-Kafka
> 2. Kafka-transformation-otherSystem
> 3. otherSystem-transformation-Kafka
> 4. otherSystem-transformation-otherSystem
>
> For group 1, they can easily use the new Samza library to achieve this. For
> groups 2 and 3, they can use Kafka -> transformation -> copyCat or copyCat
> -> transformation -> Kafka, respectively.
>
> The problem is for group 4. Do we want to abandon this or still support it?
> Of course, this use case can be achieved by using copyCat -> transformation
> -> Kafka -> transformation -> copyCat; the question is how we persuade them
> to use this long chain. If we can, it will be a win for Kafka too. Or if
> no one in this community is actually doing this so far, it may be OK not
> to support group 4 directly.
>
> Thanks,
>
> Fang, Yan
> [email protected]
>
> On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <[email protected]> wrote:
>
> > Yeah I agree with this summary. I think there are kind of two questions
> > here:
> > 1. Technically does alignment/reliance on Kafka make sense
> > 2. Branding wise (naming, website, concepts, etc) does alignment with
> > Kafka make sense
> >
> > Personally I do think both of these things would be really valuable, and
> > would dramatically alter the trajectory of the project.
> >
> > My preference would be to see if people can mostly agree on a direction
> > rather than splintering things off. From my point of view the ideal
> > outcome of all the options discussed would be to make Samza a closely
> > aligned subproject, maintained in a separate repository and retaining the
> > existing committership but sharing as much else as possible (website,
> > etc). No idea about how these things work, Jakob, you probably know more.
> >
> > No discussion amongst the Kafka folks has happened on this, but likely we
> > should figure out what the Samza community actually wants first.
> >
> > I admit that this is a fairly radical departure from how things are.
> >
> > If that doesn't fly, I think, yeah we could leave Samza as it is and do
> > the more radical reboot inside Kafka. From my point of view that does
> > leave things in a somewhat confusing state since now there are two stream
> > processing systems more or less coupled to Kafka in large part made by
> > the same people. But, arguably that might be a cleaner way to make the
> > cut-over and perhaps less risky for Samza community since if it works
> > people can switch and if it doesn't nothing will have changed. Dunno, how
> > do people feel about this?
> >
> > -Jay
> >
> > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <[email protected]> wrote:
> >
> > > >  This leads me to thinking that merging projects and communities
> > > > might be a good idea: with the union of experience from both
> > > > communities, we will probably build a better system that is better
> > > > for users.
> > > Is this what's being proposed though? Merging the projects seems like
> > > a consequence of at most one of the three directions under discussion:
> > > 1) Samza 2.0: The Samza community relies more heavily on Kafka for
> > > configuration, etc. (to a greater or lesser extent to be determined)
> > > but the Samza community would not automatically merge with the Kafka
> > > community (the Phoenix/HBase example is a good one here).
> > > 2) Samza Reboot: The Samza community continues to exist with a limited
> > > project scope, but similarly would not need to be part of the Kafka
> > > community (ie given committership) to progress.  Here, maybe the Samza
> > > team would become a subproject of Kafka (the Board frowns on
> > > subprojects at the moment, so I'm not sure if that's even feasible),
> > > but that would not be required.
> > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka
> > > team builds its own streaming library, possibly off of Jay's
> > > prototype, which has no direct lineage to the Samza team.  There's no
> > > reason for the Kafka team to bring in the Samza team.
> > >
> > > Is the Kafka community on board with this?
> > >
> > > To be clear, all three options under discussion are interesting,
> > > technically valid and likely healthy directions for the project.
> > > Also, they are not mutually exclusive.  The Samza community could
> > > decide to pursue, say, 'Samza 2.0', while the Kafka community went
> > > forward with 'Hey Samza!'  My points above are directed entirely at
> > > the community aspect of these choices.
> > > -Jakob
> > >
> > > On 10 July 2015 at 09:10, Roger Hoover <[email protected]> wrote:
> > > > That's great.  Thanks, Jay.
> > > >
> > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <[email protected]> wrote:
> > > >
> > > >> Yeah totally agree. I think you have this issue even today, right?
> > > >> I.e. if you need to make a simple config change and you're running in
> > > >> YARN today you end up bouncing the job which then rebuilds state. I
> > > >> think the fix is exactly what you described which is to have a long
> > > >> timeout on partition movement for stateful jobs so that if a job is
> > > >> just getting bounced, and the cluster manager (or admin) is smart
> > > >> enough to restart it on the same host when possible, it can
> > > >> optimistically reuse any existing state it finds on disk (if it is
> > > >> valid).
> > > >>
> > > >> So in this model the charter of the CM is to place processes as
> > > >> stickily as possible and to restart or re-place failed processes. The
> > > >> charter of the partition management system is to control the
> > > >> assignment of work to these processes. The nice thing about this is
> > > >> that the work assignment, timeouts, behavior, configs, and code will
> > > >> all be the same across all cluster managers.
> > > >>
> > > >> So I think that prototype would actually give you exactly what you
> > > >> want today for any cluster manager (or manual placement + restart
> > > >> script) that was sticky in terms of host placement since there is
> > > >> already a configurable partition movement timeout and task-by-task
> > > >> state reuse with a check on state validity.
> > > >>
> > > >> -Jay
> > > >>
> > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <[email protected]> wrote:
> > > >>
> > > >> > That would be great to let Kafka do as much heavy lifting as possible
> > > >> > and make it easier for other languages to implement Samza apis.
> > > >> >
> > > >> > One thing to watch out for is the interplay between Kafka's group
> > > >> > management and the external scheduler/process manager's fault
> > > >> > tolerance.  If a container dies, the Kafka group membership protocol
> > > >> > will try to assign its tasks to other containers while at the same
> > > >> > time the process manager is trying to relaunch the container.  Without
> > > >> > some consideration for this (like a configurable amount of time to
> > > >> > wait before Kafka alters the group membership), there may be thrashing
> > > >> > going on which is especially bad for containers with large amounts of
> > > >> > local state.
> > > >> >
> > > >> > Someone else pointed this out already but I thought it might be worth
> > > >> > calling out again.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > Roger
> > > >> >
> > > >> >
> > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <[email protected]>
> > wrote:
> > > >> >
> > > >> > > Hey Roger,
> > > >> > >
> > > >> > > I couldn't agree more. We spent a bunch of time talking to
> people
> > > and
> > > >> > that
> > > >> > > is exactly the stuff we heard time and again. What makes it
> hard,
> > of
> > > >> > > course, is that there is some tension between compatibility with
> > > what's
> > > >> > > there now and making things better for new users.
> > > >> > >
> > > >> > > I also strongly agree with the importance of multi-language
> > > support. We
> > > >> > are
> > > >> > > talking now about Java, but for application development use
> cases
> > > >> people
> > > >> > > want to work in whatever language they are using elsewhere. I
> > think
> > > >> > moving
> > > >> > > to a model where Kafka itself does the group membership,
> lifecycle
> > > >> > control,
> > > >> > > and partition assignment has the advantage of putting all that
> > > complex
> > > >> > > stuff behind a clean api that the clients are already going to
> be
> > > >> > > implementing for their consumer, so the added functionality for
> > > stream
> > > >> > > processing beyond a consumer becomes very minor.
> > > >> > >
> > > >> > > -Jay
> > > >> > >
> > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <
> > > [email protected]>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Metamorphosis...nice. :)
> > > >> > > >
> > > >> > > > This has been a great discussion.  As a user of Samza who's
> > > recently
> > > >> > > > integrated it into a relatively large organization, I just
> want
> > to
> > > >> add
> > > >> > > > support to a few points already made.
> > > >> > > >
> > > >> > > > The biggest hurdles to adoption of Samza as it currently
> exists
> > > that
> > > >> > I've
> > > >> > > > experienced are:
> > > >> > > > 1) YARN - YARN is overly complex in many environments where
> > Puppet
> > > >> > would
> > > >> > > do
> > > >> > > > just fine but it was the only mechanism to get fault
> tolerance.
> > > >> > > > 2) Configuration - I think I like the idea of configuring most
> > of
> > > the
> > > >> > job
> > > >> > > > in code rather than config files.  In general, I think the
> goal
> > > >> should
> > > >> > be
> > > >> > > > to make it harder to make mistakes, especially of the kind
> where
> > > the
> > > >> > code
> > > >> > > > expects something and the config doesn't match.  The current
> > > config
> > > >> is
> > > >> > > > quite intricate and error-prone.  For example, the application
> > > logic
> > > >> > may
> > > >> > > > depend on bootstrapping a topic but rather than asserting that
> > in
> > > the
> > > >> > > code,
> > > >> > > > you have to rely on getting the config right.  Likewise with
> > > serdes,
> > > >> > the
> > > >> > > > Java representations produced by various serdes (JSON, Avro,
> > etc.)
> > > >> are
> > > >> > > not
> > > >> > > > equivalent so you cannot just reconfigure a serde without
> > changing
> > > >> the
> > > >> > > > code.  It would be nice for jobs to be able to assert what they
> > > >> > > > expect from their input topics in terms of partitioning.  This
> > > >> > > > is getting a little off topic but I was even thinking about
> > > >> > > > creating a "Samza config linter" that would sanity check a set
> > > >> > > > of configs.  Especially in organizations where config is managed
> > > >> > > > by a different team than the application developer, it's very
> > > >> > > > hard to avoid config mistakes.
> > > >> > > > 3) Java/Scala centric - for many teams (especially DevOps-type
> > > >> folks),
> > > >> > > the
> > > >> > > > pain of the Java toolchain (maven, slow builds, weak command
> > line
> > > >> > > support,
> > > >> > > > configuration over convention) really inhibits productivity.
> As
> > > more
> > > >> > and
> > > >> > > > more high-quality clients become available for Kafka, I hope
> > > they'll
> > > >> > > follow
> > > >> > > > Samza's model.  Not sure how much it affects the proposals in
> > this
> > > >> > thread
> > > >> > > > but please consider other languages in the ecosystem as well.
> > > From
> > > >> > what
> > > >> > > > I've heard, Spark has more Python users than Java/Scala.
> > > >> > > > (FYI, we added a Jython wrapper for the Samza API
> > > >> > > > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> > > >> > > > and are working on a Yeoman generator
> > > >> > > > https://github.com/Quantiply/generator-rico for Jython/Samza
> > > >> > > > projects to alleviate some of the pain)
> > > >> > > >
> > > >> > > > I also want to underscore Jay's point about improving the user
> > > >> > > experience.
> > > >> > > > That's a very important factor for adoption.  I think the goal
> > > should
> > > >> > be
> > > >> > > to
> > > >> > > > make Samza as easy to get started with as something like
> > Logstash.
> > > >> > > > Logstash is vastly inferior in terms of capabilities to Samza
> > but
> > > >> it's
> > > >> > > easy
> > > >> > > > to get started and that makes a big difference.
> > > >> > > >
> > > >> > > > Cheers,
> > > >> > > >
> > > >> > > > Roger
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci
> Morales <
> > > >> > > > [email protected]> wrote:
> > > >> > > >
> > > >> > > > > Forgot to add. On the naming issues, Kafka Metamorphosis is
> a
> > > clear
> > > >> > > > winner
> > > >> > > > > :)
> > > >> > > > >
> > > >> > > > > --
> > > >> > > > > Gianmarco
> > > >> > > > >
> > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <
> > > >> > > [email protected]
> > > >> > > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi,
> > > >> > > > > >
> > > >> > > > > > @Martin, thanks for your comments.
> > > >> > > > > > Maybe I'm missing some important point, but I think
> coupling
> > > the
> > > >> > > > releases
> > > >> > > > > > is actually a *good* thing.
> > > >> > > > > > To make an example, would it be better if the MR and HDFS
> > > >> > components
> > > >> > > of
> > > >> > > > > > Hadoop had different release schedules?
> > > >> > > > > >
> > > >> > > > > > Actually, keeping the discussion in a single place would
> > make
> > > >> > > agreeing
> > > >> > > > on
> > > >> > > > > > releases (and backwards compatibility) much easier, as
> > > everybody
> > > >> > > would
> > > >> > > > be
> > > >> > > > > > responsible for the whole codebase.
> > > >> > > > > >
> > > >> > > > > > That said, I like the idea of absorbing samza-core as a
> > > >> > sub-project,
> > > >> > > > and
> > > >> > > > > > leave the fancy stuff separate.
> > > >> > > > > > It probably gives 90% of the benefits we have been
> > discussing
> > > >> here.
> > > >> > > > > >
> > > >> > > > > > Cheers,
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Gianmarco
> > > >> > > > > >
> > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <[email protected]>
> > > wrote:
> > > >> > > > > >
> > > >> > > > > >> Hey Martin,
> > > >> > > > > >>
> > > >> > > > > >> I agree coupling release schedules is a downside.
> > > >> > > > > >>
> > > >> > > > > >> Definitely we can try to solve some of the integration
> > > problems
> > > >> in
> > > >> > > > > >> Confluent Platform or in other distributions. But I think
> > > this
> > > >> > ends
> > > >> > > up
> > > >> > > > > >> being really shallow. I guess I feel to really get a good
> > > user
> > > >> > > > > experience
> > > >> > > > > >> the two systems have to kind of feel like part of the
> same
> > > thing
> > > >> > and
> > > >> > > > you
> > > >> > > > > >> can't really add that in later--you can put both in the
> > same
> > > >> > > > > downloadable
> > > >> > > > > >> tar file but it doesn't really give a very cohesive
> > feeling.
> > > I
> > > >> > agree
> > > >> > > > > that
> > > >> > > > > >> ultimately any of the project stuff is as much social and
> > > naming
> > > >> > as
> > > >> > > > > >> anything else--theoretically two totally independent
> > projects
> > > >> > could
> > > >> > > > work
> > > >> > > > > >> to
> > > >> > > > > >> tightly align. In practice this seems to be quite
> difficult
> > > >> > though.
> > > >> > > > > >>
> > > >> > > > > >> For the frameworks--totally agree it would be good to
> > > maintain
> > > >> the
> > > >> > > > > >> framework support with the project. In some cases there
> may
> > > not
> > > >> be
> > > >> > > too
> > > >> > > > > >> much
> > > >> > > > > >> there since the integration gets lighter but I think
> > whatever
> > > >> > stubs
> > > >> > > > you
> > > >> > > > > >> need should be included. So no I definitely wasn't trying
> > to
> > > >> imply
> > > >> > > > > >> dropping
> > > >> > > > > >> support for these frameworks, just making the integration
> > > >> lighter
> > > >> > by
> > > >> > > > > >> separating process management from partition management.
> > > >> > > > > >>
> > > >> > > > > >> You raise two good points we would have to figure out if
> we
> > > went
> > > >> > > down
> > > >> > > > > the
> > > >> > > > > >> alignment path:
> > > >> > > > > >> 1. With respect to the name, yeah I think the first
> > question
> > > is
> > > >> > > > whether
> > > >> > > > > >> some "re-branding" would be worth it. If so then I think
> we
> > > can
> > > >> > > have a
> > > >> > > > > big
> > > >> > > > > >> thread on the name. I'm definitely not set on Kafka
> > > Streaming or
> > > >> > > Kafka
> > > >> > > > > >> Streams I was just using them to be kind of
> illustrative. I
> > > >> agree
> > > >> > > with
> > > >> > > > > >> your
> > > >> > > > > >> critique of these names, though I think people would get
> > the
> > > >> idea.
> > > >> > > > > >> 2. Yeah you also raise a good point about how to "factor"
> > it.
> > > >> Here
> > > >> > > are
> > > >> > > > > the
> > > >> > > > > >> options I see (I could get enthusiastic about any of
> them):
> > > >> > > > > >>    a. One repo for both Kafka and Samza
> > > >> > > > > >>    b. Two repos, retaining the current separation
> > > >> > > > > >>    c. Two repos, the equivalent of samza-api and samza-core is
> > > >> > > > > >>       absorbed almost like a third client
> > > >> > > > > >>
> > > >> > > > > >> Cheers,
> > > >> > > > > >>
> > > >> > > > > >> -Jay
> > > >> > > > > >>
> > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <
> > > >> > > > [email protected]>
> > > >> > > > > >> wrote:
> > > >> > > > > >>
> > > >> > > > > >> > Ok, thanks for the clarifications. Just a few follow-up
> > > >> > comments.
> > > >> > > > > >> >
> > > >> > > > > >> > - I see the appeal of merging with Kafka or becoming a
> > > >> > subproject:
> > > >> > > > the
> > > >> > > > > >> > reasons you mention are good. The risk I see is that
> > > release
> > > >> > > > schedules
> > > >> > > > > >> > become coupled to each other, which can slow everyone
> > down,
> > > >> and
> > > >> > > > large
> > > >> > > > > >> > projects with many contributors are harder to manage.
> > > (Jakob,
> > > >> > can
> > > >> > > > you
> > > >> > > > > >> speak
> > > >> > > > > >> > from experience, having seen a wider range of Hadoop
> > > ecosystem
> > > >> > > > > >> projects?)
> > > >> > > > > >> >
> > > >> > > > > >> > Some of the goals of a better unified developer
> > experience
> > > >> could
> > > >> > > > also
> > > >> > > > > be
> > > >> > > > > >> > solved by integrating Samza nicely into a Kafka
> > > distribution
> > > >> > (such
> > > >> > > > as
> > > >> > > > > >> > Confluent's). I'm not against merging projects if we
> > decide
> > > >> > that's
> > > >> > > > the
> > > >> > > > > >> way
> > > >> > > > > >> > to go, just pointing out the same goals can perhaps
> also
> > be
> > > >> > > achieved
> > > >> > > > > in
> > > >> > > > > >> > other ways.
> > > >> > > > > >> >
> > > >> > > > > >> > - With regard to dropping the YARN dependency: are you
> > > >> proposing
> > > >> > > > that
> > > >> > > > > >> > Samza doesn't give any help to people wanting to run on
> > > >> > > > > >> YARN/Mesos/AWS/etc?
> > > >> > > > > >> > So the docs would basically have a link to Slider and
> > > nothing
> > > >> > > else?
> > > >> > > > Or
> > > >> > > > > >> > would we maintain integrations with a bunch of popular
> > > >> > deployment
> > > >> > > > > >> methods
> > > >> > > > > >> > (e.g. the necessary glue and shell scripts to make
> Samza
> > > work
> > > >> > with
> > > >> > > > > >> Slider)?
> > > >> > > > > >> >
> > > >> > > > > >> > I absolutely think it's a good idea to have the "as a
> > > library"
> > > >> > and
> > > >> > > > > "as a
> > > >> > > > > >> > process" (using Yi's taxonomy) options for people who
> > want
> > > >> them,
> > > >> > > > but I
> > > >> > > > > >> > think there should also be a low-friction path for
> common
> > > "as
> > > >> a
> > > >> > > > > service"
> > > >> > > > > >> > deployment methods, for which we probably need to
> > maintain
> > > >> > > > > integrations.
> > > >> > > > > >> >
> > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to me,
> > because
> > > >> Kafka
> > > >> > > is
> > > >> > > > > all
> > > >> > > > > >> > about streams already. Perhaps "Kafka Transformers" or
> > > "Kafka
> > > >> > > > Filters"
> > > >> > > > > >> > would be more apt?
> > > >> > > > > >> >
> > > >> > > > > >> > One suggestion: perhaps the core of Samza (stream
> > > >> transformation
> > > >> > > > with
> > > >> > > > > >> > state management -- i.e. the "Samza as a library" bit)
> > > could
> > > >> > > become
> > > >> > > > > >> part of
> > > >> > > > > >> > Kafka, while higher-level tools such as streaming SQL
> and
> > > >> > > > integrations
> > > >> > > > > >> with
> > > >> > > > > >> > deployment frameworks remain in a separate project? In
> > > other
> > > >> > > words,
> > > >> > > > > >> Kafka
> > > >> > > > > >> > would absorb the proven, stable core of Samza, which
> > would
> > > >> > become
> > > >> > > > the
> > > >> > > > > >> > "third Kafka client" mentioned early in this thread.
> The
> > > Samza
> > > >> > > > project
> > > >> > > > > >> > would then target that third Kafka client as its base
> > API,
> > > and
> > > >> > the
> > > >> > > > > >> project
> > > >> > > > > >> > would be freed up to explore more experimental new
> > > horizons.
> > > >> > > > > >> >
> > > >> > > > > >> > Martin
> > > >> > > > > >> >
> > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
> [email protected]>
> > > >> wrote:
> > > >> > > > > >> >
> > > >> > > > > >> > > Hey Martin,
> > > >> > > > > >> > >
> > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually don't
> > think
> > > it
> > > >> > ties
> > > >> > > > our
> > > >> > > > > >> > hands
> > > >> > > > > >> > > at all, all it does is refactor things. The division
> of
> > > >> > > > > >> responsibility is
> > > >> > > > > >> > > that Samza core is responsible for task lifecycle,
> > state,
> > > >> and
> > > >> > > > > >> partition
> > > >> > > > > >> > > management (using the Kafka co-ordinator) but it is
> NOT
> > > >> > > > responsible
> > > >> > > > > >> for
> > > >> > > > > >> > > packaging, configuration deployment or execution of
> > > >> processes.
> > > >> > > The
> > > >> > > > > >> > problem
> > > >> > > > > >> > > of packaging and starting these processes is
> > > >> > > > > >> > > framework/environment-specific. This leaves
> individual
> > > >> > > frameworks
> > > >> > > > to
> > > >> > > > > >> be
> > > >> > > > > >> > as
> > > >> > > > > >> > > fancy or vanilla as they like. So you can get simple
> > > >> stateless
> > > >> > > > > >> support in
> > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app
> > framework
> > > >> > > (Slider,
> > > >> > > > > >> > Marathon,
> > > >> > > > > >> > > etc). These are well known by people and have nice
> UIs
> > > and a
> > > >> > lot
> > > >> > > > of
> > > >> > > > > >> > > flexibility. I don't think they have node affinity
> as a
> > > >> built
> > > >> > in
> > > >> > > > > >> option
> > > >> > > > > >> > > (though I could be wrong). So if we want that we can
> > > either
> > > >> > wait
> > > >> > > > for
> > > >> > > > > >> them
> > > >> > > > > >> > > to add it or do a custom framework to add that
> feature
> > > (as
> > > >> > now).
> > > >> > > > > >> > Obviously
> > > >> > > > > >> > > if you manage things with old-school ops tools
> > > >> > (puppet/chef/etc)
> > > >> > > > you
> > > >> > > > > >> get
> > > >> > > > > >> > > locality easily. The nice thing, though, is that all
> > the
> > > >> samza
> > > >> > > > > >> "business
> > > >> > > > > >> > > logic" around partition management and fault
> tolerance
> > > is in
> > > >> > > Samza
> > > >> > > > > >> core
> > > >> > > > > >> > so
> > > >> > > > > >> > > it is shared across frameworks and the framework
> > specific
> > > >> bit
> > > >> > is
> > > >> > > > > just
> > > >> > > > > >> > > whether it is smart enough to try to get the same
> host
> > > when
> > > >> a
> > > >> > > job
> > > >> > > > is
> > > >> > > > > >> > > restarted.
> > > >> > > > > >> > >
> > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I think the
> > > goal
> > > >> > would
> > > >> > > > be
> > > >> > > > > >> (a)
> > > >> > > > > >> > > actually get better alignment in user experience, and
> > (b)
> > > >> > > express
> > > >> > > > > >> this in
> > > >> > > > > >> > > the naming and project branding. Specifically:
> > > >> > > > > >> > > 1. Website/docs, it would be nice for the
> > > "transformation"
> > > >> api
> > > >> > > to
> > > >> > > > be
> > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be able to
> > > explain
> > > >> > > when
> > > >> > > > to
> > > >> > > > > >> use
> > > >> > > > > >> > > the consumer and when to use the stream processing
> > > >> > functionality
> > > >> > > > and
> > > >> > > > > >> lead
> > > >> > > > > >> > > people into that experience.
> > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or
> > > whatever)
> > > >> > that
> > > >> > > > has
> > > >> > > > > >> both
> > > >> > > > > >> > > Kafka and the stream processing part and they
> actually
> > > work
> > > >> > > > > together.
> > > >> > > > > >> > > 3. Unify the programming experience so the client and
> > > Samza
> > > >> > api
> > > >> > > > > share
> > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
> > > >> > > > > >> > >
> > > >> > > > > >> > > I think sub-projects keep separate committers and can
> > > have a
> > > >> > > > > separate
> > > >> > > > > >> > repo,
> > > >> > > > > >> > > but I'm actually not really sure (I can't find a
> > > definition
> > > >> > of a
> > > >> > > > > >> > subproject
> > > >> > > > > >> > > in Apache).
> > > >> > > > > >> > >
> > > >> > > > > >> > > Basically at a high-level you want the experience to
> > > >> > > > > >> > > "feel" like a single system, not two relatively
> > > >> > > > > >> > > independent things that are kind of awkwardly glued
> > > >> > > > > >> > > together.
> > > >> > > > > >> > >
> > > >> > > > > >> > > I think if we did that then having naming or branding
> > > >> > > > > >> > > like "kafka streaming" or "kafka streams" or something
> > > >> > > > > >> > > like that would actually do a good job of conveying
> > > >> > > > > >> > > what it is. I do think that this would help adoption
> > > >> > > > > >> > > quite a lot as it would correctly convey that using
> > > >> > > > > >> > > Kafka Streaming with Kafka is a fairly seamless
> > > >> > > > > >> > > experience and Kafka is pretty heavily adopted at this
> > > >> > > > > >> > > point.
> > > >> > > > > >> > >
> > > >> > > > > >> > > Fwiw we actually considered this model originally when
> > > >> > > > > >> > > open sourcing Samza, however at that time Kafka was
> > > >> > > > > >> > > relatively unknown and we decided not to do it since
> > > >> > > > > >> > > we felt it would be limiting. From my point of view
> > > >> > > > > >> > > three things have changed: (1) Kafka is now really
> > > >> > > > > >> > > heavily used for stream processing, (2) we learned
> > > >> > > > > >> > > that abstracting out the stream well is basically
> > > >> > > > > >> > > impossible, (3) we learned it is really hard to keep
> > > >> > > > > >> > > the two things feeling like a single product.
> > > >> > > > > >> > >
> > > >> > > > > >> > > -Jay
> > > >> > > > > >> > >
> > > >> > > > > >> > >
> > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <
> > > >> > > > > >> [email protected]>
> > > >> > > > > >> > > wrote:
> > > >> > > > > >> > >
> > > >> > > > > >> > >> Hi all,
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Lots of good thoughts here.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> I agree with the general philosophy of tying Samza
> > more
> > > >> > firmly
> > > >> > > to
> > > >> > > > > >> Kafka.
> > > >> > > > > >> > >> After I spent a while looking at integrating other
> > > message
> > > >> > > > brokers
> > > >> > > > > >> (e.g.
> > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the
> conclusion
> > > that
> > > >> > > > > >> > SystemConsumer
> > > >> > > > > >> > >> tacitly assumes a model so much like Kafka's that
> > pretty
> > > >> much
> > > >> > > > > nobody
> > > >> > > > > >> but
> > > >> > > > > >> > >> Kafka actually implements it. (Databus is perhaps an
> > > >> > exception,
> > > >> > > > but
> > > >> > > > > >> it
> > > >> > > > > >> > >> isn't widely used outside of LinkedIn.) Thus, making
> > > Samza
> > > >> > > fully
> > > >> > > > > >> > dependent
> > > >> > > > > >> > >> on Kafka acknowledges that the system-independence
> was
> > > >> never
> > > >> > as
> > > >> > > > > real
> > > >> > > > > >> as
> > > >> > > > > >> > we
> > > >> > > > > >> > >> perhaps made it out to be. The gains of code reuse
> are
> > > >> real.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> The idea of decoupling Samza from YARN has also
> always
> > > been
> > > >> > > > > >> appealing to
> > > >> > > > > >> > >> me, for various reasons already mentioned in this
> > > thread.
> > > >> > > > Although
> > > >> > > > > >> > making
> > > >> > > > > >> > >> Samza jobs deployable on anything
> (YARN/Mesos/AWS/etc)
> > > >> seems
> > > >> > > > > >> laudable,
> > > >> > > > > >> > I am
> > > >> > > > > >> > >> a little concerned that it will restrict us to a
> > lowest
> > > >> > common
> > > >> > > > > >> > denominator.
> > > >> > > > > >> > >> For example, would host affinity (SAMZA-617) still
> be
> > > >> > possible?
> > > >> > > > For
> > > >> > > > > >> jobs
> > > >> > > > > >> > >> with large amounts of state, I think SAMZA-617 would
> > be
> > > a
> > > >> big
> > > >> > > > boon,
> > > >> > > > > >> > since
> > > >> > > > > >> > >> restoring state off the changelog on every single
> > > restart
> > > >> is
> > > >> > > > > painful,
> > > >> > > > > >> > due
> > > >> > > > > >> > >> to long recovery times. It would be a shame if the
> > > >> decoupling
> > > >> > > > from
> > > >> > > > > >> YARN
> > > >> > > > > >> > >> made host affinity impossible.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Jay, a question about the proposed API for
> > > instantiating a
> > > >> > job
> > > >> > > in
> > > >> > > > > >> code
> > > >> > > > > >> > >> (rather than a properties file): when submitting a
> job
> > > to a
> > > >> > > > > cluster,
> > > >> > > > > >> is
> > > >> > > > > >> > the
> > > >> > > > > >> > >> idea that the instantiation code runs on a client
> > > >> somewhere,
> > > >> > > > which
> > > >> > > > > >> then
> > > >> > > > > >> > >> pokes the necessary endpoints on YARN/Mesos/AWS/etc?
> > Or
> > > >> does
> > > >> > > that
> > > >> > > > > >> code
> > > >> > > > > >> > run
> > > >> > > > > >> > >> on each container that is part of the job (in which
> > > case,
> > > >> how
> > > >> > > > does
> > > >> > > > > >> the
> > > >> > > > > >> > job
> > > >> > > > > >> > >> submission to the cluster work)?
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> I agree with Garry that it doesn't feel right to
> make
> > a
> > > 1.0
> > > >> > > > release
> > > >> > > > > >> > with a
> > > >> > > > > >> > >> plan for it to be immediately obsolete. So if this
> is
> > > going
> > > >> > to
> > > >> > > > > >> happen, I
> > > >> > > > > >> > >> think it would be more honest to stick with 0.*
> > version
> > > >> > numbers
> > > >> > > > > until
> > > >> > > > > >> > the
> > > >> > > > > >> > >> library-ified Samza has been implemented, is stable
> > and
> > > >> > widely
> > > >> > > > > used.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Should the new Samza be a subproject of Kafka? There
> > is
> > > >> > > precedent
> > > >> > > > > for
> > > >> > > > > >> > >> tight coupling between different Apache projects
> (e.g.
> > > >> > Curator
> > > >> > > > and
> > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think remaining
> > > >> separate
> > > >> > > > would
> > > >> > > > > >> be
> > > >> > > > > >> > ok.
> > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka, there is
> > > enough
> > > >> > > > > substance
> > > >> > > > > >> in
> > > >> > > > > >> > >> Samza that it warrants being a separate project. An
> > > >> argument
> > > >> > in
> > > >> > > > > >> favour
> > > >> > > > > >> > of
> > > >> > > > > >> > >> merging would be if we think Kafka has a much
> stronger
> > > >> "brand
> > > >> > > > > >> presence"
> > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If the Kafka
> > > >> project
> > > >> > is
> > > >> > > > > >> willing
> > > >> > > > > >> > to
> > > >> > > > > >> > >> endorse Samza as the "official" way of doing
> stateful
> > > >> stream
> > > >> > > > > >> > >> transformations, that would probably have much the
> > same
> > > >> > effect
> > > >> > > as
> > > >> > > > > >> > >> re-branding Samza as "Kafka Stream Processors" or
> > > suchlike.
> > > >> > > Close
> > > >> > > > > >> > >> collaboration between the two projects will be
> needed
> > in
> > > >> any
> > > >> > > > case.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> From a project management perspective, I guess the
> > "new
> > > >> > Samza"
> > > >> > > > > would
> > > >> > > > > >> > have
> > > >> > > > > >> > >> to be developed on a branch alongside ongoing
> > > maintenance
> > > >> of
> > > >> > > the
> > > >> > > > > >> current
> > > >> > > > > >> > >> line of development? I think it would be important
> to
> > > >> > continue
> > > >> > > > > >> > supporting
> > > >> > > > > >> > >> existing users, and provide a graceful migration
> path
> > to
> > > >> the
> > > >> > > new
> > > >> > > > > >> > version.
> > > >> > > > > >> > >> Leaving the current versions unsupported and forcing
> > > people
> > > >> > to
> > > >> > > > > >> rewrite
> > > >> > > > > >> > >> their jobs would send a bad signal.
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> Best,
> > > >> > > > > >> > >> Martin
> > > >> > > > > >> > >>
> > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <
> [email protected]>
> > > >> wrote:
> > > >> > > > > >> > >>
> > > >> > > > > >> > >>> Hey Garry,
> > > >> > > > > >> > >>>
> > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy to chat
> > > more
> > > >> > about
> > > >> > > > > this
> > > >> > > > > >> if
> > > >> > > > > >> > >>> you'd be interested. I think Chris and I started
> with
> > > the
> > > >> > idea
> > > >> > > > of
> > > >> > > > > >> "what
> > > >> > > > > >> > >>> would it take to make Samza a kick-ass ingestion
> > tool"
> > > but
> > > >> > > > > >> ultimately
> > > >> > > > > >> > we
> > > >> > > > > >> > >>> kind of came around to the idea that ingestion and
> > > >> > > > transformation
> > > >> > > > > >> had
> > > >> > > > > >> > >>> pretty different needs and coupling the two made
> > things
> > > >> > hard.
> > > >> > > > > >> > >>>
> > > >> > > > > >> > >>> For what it's worth I think copycat (KIP-26)
> actually
> > > will
> > > >> > do
> > > >> > > > what
> > > >> > > > > >> you
> > > >> > > > > >> > >> are
> > > >> > > > > >> > >>> looking for.
> > > >> > > > > >> > >>>
> > > >> > > > > >> > >>> With regard to your point about slider, I don't
> > > >> necessarily
> > > >> > > > > >> disagree.
> > > >> > > > > >> > >> But I
> > > >> > > > > >> > >>> think getting good YARN support is quite doable
> and I
> > > >> think
> > > >> > we
> > > >> > > > can
> > > >> > > > > >> make
> > > >> > > > > >> > >>> that work well. I think the issue this proposal
> > solves
> > > is
> > > >> > that
> > > >> > > > > >> > >> technically
> > > >> > > > > >> > >>> it is pretty hard to support multiple cluster
> > > management
> > > >> > > systems
> > > >> > > > > the
> > > >> > > > > >> > way
> > > >> > > > > >> > >>> things are now, you need to write an "app master"
> or
> > > >> > > "framework"
> > > >> > > > > for
> > > >> > > > > >> > each
> > > >> > > > > >> > >>> and they are all a little different so testing is
> > > really
> > > >> > hard.
> > > >> > > > In
> > > >> > > > > >> the
> > > >> > > > > >> > >>> absence of this we have been stuck with just YARN
> > which
> > > >> has
> > > >> > > > > >> fantastic
> > > >> > > > > >> > >>> penetration in the Hadoopy part of the org, but
> zero
> > > >> > > penetration
> > > >> > > > > >> > >> elsewhere.
> > > >> > > > > >> > >>> Given the huge amount of work being put in to
> slider,
> > > >> > > marathon,
> > > >> > > > > aws
> > > >> > > > > >> > >>> tooling, not to mention the umpteen related
> packaging
> > > >> > > > technologies
> > > >> > > > > >> > people
> > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various
> > cloud-specific
> > > >> > deploy
> > > >> > > > > >> tools,
> > > >> > > > > >> > >> etc)
> > > >> > > > > >> > >>> I really think it is important to get this right.
> > > >> > > > > >> > >>>
> > > >> > > > > >> > >>> -Jay
> > > >> > > > > >> > >>>
> > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <
> > > >> > > > > >> > >>> [email protected]> wrote:
> > > >> > > > > >> > >>>
> > > >> > > > > >> > >>>> Hi all,
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> I think the question below re does Samza become a
> > > >> > sub-project
> > > >> > > > of
> > > >> > > > > >> Kafka
> > > >> > > > > >> > >>>> highlights the broader point around migration.
> Chris
> > > >> > mentions
> > > >> > > > > >> Samza's
> > > >> > > > > >> > >>>> maturity is heading towards a v1 release but I'm
> not
> > > sure
> > > >> > it
> > > >> > > > > feels
> > > >> > > > > >> > >> right to
> > > >> > > > > >> > >>>> launch a v1 then immediately plan to deprecate
> most
> > of
> > > >> it.
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> From a selfish perspective I have some guys who
> have
> > > >> > started
> > > >> > > > > >> working
> > > >> > > > > >> > >> with
> > > >> > > > > >> > >>>> Samza and building some new consumers/producers
> was
> > > next
> > > >> > up.
> > > >> > > > > Sounds
> > > >> > > > > >> > like
> > > >> > > > > >> > >>>> that is absolutely not the direction to go. I need
> > to
> > > >> look
> > > >> > > into
> > > >> > > > > the
> > > >> > > > > >> > KIP
> > > >> > > > > >> > >> in
> > > >> > > > > >> > >>>> more detail but for me the attractiveness of
> adding
> > > new
> > > >> > Samza
> > > >> > > > > >> > >>>> consumer/producers -- even if yes all they were
> > doing
> > > was
> > > >> > > > really
> > > >> > > > > >> > getting
> > > >> > > > > >> > >>>> data into and out of Kafka --  was to avoid
> having
> > to
> > > >> > worry
> > > >> > > > > about
> > > >> > > > > >> the
> > > >> > > > > >> > >>>> lifecycle management of external clients. If there
> > is
> > > a
> > > >> > > generic
> > > >> > > > > >> Kafka
> > > >> > > > > >> > >>>> ingress/egress layer that I can plug a new
> connector
> > > into
> > > >> > and
> > > >> > > > > have
> > > >> > > > > >> a
> > > >> > > > > >> > >> lot of
> > > >> > > > > >> > >>>> the heavy lifting re scale and reliability done
> for
> > me
> > > >> then
> > > >> > > it
> > > >> > > > > >> gives
> > > >> > > > > >> > me
> > > >> > > > > >> > >> all
> > > >> > > > > >> > >>>> the pushing new consumers/producers would. If not
> > > then it
> > > >> > > > > >> complicates
> > > >> > > > > >> > my
> > > >> > > > > >> > >>>> operational deployments.
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> Which is similar to my other question with the
> > > proposal
> > > >> --
> > > >> > if
> > > >> > > > we
> > > >> > > > > >> > build a
> > > >> > > > > >> > >>>> fully available/stand-alone Samza plus the
> requisite
> > > >> shims
> > > >> > to
> > > >> > > > > >> > integrate
> > > >> > > > > >> > >>>> with Slider etc I suspect the former may be a lot
> > more
> > > >> work
> > > >> > > > than
> > > >> > > > > we
> > > >> > > > > >> > >> think.
> > > >> > > > > >> > >>>> We may make it much easier for a newcomer to get
> > > >> something
> > > >> > > > > running
> > > >> > > > > >> but
> > > >> > > > > >> > >>>> having them step up and get a reliable production
> > > >> > deployment
> > > >> > > > may
> > > >> > > > > >> still
> > > >> > > > > >> > >>>> dominate mailing list  traffic, if for different
> > > reasons
> > > >> > than
> > > >> > > > > >> today.
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with making
> > the
> > > >> Samza
> > > >> > > > > >> dependency
> > > >> > > > > >> > >> on
> > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely see the
> > > >> benefits
> > > >> > > in
> > > >> > > > > the
> > > >> > > > > >> > >>>> reduction of duplication and clashing
> > > >> > > > terminologies/abstractions
> > > >> > > > > >> that
> > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library would
> likely
> > > be a
> > > >> > very
> > > >> > > > > nice
> > > >> > > > > >> > tool
> > > >> > > > > >> > >> to
> > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the
> concerns
> > > >> above
> > > >> > re
> > > >> > > > the
> > > >> > > > > >> > >>>> operational side.
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> Garry
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> -----Original Message-----
> > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales [mailto:
> > > >> > [email protected]
> > > >> > > ]
> > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
> > > >> > > > > >> > >>>> To: [email protected]
> > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on Samza
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> Very interesting thoughts.
> > > >> > > > > >> > >>>> From outside, I have always perceived Samza as a
> > > >> computing
> > > >> > > > layer
> > > >> > > > > >> over
> > > >> > > > > >> > >>>> Kafka.
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> The question, maybe a bit provocative, is "should
> > > Samza
> > > >> be
> > > >> > a
> > > >> > > > > >> > sub-project
> > > >> > > > > >> > >>>> of Kafka then?"
> > > >> > > > > >> > >>>> Or does it make sense to keep it as a separate
> > project
> > > >> > with a
> > > >> > > > > >> separate
> > > >> > > > > >> > >>>> governance?
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> Cheers,
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> --
> > > >> > > > > >> > >>>> Gianmarco
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <
> > > [email protected]>
> > > >> > > > wrote:
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more
> tightly.
> > > >> > Because
> > > >> > > > > Samza
> > > >> > > > > >> de
> > > >> > > > > >> > >>>>> facto is based on Kafka, and it should leverage
> > what
> > > >> Kafka
> > > >> > > > has.
> > > >> > > > > At
> > > >> > > > > >> > the
> > > >> > > > > >> > >>>>> same time, Kafka does not need to reinvent what
> > Samza
> > > >> > > already
> > > >> > > > > >> has. I
> > > >> > > > > >> > >>>>> also like the idea of separating the ingestion
> and
> > > >> > > > > transformation.
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> But it is a little difficult for me to imagine how
> > the
> > > >> Samza
> > > >> > > > will
> > > >> > > > > >> look
> > > >> > > > > >> > >>>> like.
> > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little difference
> > in
> > > >> terms
> > > >> > > of
> > > >> > > > > how
> > > >> > > > > >> > >>>>> Samza should look like.
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> *** Will it look like what Jay's code shows (A
> > > client of
> > > >> > > > Kafka)
> > > >> > > > > ?
> > > >> > > > > >> And
> > > >> > > > > >> > >>>>> user's application code calls this client?
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka (like
> > what
> > > the
> > > >> > > code
> > > >> > > > > >> shows),
> > > >> > > > > >> > >>>>> how do we implement auto-balance and
> > fault-tolerance?
> > > >> Are
> > > >> > > they
> > > >> > > > > >> taken
> > > >> > > > > >> > >>>>> care of by the Kafka broker or another mechanism, such
> > as
> > > >> > "Samza
> > > >> > > > > >> worker"
> > > >> > > > > >> > >>>>> (just make up the name) ?
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> 2. What about other features, such as
> auto-scaling,
> > > >> shared
> > > >> > > > > state,
> > > >> > > > > >> > >>>>> monitoring?
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is this what
> > Chris
> > > >> > > > suggests?)
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> 1. we still need to ingest data from Kafka and
> > > produce
> > > >> to
> > > >> > > it.
> > > >> > > > > >> Then it
> > > >> > > > > >> > >>>>> becomes the same as what Samza looks like now,
> > > except it
> > > >> > > does
> > > >> > > > > not
> > > >> > > > > >> > rely
> > > >> > > > > >> > >>>>> on Yarn anymore.
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> 2. if it is standalone, how can it leverage
> Kafka's
> > > >> > metrics,
> > > >> > > > > logs,
> > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> Thanks,
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> Fang, Yan
> > > >> > > > > >> > >>>>> [email protected]
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <
> > > >> > > > > [email protected]
> > > >> > > > > >> >
> > > >> > > > > >> > >>>> wrote:
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>>> Read through the code example and it looks good
> to
> > > me.
> > > >> A
> > > >> > > few
> > > >> > > > > >> > >>>>>> thoughts regarding deployment:
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> Today Samza deploys as executable runnable like:
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=...
> > > >> > > > > >> > >>>> --config-path=file://...
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> And this proposal advocates for deploying Samza
> > more
> > > as
> > > >> > > > embedded
> > > >> > > > > >> > >>>>>> libraries in user application code (ignoring the
> > > >> > > terminology
> > > >> > > > > >> since
> > > >> > > > > >> > >>>>>> it is not the
> > > >> > > > > >> > >>>>> same
> > > >> > > > > >> > >>>>>> as the prototype code):
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> StreamTask task = new MyStreamTask(configs);
> > > >> > > > > >> > >>>>>> Thread thread = new Thread(task);
> > > >> > > > > >> > >>>>>> thread.start();
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> I think both of these deployment modes are
> > important
> > > >> for
> > > >> > > > > >> different
> > > >> > > > > >> > >>>>>> types
> > > >> > > > > >> > >>>>> of
> > > >> > > > > >> > >>>>>> users. That said, I think making Samza purely
> > > >> standalone
> > > >> > is
> > > >> > > > > still
> > > >> > > > > >> > >>>>>> sufficient for either runnable or library modes.
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> Guozhang
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <
> > > >> > > > [email protected]>
> > > >> > > > > >> > wrote:
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code example, it
> was
> > > >> > supposed
> > > >> > > > to
> > > >> > > > > >> look
> > > >> > > > > >> > >>>>>>> like
> > > >> > > > > >> > >>>>>>> this:
> > > >> > > > > >> > >>>>>>>
> > > >> > > > > >> > >>>>>>> Properties props = new Properties();
> > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers", "localhost:4242");
> > > >> > > > > >> > >>>>>>> StreamingConfig config = new StreamingConfig(props);
> > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> > > >> > > > > >> > >>>>>>> config.processor(ExampleStreamProcessor.class);
> > > >> > > > > >> > >>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> > > >> > > > > >> > >>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> > > >> > > > > >> > >>>>>>> container.run();
> > > >> > > > > >> > >>>>>>>
> > > >> > > > > >> > >>>>>>> -Jay
> > > >> > > > > >> > >>>>>>>
> > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <
> > > >> > > > [email protected]
> > > >> > > > > >
> > > >> > > > > >> > >>>> wrote:
> > > >> > > > > >> > >>>>>>>
> > > >> > > > > >> > >>>>>>>> Hey guys,
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> This came out of some conversations Chris and
> I
> > > were
> > > >> > > having
> > > >> > > > > >> > >>>>>>>> around
> > > >> > > > > >> > >>>>>>> whether
> > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a kind of
> > data
> > > >> > > > ingestion
> > > >> > > > > >> > >>>>> framework
> > > >> > > > > >> > >>>>>>> for
> > > >> > > > > >> > >>>>>>>> Kafka (which ultimately led to KIP-26
> > "copycat").
> > > >> This
> > > >> > > > kind
> > > >> > > > > of
> > > >> > > > > >> > >>>>>> combined
> > > >> > > > > >> > >>>>>>>> with complaints around config and YARN and the
> > > >> > discussion
> > > >> > > > > >> around
> > > >> > > > > >> > >>>>>>>> how
> > > >> > > > > >> > >>>>> to
> > > >> > > > > >> > >>>>>>>> best do a standalone mode.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> So the thought experiment was, given that
> Samza
> > > was
> > > >> > > > basically
> > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what if you
> just
> > > >> > embraced
> > > >> > > > > that
> > > >> > > > > >> > >>>>>>>> and turned it
> > > >> > > > > >> > >>>>>> into
> > > >> > > > > >> > >>>>>>>> something less like a heavyweight framework
> and
> > > more
> > > >> > > like a
> > > >> > > > > >> > >>>>>>>> third
> > > >> > > > > >> > >>>>> Kafka
> > > >> > > > > >> > >>>>>>>> client--a kind of "producing consumer" with
> > state
> > > >> > > > management
> > > >> > > > > >> > >>>>>> facilities.
> > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a complex
> stream
> > > >> > > processing
> > > >> > > > > >> > >>>>>>>> framework
> > > >> > > > > >> > >>>>>>> this
> > > >> > > > > >> > >>>>>>>> would actually be a very simple thing, not
> much
> > > more
> > > >> > > > > >> complicated
> > > >> > > > > >> > >>>>>>>> to
> > > >> > > > > >> > >>>>> use
> > > >> > > > > >> > >>>>>>> or
> > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris said, when we
> > > >> > > > > >> > >>>>>>>> thought about it, a lot of what Samza (and the other
> > > >> > > > > >> > >>>>>>>> stream processing systems) were doing seemed like kind
> > > >> > > > > >> > >>>>>>>> of a hangover from MapReduce.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output data to
> and
> > > from
> > > >> > the
> > > >> > > > > stream
> > > >> > > > > >> > >>>>>>>> processing. But when we actually looked into
> how
> > > that
> > > >> > > would
> > > >> > > > > >> > >>>>>>>> work,
> > > >> > > > > >> > >>>>> Samza
> > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion framework
> > > for a
> > > >> > > bunch
> > > >> > > > of
> > > >> > > > > >> > >>>>> reasons.
> > > >> > > > > >> > >>>>>> To
> > > >> > > > > >> > >>>>>>>> really do that right you need a pretty
> different
> > > >> > internal
> > > >> > > > > data
> > > >> > > > > >> > >>>>>>>> model
> > > >> > > > > >> > >>>>>> and
> > > >> > > > > >> > >>>>>>>> set of apis. So what if you split them and had
> > an
> > > api
> > > >> > for
> > > >> > > > > Kafka
> > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) and a
> > separate
> > > >> api
> > > >> > > for
> > > >> > > > > >> Kafka
> > > >> > > > > >> > >>>>>>>> transformation (Samza).
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> This would also allow really embracing the
> same
> > > >> > > terminology
> > > >> > > > > and
> > > >> > > > > >> > >>>>>>>> conventions. One complaint about the current
> > > state is
> > > >> > > that
> > > >> > > > > the
> > > >> > > > > >> > >>>>>>>> two
> > > >> > > > > >> > >>>>>>> systems
> > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology like
> > "stream"
> > > vs
> > > >> > > > "topic"
> > > >> > > > > >> and
> > > >> > > > > >> > >>>>>>> different
> > > >> > > > > >> > >>>>>>>> config and monitoring systems means you kind
> of
> > > have
> > > >> to
> > > >> > > > learn
> > > >> > > > > >> > >>>>>>>> Kafka's
> > > >> > > > > >> > >>>>>>> way,
> > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different way,
> then
> > > kind
> > > >> of
> > > >> > > > > >> > >>>>>>>> understand
> > > >> > > > > >> > >>>>> how
> > > >> > > > > >> > >>>>>>> they
> > > >> > > > > >> > >>>>>>>> map to each other, which having walked a few
> > > people
> > > >> > > through
> > > >> > > > > >> this
> > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to get.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of time on
> > > >> airplanes I
> > > >> > > > > hacked
> > > >> > > > > >> > >>>>>>>> up an earnest but still somewhat incomplete
> > > prototype
> > > >> of
> > > >> > > > what
> > > >> > > > > >> > >>>>>>>> this would
> > > >> > > > > >> > >>>>> look
> > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously dumped into
> > > Kafka
> > > >> as
> > > >> > > it
> > > >> > > > > >> > >>>>>>>> required a
> > > >> > > > > >> > >>>>>> few
> > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is the code:
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I just
> > liberally
> > > >> > renamed
> > > >> > > > > >> > >>>>>>>> everything
> > > >> > > > > >> > >>>>> to
> > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no regard for
> > > >> > > > compatibility.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> To use this would be something like this:
> > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
> > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers", "localhost:4242");
> > > >> > > > > >> > >>>>>>>> StreamingConfig config = new StreamingConfig(props);
> > > >> > > > > >> > >>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> > > >> > > > > >> > >>>>>>>> config.processor(ExampleStreamProcessor.class);
> > > >> > > > > >> > >>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> > > >> > > > > >> > >>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> > > >> > > > > >> > >>>>>>>> container.run();
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
> SamzaContainer;
> > > >> > > > > StreamProcessor
> > > >> > > > > >> > >>>>>>>> is basically StreamTask.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> So rather than putting all the class names in
> a
> > > file
> > > >> > and
> > > >> > > > then
> > > >> > > > > >> > >>>>>>>> having
> > > >> > > > > >> > >>>>>> the
> > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just
> > instantiate
> > > the
> > > >> > > > > container
> > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced over
> however
> > > many
> > > >> > > > > instances
> > > >> > > > > >> > >>>>>>>> of
> > > >> > > > > >> > >>>>> this
> > > >> > > > > >> > >>>>>>> are
> > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an instance dies,
> new
> > > >> tasks
> > > >> > > are
> > > >> > > > > >> added
> > > >> > > > > >> > >>>>>>>> to
> > > >> > > > > >> > >>>>> the
> > > >> > > > > >> > >>>>>>>> existing containers without shutting them
> down).
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> We would provide some glue for running this
> > stuff
> > > in
> > > >> > YARN
> > > >> > > > via
> > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS using some
> > of
> > > >> their
> > > >> > > > tools
> > > >> > > > > >> > >>>>>>>> but from the
> > > >> > > > > >> > >>>>>> point
> > > >> > > > > >> > >>>>>>> of
> > > >> > > > > >> > >>>>>>>> view of these frameworks these stream
> processing
> > > jobs
> > > >> > are
> > > >> > > > > just
> > > >> > > > > >> > >>>>>> stateless
> > > >> > > > > >> > >>>>>>>> services that can come and go and expand and
> > > contract
> > > >> > at
> > > >> > > > > will.
> > > >> > > > > >> > >>>>>>>> There
> > > >> > > > > >> > >>>>> is
> > > >> > > > > >> > >>>>>>> no
> > > >> > > > > >> > >>>>>>>> more custom scheduler.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> Here are some relevant details:
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code, it would
> get
> > > >> larger
> > > >> > > if
> > > >> > > > we
> > > >> > > > > >> > >>>>>>>>  productionized but not vastly larger. We
> really
> > > do
> > > >> > get a
> > > >> > > > ton
> > > >> > > > > >> > >>>>>>>> of
> > > >> > > > > >> > >>>>>>> leverage
> > > >> > > > > >> > >>>>>>>>  out of Kafka.
> > > >> > > > > >> > >>>>>>>>  2. Partition management is fully delegated to
> > the
> > > >> new
> > > >> > > > > >> consumer.
> > > >> > > > > >> > >>>>> This
> > > >> > > > > >> > >>>>>>>>  is nice since now any partition management
> > > strategy
> > > >> > > > > available
> > > >> > > > > >> > >>>>>>>> to
> > > >> > > > > >> > >>>>>> Kafka
> > > >> > > > > >> > >>>>>>>>  consumer is also available to Samza (and vice
> > > versa)
> > > >> > and
> > > >> > > > > with
> > > >> > > > > >> > >>>>>>>> the
> > > >> > > > > >> > >>>>>>> exact
> > > >> > > > > >> > >>>>>>>>  same configs.
> > > >> > > > > >> > >>>>>>>>  3. It supports state as well as state reuse
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is thought
> > > >> provoking.
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> -Jay
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris
> > Riccomini <
> > > >> > > > > >> > >>>>>> [email protected]>
> > > >> > > > > >> > >>>>>>>> wrote:
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Hey all,
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza
> > engineers
> > > at
> > > >> > > > LinkedIn
> > > >> > > > > >> > >>>>>>>>> and
> > > >> > > > > >> > >>>>>>> Confluent
> > > >> > > > > >> > >>>>>>>>> and we came up with a few observations and
> > would
> > > >> like
> > > >> > to
> > > >> > > > > >> > >>>>>>>>> propose
> > > >> > > > > >> > >>>>> some
> > > >> > > > > >> > >>>>>>>>> changes.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> We've observed some things that I want to
> call
> > > out
> > > >> > about
> > > >> > > > > >> > >>>>>>>>> Samza's
> > > >> > > > > >> > >>>>>> design,
> > > >> > > > > >> > >>>>>>>>> and I'd like to propose some changes.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic
> deployment
> > > >> system.
> > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
> > > >> > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and
> > > Kafka's
> > > >> > > > consumer
> > > >> > > > > >> > >>>>>>>>> APIs
> > > >> > > > > >> > >>>>> are
> > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same problems.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> All three of these issues are related, but
> I'll
> > > >> > address
> > > >> > > > them
> > > >> > > > > >> in
> > > >> > > > > >> > >>>>> order.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Deployment
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a
> dynamic
> > > >> > > deployment
> > > >> > > > > >> > >>>>>>>>> scheduler
> > > >> > > > > >> > >>>>>> such
> > > >> > > > > >> > >>>>>>>>> as
> > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially built
> > Samza,
> > > we
> > > >> > bet
> > > >> > > > that
> > > >> > > > > >> > >>>>>>>>> there
> > > >> > > > > >> > >>>>>> would
> > > >> > > > > >> > >>>>>>>>> be
> > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and we could
> > > >> support
> > > >> > > > them,
> > > >> > > > > >> and
> > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > >> > >>>>>> rest
> > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are many
> > > >> variations.
> > > >> > > > > >> > >>>>>>>>> Furthermore,
> > > >> > > > > >> > >>>>>> many
> > > >> > > > > >> > >>>>>>>>> people still prefer to just start their
> > > processors
> > > >> > like
> > > >> > > > > normal
> > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional
> deployment
> > > >> scripts
> > > >> > > > such
> > > >> > > > > as
> > > >> > > > > >> > >>>>>>>>> Fabric,
> > > >> > > > > >> > >>>>>> Chef,
> > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment system on
> > > users
> > > >> > makes
> > > >> > > > the
> > > >> > > > > >> > >>>>>>>>> Samza start-up process really painful for
> first
> > > time
> > > >> > > > users.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement was also
> a
> > > bit
> > > >> of
> > > >> > a
> > > >> > > > > >> > >>>>>>>>> mis-fire
> > > >> > > > > >> > >>>>>> because
> > > >> > > > > >> > >>>>>>>>> of
> > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding between the
> > > nature of
> > > >> > > batch
> > > >> > > > > >> jobs
> > > >> > > > > >> > >>>>>>>>> and
> > > >> > > > > >> > >>>>>>> stream
> > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made a conscious
> > > effort
> > > >> to
> > > >> > > > favor
> > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > >> > >>>>>> Hadoop
> > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things, since it
> > worked
> > > >> and
> > > >> > > was
> > > >> > > > > well
> > > >> > > > > >> > >>>>>>> understood.
> > > >> > > > > >> > >>>>>>>>> One thing that we missed was that batch jobs
> > > have a
> > > >> > > > definite
> > > >> > > > > >> > >>>>>> beginning,
> > > >> > > > > >> > >>>>>>>>> and
> > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs don't
> > (usually).
> > > >> This
> > > >> > > > leads
> > > >> > > > > to
> > > >> > > > > >> > >>>>>>>>> a
> > > >> > > > > >> > >>>>> much
> > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for stream
> > processors.
> > > >> You
> > > >> > > > > >> basically
> > > >> > > > > >> > >>>>>>>>> just
> > > >> > > > > >> > >>>>>>> need
> > > >> > > > > >> > >>>>>>>>> to find a place to start the processor, and
> > start
> > > >> it.
> > > >> > > The
> > > >> > > > > way
> > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no concept
> > of
> > > a
> > > >> > > cluster
> > > >> > > > > >> > >>>>>>>>> being "full". We always
> > > >> > > > > >> > >>>>>> add
> > > >> > > > > >> > >>>>>>>>> more machines. The problem with coupling
> Samza
> > > with
> > > >> a
> > > >> > > > > >> scheduler
> > > >> > > > > >> > >>>>>>>>> is
> > > >> > > > > >> > >>>>>> that
> > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to handle
> > > deployment.
> > > >> > > This
> > > >> > > > > >> pulls
> > > >> > > > > >> > >>>>>>>>> in a
> > > >> > > > > >> > >>>>>>> bunch
> > > >> > > > > >> > >>>>>>>>> of things such as configuration distribution
> > > (config
> > > >> > > > > stream),
> > > >> > > > > >> > >>>>>>>>> shell
> > > >> > > > > >> > >>>>>>> scripts
> > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all
> the
> > > .tgz
> > > >> > > > stuff),
> > > >> > > > > >> etc.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic
> deployment
> > > was
> > > >> to
> > > >> > > > > support
> > > >> > > > > >> > >>>>>>>>> data locality. If you want to have locality,
> > you
> > > >> need
> > > >> > to
> > > >> > > > put
> > > >> > > > > >> > >>>>>>>>> your
> > > >> > > > > >> > >>>>>> processors
> > > >> > > > > >> > >>>>>>>>> close to the data they're processing. Upon
> > > further
> > > >> > > > > >> > >>>>>>>>> investigation,
> > > >> > > > > >> > >>>>>>> though,
> > > >> > > > > >> > >>>>>>>>> this feature is not that beneficial. There is
> > > some
> > > >> > good
> > > >> > > > > >> > >>>>>>>>> discussion
> > > >> > > > > >> > >>>>>> about
> > > >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335. Again, we
> > > took
> > > >> the
> > > >> > > > > >> > >>>>>>>>> Map/Reduce
> > > >> > > > > >> > >>>>>> path,
> > > >> > > > > >> > >>>>>>>>> but
> > > >> > > > > >> > >>>>>>>>> there are some fundamental differences
> between
> > > HDFS
> > > >> > and
> > > >> > > > > Kafka.
> > > >> > > > > >> > >>>>>>>>> HDFS
> > > >> > > > > >> > >>>>>> has
> > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions. This
> leads
> > to
> > > >> less
> > > >> > > > > >> > >>>>>>>>> optimization potential with stream processors
> > on
> > > top
> > > >> > of
> > > >> > > > > Kafka.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch. Samza
> > > doesn't
> > > >> > > have
> > > >> > > > > any
> > > >> > > > > >> > >>>>>>>>> built
> > > >> > > > > >> > >>>>> in
> > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it depends on
> > the
> > > >> > > dynamic
> > > >> > > > > >> > >>>>>>>>> deployment scheduling system to handle
> restarts
> > > >> when a
> > > >> > > > > >> > >>>>>>>>> processor dies. This has
> > > >> > > > > >> > >>>>>>> made
> > > >> > > > > >> > >>>>>>>>> it very difficult to write a standalone Samza
> > > >> > container
> > > >> > > > > >> > >>>> (SAMZA-516).
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Pluggability
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, but I
> think
> > > that
> > > >> > > we've
> > > >> > > > > >> gone
> > > >> > > > > >> > >>>>>>>>> too
> > > >> > > > > >> > >>>>>> far
> > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has:
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> * Pluggable config.
> > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
> > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
> > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems
> (SystemConsumer,
> > > >> > > > > SystemProducer,
> > > >> > > > > >> > >>>> etc).
> > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
> > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
> > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every
> > > >> component
> > > >> > > > > >> > >>>>> (MessageChooser,
> > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper, ConfigRewriter,
> > > etc).
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> There's probably more that I've forgotten, as
> > > well.
> > > >> > Some
> > > >> > > > of
> > > >> > > > > >> > >>>>>>>>> these
> > > >> > > > > >> > >>>>> are
> > > >> > > > > >> > >>>>>>>>> useful, but some have proven not to be. This
> > all
> > > >> comes
> > > >> > > at
> > > >> > > > a
> > > >> > > > > >> cost:
> > > >> > > > > >> > >>>>>>>>> complexity. This complexity is making it
> harder
> > > for
> > > >> > our
> > > >> > > > > users
> > > >> > > > > >> > >>>>>>>>> to
> > > >> > > > > >> > >>>>> pick
> > > >> > > > > >> > >>>>>> up
> > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It also makes
> it
> > > >> > difficult
> > > >> > > > for
> > > >> > > > > >> > >>>>>>>>> Samza developers to reason about what the
> > > >> > > characteristics
> > > >> > > > of
> > > >> > > > > >> > >>>>>>>>> the container are (since the characteristics
> change
> > > >> > > depending
> > > >> > > > on
> > > >> > > > > >> > >>>>>>>>> which plugins are used).
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most visible
> > in
> > > the
> > > >> > > > System
> > > >> > > > > >> APIs.
> > > >> > > > > >> > >>>>> What
> > > >> > > > > >> > >>>>>>>>> Samza really requires to be functional is
> Kafka
> > > as
> > > >> its
> > > >> > > > > >> > >>>>>>>>> transport
> > > >> > > > > >> > >>>>>> layer.
> > > >> > > > > >> > >>>>>>>>> But
> > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use cases into
> > one
> > > >> API:
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
> > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> The current System API supports both of these
> > use
> > > >> > cases.
> > > >> > > > The
> > > >> > > > > >> > >>>>>>>>> problem
> > > >> > > > > >> > >>>>>> is,
> > > >> > > > > >> > >>>>>>>>> we
> > > >> > > > > >> > >>>>>>>>> actually want different features for each use
> > > case.
> > > >> By
> > > >> > > > > >> papering
> > > >> > > > > >> > >>>>>>>>> over
> > > >> > > > > >> > >>>>>>> these
> > > >> > > > > >> > >>>>>>>>> two use cases, and providing a single API,
> > we've
> > > >> > > > introduced
> > > >> > > > > a
> > > >> > > > > >> > >>>>>>>>> ton of
> > > >> > > > > >> > >>>>>>> leaky
> > > >> > > > > >> > >>>>>>>>> abstractions.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in (2) is
> to
> > > have
> > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for offsets
> > (like
> > > >> > Kafka).
> > > >> > > > > This
> > > >> > > > > >> > >>>>>>>>> would be at odds
> > > >> > > > > >> > >>>>> with
> > > >> > > > > >> > >>>>>>> (1),
> > > >> > > > > >> > >>>>>>>>> though, since different systems have
> different
> > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
> > > >> > > > > >> > >>>>>>>>> There was discussion both on the mailing list
> > and
> > > >> the
> > > >> > > SQL
> > > >> > > > > >> JIRAs
> > > >> > > > > >> > >>>>> about
> > > >> > > > > >> > >>>>>>> the
> > > >> > > > > >> > >>>>>>>>> need for this.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> The same thing holds true for replayability.
> > > Kafka
> > > >> > > allows
> > > >> > > > us
> > > >> > > > > >> to
> > > >> > > > > >> > >>>>> rewind
> > > >> > > > > >> > >>>>>>>>> when
> > > >> > > > > >> > >>>>>>>>> we have a failure. Many other systems don't.
> In
> > > some
> > > >> > > > cases,
> > > >> > > > > >> > >>>>>>>>> systems
> > > >> > > > > >> > >>>>>>> return
> > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g.
> > > >> WikipediaSystemConsumer)
> > > >> > > > > because
> > > >> > > > > >> > >>>>>>>>> they
> > > >> > > > > >> > >>>>>> have
> > > >> > > > > >> > >>>>>>> no
> > > >> > > > > >> > >>>>>>>>> offsets.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka
> supports
> > > >> > > > > partitioning,
> > > >> > > > > >> > >>>>>>>>> but
> > > >> > > > > >> > >>>>> many
> > > >> > > > > >> > >>>>>>>>> systems don't. We model this by having a
> single
> > > >> > > partition
> > > >> > > > > for
> > > >> > > > > >> > >>>>>>>>> those systems. Still, other systems model
> > > >> partitioning
> > > >> > > > > >> > >>>> differently (e.g.
> > > >> > > > > >> > >>>>>>>>> Kinesis).
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess.
> > > Creating
> > > >> > > streams
> > > >> > > > > in
> > > >> > > > > >> a
> > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost impossible. As
> is
> > > >> > modeling
> > > >> > > > > >> > >>>>>>>>> metadata
> > > >> > > > > >> > >>>>> for
> > > >> > > > > >> > >>>>>>> the
> > > >> > > > > >> > >>>>>>>>> system (replication factor, partitions,
> > location,
> > > >> > etc).
> > > >> > > > The
> > > >> > > > > >> > >>>>>>>>> list
> > > >> > > > > >> > >>>>> goes
> > > >> > > > > >> > >>>>>>> on.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Duplicate work
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> At the time that we began writing Samza,
> > Kafka's
> > > >> > > consumer
> > > >> > > > > and
> > > >> > > > > >> > >>>>> producer
> > > >> > > > > >> > >>>>>>>>> APIs
> > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On the
> > > >> > consumer-side,
> > > >> > > > you
> > > >> > > > > >> > >>>>>>>>> had two
> > > >> > > > > >> > >>>>>>>>> options: use the high level consumer, or the
> > > simple
> > > >> > > > > consumer.
> > > >> > > > > >> > >>>>>>>>> The
> > > >> > > > > >> > >>>>>>> problem
> > > >> > > > > >> > >>>>>>>>> with the high-level consumer was that it
> > > controlled
> > > >> > your
> > > >> > > > > >> > >>>>>>>>> offsets, partition assignments, and the order
> > in
> > > >> which
> > > >> > > you
> > > >> > > > > >> > >>>>>>>>> received messages. The
> > > >> > > > > >> > >>>>> problem
> > > >> > > > > >> > >>>>>>>>> with
> > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not simple.
> > It's
> > > >> > basic.
> > > >> > > > You
> > > >> > > > > >> > >>>>>>>>> end up
> > > >> > > > > >> > >>>>>>> having
> > > >> > > > > >> > >>>>>>>>> to handle a lot of really low-level stuff
> that
> > > you
> > > >> > > > > shouldn't.
> > > >> > > > > >> > >>>>>>>>> We
> > > >> > > > > >> > >>>>>> spent a
> > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's
> KafkaSystemConsumer
> > > very
> > > >> > > > robust.
> > > >> > > > > >> It
> > > >> > > > > >> > >>>>>>>>> also allows us to support some cool features:
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and
> > > prioritization.
> > > >> > > > > >> > >>>>>>>>> * Tight control over partition assignment to
> > > support
> > > >> > > > joins,
> > > >> > > > > >> > >>>>>>>>> global
> > > >> > > > > >> > >>>>>> state
> > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)), etc.
> > > >> > > > > >> > >>>>>>>>> * Tight control over offset checkpointing.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is that
> > these
> > > >> > > features
> > > >> > > > > >> > >>>>>>>>> should
> > > >> > > > > >> > >>>>>>> actually
> > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers (not
> just
> > > >> Samza
> > > >> > > > stream
> > > >> > > > > >> > >>>>>> processors)
> > > >> > > > > >> > >>>>>>>>> end up wanting to do things like joins and
> > > partition
> > > >> > > > > >> > >>>>>>>>> assignment. The
> > > >> > > > > >> > >>>>>>> Kafka
> > > >> > > > > >> > >>>>>>>>> community has come to the same conclusion.
> > > They're
> > > >> > > adding
> > > >> > > > a
> > > >> > > > > >> ton
> > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka consumer
> > > >> > > implementation.
> > > >> > > > > To a
> > > >> > > > > >> > >>>>>>>>> large extent,
> > > >> > > > > >> > >>>>> it's
> > > >> > > > > >> > >>>>>>>>> duplicate work to what we've already done in
> > > Samza.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a very
> > > similar
> > > >> > > > > approach
> > > >> > > > > >> > >>>>>>>>> to
> > > >> > > > > >> > >>>>>> Samza's
> > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation for
> > > handling
> > > >> > > offset
> > > >> > > > > >> > >>>>>> checkpointing.
> > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset management
> > feature
> > > >> > stores
> > > >> > > > > >> offset
> > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows you to
> fetch
> > > them
> > > >> > > from
> > > >> > > > > the
> > > >> > > > > >> > >>>>>>>>> broker.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we
> > could
> > > >> have
> > > >> > > > shared
> > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > >> > >>>>> work
> > > >> > > > > >> > >>>>>> if
> > > >> > > > > >> > >>>>>>>>> it
> > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the get-go.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Vision
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather radical
> > > proposal.
> > > >> > Samza
> > > >> > > > is
> > > >> > > > > >> > >>>>> relatively
> > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to say that
> > > we're
> > > >> > > near a
> > > >> > > > > 1.0
> > > >> > > > > >> > >>>>>> release.
> > > >> > > > > >> > >>>>>>>>> I'd
> > > >> > > > > >> > >>>>>>>>> like to propose that we take what we've
> > learned,
> > > and
> > > >> > > begin
> > > >> > > > > >> > >>>>>>>>> thinking
> > > >> > > > > >> > >>>>>>> about
> > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we change if we
> > were
> > > >> > > starting
> > > >> > > > > >> from
> > > >> > > > > >> > >>>>>> scratch?
> > > >> > > > > >> > >>>>>>>>> My
> > > >> > > > > >> > >>>>>>>>> proposal is to:
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to
> run
> > > Samza
> > > >> > > > > >> > >>>>>>>>> processors, and eliminate all direct
> > dependencies
> > > on
> > > >> > > YARN,
> > > >> > > > > >> Mesos,
> > > >> > > > > >> > >>>> etc.
> > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support only
> Kafka
> > > as
> > > >> the
> > > >> > > > > stream
> > > >> > > > > >> > >>>>>> processing
> > > >> > > > > >> > >>>>>>>>> layer.
> > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging,
> > > >> serialization,
> > > >> > > and
> > > >> > > > > >> > >>>>>>>>> config
> > > >> > > > > >> > >>>>>>> systems,
> > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that I
> > outlined
> > > >> > above.
> > > >> > > It
> > > >> > > > > >> > >>>>>>>>> should
> > > >> > > > > >> > >>>>> also
> > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty
> dramatically.
> > > >> > > Supporting
> > > >> > > > > >> only
> > > >> > > > > >> > >>>>>>>>> a standalone container will allow Samza to be
> > > >> executed
> > > >> > > on
> > > >> > > > > YARN
> > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using
> Marathon/Aurora),
> > or
> > > >> most
> > > >> > > > other
> > > >> > > > > >> > >>>>>>>>> in-house
> > > >> > > > > >> > >>>>>>> deployment
> > > >> > > > > >> > >>>>>>>>> systems. This should make life a lot easier
> for
> > > new
> > > >> > > users.
> > > >> > > > > >> > >>>>>>>>> Imagine
> > > >> > > > > >> > >>>>>>> having
> > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN. The
> drop
> > > in
> > > >> > > mailing
> > > >> > > > > >> list
> > > >> > > > > >> > >>>>>> traffic
> > > >> > > > > >> > >>>>>>>>> will be pretty dramatic.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long overdue to me.
> > The
> > > >> > > reality
> > > >> > > > > is,
> > > >> > > > > >> > >>>>> everyone
> > > >> > > > > >> > >>>>>>>>> that
> > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We
> > > basically
> > > >> > > > require
> > > >> > > > > >> it
> > > >> > > > > >> > >>>>>> already
> > > >> > > > > >> > >>>>>>> in
> > > >> > > > > >> > >>>>>>>>> order for most features to work. Those that
> are
> > > >> using
> > > >> > > > other
> > > >> > > > > >> > >>>>>>>>> systems
> > > >> > > > > >> > >>>>>> are
> > > >> > > > > >> > >>>>>>>>> generally using it for ingest into Kafka (1),
> > and
> > > >> then
> > > >> > > > they
> > > >> > > > > do
> > > >> > > > > >> > >>>>>>>>> the processing on top. There is already
> > > discussion (
> > > >> > > > > >> > >>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> > > >> > > > > >> > >>>>>>>>> )
> > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka
> extremely
> > > >> easy.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with Kafka,
> we
> > > can
> > > >> > > > leverage
> > > >> > > > > a
> > > >> > > > > >> > >>>>>>>>> ton of
> > > >> > > > > >> > >>>>>>> their
> > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to maintain our
> > own
> > > >> > config,
> > > >> > > > > >> > >>>>>>>>> metrics,
> > > >> > > > > >> > >>>>> etc.
> > > >> > > > > >> > >>>>>>> We
> > > >> > > > > >> > >>>>>>>>> can all share the same libraries, and make
> them
> > > >> > better.
> > > >> > > > This
> > > >> > > > > >> > >>>>>>>>> will
> > > >> > > > > >> > >>>>> also
> > > >> > > > > >> > >>>>>>>>> allow us to share the consumer/producer APIs,
> > and
> > > >> will
> > > >> > > let
> > > >> > > > > us
> > > >> > > > > >> > >>>>> leverage
> > > >> > > > > >> > >>>>>>>>> their offset management and partition
> > management,
> > > >> > rather
> > > >> > > > > than
> > > >> > > > > >> > >>>>>>>>> having
> > > >> > > > > >> > >>>>>> our
> > > >> > > > > >> > >>>>>>>>> own. All of the coordinator stream code would
> > go
> > > >> away,
> > > >> > > as
> > > >> > > > > >> would
> > > >> > > > > >> > >>>>>>>>> most
> > > >> > > > > >> > >>>>>> of
> > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably have to
> push
> > > some
> > > >> > > > > partition
> > > >> > > > > >> > >>>>>>> management
> > > >> > > > > >> > >>>>>>>>> features into the Kafka broker, but they're
> > > already
> > > >> > > moving
> > > >> > > > > in
> > > >> > > > > >> > >>>>>>>>> that direction with the new consumer API. The
> > > >> features
> > > >> > > we
> > > >> > > > > have
> > > >> > > > > >> > >>>>>>>>> for
> > > >> > > > > >> > >>>>>> partition
> > > >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza, and seem
> > like
> > > >> they
> > > >> > > > should
> > > >> > > > > >> be
> > > >> > > > > >> > >>>>>>>>> in
> > > >> > > > > >> > >>>>>> Kafka
> > > >> > > > > >> > >>>>>>>>> anyway. There will always be some niche
> usages
> > > which
> > > >> > > will
> > > >> > > > > >> > >>>>>>>>> require
> > > >> > > > > >> > >>>>>> extra
> > > >> > > > > >> > >>>>>>>>> care and hence full control over partition
> > > >> assignments
> > > >> > > > much
> > > >> > > > > >> > >>>>>>>>> like the
> > > >> > > > > >> > >>>>>>> Kafka
> > > >> > > > > >> > >>>>>>>>> low-level consumer API. These would continue
> to
> > > be
> > > >> > > > > supported.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> These items will be good for the Samza
> > community.
> > > >> > > They'll
> > > >> > > > > make
> > > >> > > > > >> > >>>>>>>>> Samza easier to use, and make it easier for
> > > >> developers
> > > >> > > to
> > > >> > > > > add
> > > >> > > > > >> > >>>>>>>>> new features.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large (and
> somewhat
> > > >> > backwards
> > > >> > > > > >> > >>>>> incompatible)
> > > >> > > > > >> > >>>>>>>>> change. If we choose to go this route, it's
> > > >> important
> > > >> > > > that
> > > >> > > > > we
> > > >> > > > > >> > >>>>> openly
> > > >> > > > > >> > >>>>>>>>> communicate how we're going to provide a
> > > migration
> > > >> > path
> > > >> > > > from
> > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > >> > >>>>>>> existing
> > > >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make incompatible
> > > >> > changes).
> > > >> > > I
> > > >> > > > > >> think
> > > >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need to provide a
> > > >> wrapper
> > > >> > to
> > > >> > > > > allow
> > > >> > > > > >> > >>>>>>>>> existing StreamTask implementations to
> continue
> > > >> > running
> > > >> > > on
> > > >> > > > > the
> > > >> > > > > >> > >>>> new container.
> > > >> > > > > >> > >>>>>>> It's
> > > >> > > > > >> > >>>>>>>>> also important that we openly communicate
> about
> > > >> > timing,
> > > >> > > > and
> > > >> > > > > >> > >>>>>>>>> stages
> > > >> > > > > >> > >>>>> of
> > > >> > > > > >> > >>>>>>> the
> > > >> > > > > >> > >>>>>>>>> migration.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure you have
> > > opinions.
> > > >> > :)
> > > >> > > > > Please
> > > >> > > > > >> > >>>>>>>>> send
> > > >> > > > > >> > >>>>>> your
> > > >> > > > > >> > >>>>>>>>> thoughts and feedback.
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>> Cheers,
> > > >> > > > > >> > >>>>>>>>> Chris
> > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>>
> > > >> > > > > >> > >>>>>>>
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>> --
> > > >> > > > > >> > >>>>>> -- Guozhang
> > > >> > > > > >> > >>>>>>
> > > >> > > > > >> > >>>>>
> > > >> > > > > >> > >>>>
> > > >> > > > > >> > >>
> > > >> > > > > >> > >>
> > > >> > > > > >> >
> > > >> > > > > >> >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>
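
For anyone skimming the quoted proposal: the topic-based offset checkpointing it describes (offset checkpoints are appended to a topic and read back on restart) can be sketched roughly as below. This is only an illustration under loud assumptions: the in-memory list stands in for a Kafka topic, and CheckpointLog, write_checkpoint, and restore are made-up names, not Samza or Kafka APIs.

```python
from typing import Dict, List, Tuple


class CheckpointLog:
    """Hypothetical stand-in for a checkpoint topic: an append-only list."""

    def __init__(self) -> None:
        # Each record is (task name, partition, offset).
        self._records: List[Tuple[str, int, int]] = []

    def write_checkpoint(self, task: str, partition: int, offset: int) -> None:
        # Append only; older checkpoints are superseded, never edited in place.
        self._records.append((task, partition, offset))

    def restore(self, task: str) -> Dict[int, int]:
        # Replay the whole log; the last record per partition wins.
        latest: Dict[int, int] = {}
        for rec_task, partition, offset in self._records:
            if rec_task == task:
                latest[partition] = offset
        return latest


log = CheckpointLog()
log.write_checkpoint("task-0", 0, 41)
log.write_checkpoint("task-0", 1, 7)
log.write_checkpoint("task-0", 0, 42)  # supersedes offset 41 for partition 0
restored = log.restore("task-0")
```

The point of the sketch is only that restoring is a replay where the latest record per partition wins, so a restarted processor can pick up from its last checkpoint without any state outside the log.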
