Re: Thoughts and obesrvations on Samza

Jay Kreps Mon, 13 Jul 2015 10:31:56 -0700

Hmm, thought about this more. Maybe this is just too much too quick.
Overall I think there is some enthusiasm for the proposal but it's not
really unanimous enough to make any kind of change this big cleanly. The
board doesn't really like the merging stuff, user's are concerned about
compatibility, I didn't feel there was unanimous agreement on dropping
SystemConsumer, etc. Even if this is the right end state to get to,
probably trying to push all this through at once isn't the right way to do
it.


So let me propose a kind of fifth (?) option which I think is less dramatic
and let's things happen gradually. I think this is kind of like combining
the first part of Yi's proposal and Jakob's third option, leaving the rest
to be figured out incrementally:

Option 5: We continue the prototype I shared and propose that as a kind of
"transformer" client API in Kafka. This isn't really a full-fledged stream
processing layer, more like a supped up consumer api for munging topics.
This would let us figure out some of the technical bits, how to do this on
Kafka's group management features, how to integrate the txn feature to do
the exactly-once stuff in these transformations, and get all this stuff
solid. This api would have valid uses in it's own right, especially when
your transformation will be embedded inside an existing service or
application which isn't possible with Samza (or other existing systems that
I know of).

Independently we can iterate on some of the ideas of the original proposal
individually and figure out how (if at all) to make use of this
functionality. This can be done bit-by-bit:
- Could be that the existing StreamTask API ends up wrapping this
- Could end up exposed directly in Samza as Yi proposed
- Could be that just the lower-level group-management stuff get's used, and
in this case it could be either just for standalone mode, or always
- Could be that it stays as-is

The advantage of this is it is lower risk...we basically don't have to make
12 major decisions all at once that kind of hinge on what amounts to a
pretty aggressive rewrite. The disadvantage of this is it is a bit more
confusing as all this is getting figured out.

As with some of the other stuff, this would require a further discussion in
the Kafka community if people do like this approach.

Thoughts?

-Jay




On Sun, Jul 12, 2015 at 10:52 PM, Jay Kreps <[email protected]> wrote:

> Hey Chris,
>
> Yeah, I'm obviously in favor of this.
>
> The sub-project approach seems the ideal way to take a graceful step in
> this direction, so I will ping the board folks and see why they are
> discouraged, it would be good to understand that. If we go that route we
> would need to do a similar discussion in the Kafka list (but makes sense to
> figure out first if it is what Samza wants).
>
> Irrespective of how it's implemented, though, to me the important things
> are the following:
> 1. Unify the website, config, naming, docs, metrics, etc--basically fix
> the product experience so the "stream" and the "processing" feel like a
> single user experience and brand. This seems minor but I think is a really
> big deal.
> 2. Make "standalone" mode a first class citizen and have a real technical
> plan to be able to support cluster managers other than YARN.
> 3. Make the config and out-of-the-box experience more usable
>
> I think that prototype gives a practical example of how 1-3 could be done
> and we should pursue it. This is a pretty radical change, so I wouldn't be
> shocked if people didn't want to take a step like that.
>
> Maybe it would make sense to see if people are on board with that general
> idea, and then try to get some advice on sub-projects in parallel and nail
> down those details?
>
> -Jay
>
> On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <[email protected]>
> wrote:
>
>> Hey all,
>>
>> I want to start by saying that I'm absolutely thrilled to be a part of
>> this
>> community. The amount of level-headed, thoughtful, educated discussion
>> that's gone on over the past ~10 days is overwhelming. Wonderful.
>>
>> It seems like discussion is waning a bit, and we've reached some
>> conclusions. There are several key emails in this threat, which I want to
>> call out:
>>
>> 1. Jakob's summary of the three potential ways forward.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
>> 2. Julian's call out that we should be focusing on community over code.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
>> 3. Martin's summary about the benefits of merging communities.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
>> 4. Jakob's comments about the distinction between community and code
>> paths.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E
>>
>> I agree with the comments on all of these emails. I think Martin's summary
>> of his position aligns very closely with my own. To that end, I think we
>> should get concrete about what the proposal is, and call a vote on it.
>> Given that Jay, Martin, and I seem to be aligning fairly closely, I think
>> we should start with:
>>
>> 1. [community] Make Samza a subproject of Kafka.
>> 2. [community] Make all Samza PMC/committers committers of the subproject.
>> 3. [community] Migrate Samza's website/documentation into Kafka's.
>> 4. [code] Have the Samza community and the Kafka community start a
>> from-scratch reboot together in the new Kafka subproject. We can
>> borrow/copy &  paste significant chunks of code from Samza's code base.
>> 5. [code] The subproject would intentionally eliminate support for both
>> other streaming systems and all deployment systems.
>> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26
>> (copy cat)
>> 7. [code] Attempt to provide a bridge from the new subproject's processor
>> interface to our legacy StreamTask interface.
>> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka
>> subproject that has a fault-tolerant container with state management.
>>
>> It's likely that (6) and (7) won't be fully drop-in. Still, the closer we
>> can get, the better it's going to be for our existing community.
>>
>> One thing that I didn't touch on with (2) is whether any Samza PMC members
>> should be rolled into Kafka PMC membership as well (though, Jay and Jakob
>> are already PMC members on both). I think that Samza's community deserves
>> a
>> voice on the PMC, so I'd propose that we roll at least a few PMC members
>> into the Kafka PMC, but I don't have a strong framework for which people
>> to
>> pick.
>>
>> Before (8), I think that Samza's TLP can continue to commit bug fixes and
>> patches as it sees fit, provided that we openly communicate that we won't
>> necessarily migrate new features to the new subproject, and that the TLP
>> will be shut down after the migration to the Kafka subproject occurs.
>>
>> Jakob, I could use your guidance here about about how to achieve this from
>> an Apache process perspective (sorry).
>>
>> * Should I just call a vote on this proposal?
>> * Should it happen on dev or private?
>> * Do committers have binding votes, or just PMC?
>>
>> Having trouble finding much detail on the Apache wikis. :(
>>
>> Cheers,
>> Chris
>>
>> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <[email protected]> wrote:
>>
>> > Thanks, Jay. This argument persuaded me actually. :)
>> >
>> > Fang, Yan
>> > [email protected]
>> >
>> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <[email protected]> wrote:
>> >
>> > > Hey Yan,
>> > >
>> > > Yeah philosophically I think the argument is that you should capture
>> the
>> > > stream in Kafka independent of the transformation. This is obviously a
>> > > Kafka-centric view point.
>> > >
>> > > Advantages of this:
>> > > - In practice I think this is what e.g. Storm people often end up
>> doing
>> > > anyway. You usually need to throttle any access to a live serving
>> > database.
>> > > - Can have multiple subscribers and they get the same thing without
>> > > additional load on the source system.
>> > > - Applications can tap into the stream if need be by subscribing.
>> > > - You can debug your transformation by tailing the Kafka topic with
>> the
>> > > console consumer
>> > > - Can tee off the same data stream for batch analysis or Lambda arch
>> > style
>> > > re-processing
>> > >
>> > > The disadvantage is that it will use Kafka resources. But the idea is
>> > > eventually you will have multiple subscribers to any data source (at
>> > least
>> > > for monitoring) so you will end up there soon enough anyway.
>> > >
>> > > Down the road the technical benefit is that I think it gives us a good
>> > path
>> > > towards end-to-end exactly once semantics from source to destination.
>> > > Basically the connectors need to support idempotence when talking to
>> > Kafka
>> > > and we need the transactional write feature in Kafka to make the
>> > > transformation atomic. This is actually pretty doable if you separate
>> > > connector=>kafka problem from the generic transformations which are
>> > always
>> > > kafka=>kafka. However I think it is quite impossible to do in a
>> > all_things
>> > > => all_things environment. Today you can say "well the semantics of
>> the
>> > > Samza APIs depend on the connectors you use" but it is actually worse
>> > then
>> > > that because the semantics actually depend on the pairing of
>> > connectors--so
>> > > not only can you probably not get a usable "exactly once" guarantee
>> > > end-to-end it can actually be quite hard to reverse engineer what
>> > property
>> > > (if any) your end-to-end flow has if you have heterogenous systems.
>> > >
>> > > -Jay
>> > >
>> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <[email protected]>
>> wrote:
>> > >
>> > > > {quote}
>> > > > maintained in a separate repository and retaining the existing
>> > > > committership but sharing as much else as possible (website, etc)
>> > > > {quote}
>> > > >
>> > > > Overall, I agree on this idea. Now the question is more about "how
>> to
>> > do
>> > > > it".
>> > > >
>> > > > On the other hand, one thing I want to point out is that, if we
>> decide
>> > to
>> > > > go this way, how do we want to support
>> > > > otherSystem-transformation-otherSystem use case?
>> > > >
>> > > > Basically, there are four user groups here:
>> > > >
>> > > > 1. Kafka-transformation-Kafka
>> > > > 2. Kafka-transformation-otherSystem
>> > > > 3. otherSystem-transformation-Kafka
>> > > > 4. otherSystem-transformation-otherSystem
>> > > >
>> > > > For group 1, they can easily use the new Samza library to achieve.
>> For
>> > > > group 2 and 3, they can use copyCat -> transformation -> Kafka or
>> > Kafka->
>> > > > transformation -> copyCat.
>> > > >
>> > > > The problem is for group 4. Do we want to abandon this or still
>> support
>> > > it?
>> > > > Of course, this use case can be achieved by using copyCat ->
>> > > transformation
>> > > > -> Kafka -> transformation -> copyCat, the thing is how we persuade
>> > them
>> > > to
>> > > > do this long chain. If yes, it will also be a win for Kafka too. Or
>> if
>> > > > there is no one in this community actually doing this so far, maybe
>> ok
>> > to
>> > > > not support the group 4 directly.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Fang, Yan
>> > > > [email protected]
>> > > >
>> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <[email protected]>
>> wrote:
>> > > >
>> > > > > Yeah I agree with this summary. I think there are kind of two
>> > questions
>> > > > > here:
>> > > > > 1. Technically does alignment/reliance on Kafka make sense
>> > > > > 2. Branding wise (naming, website, concepts, etc) does alignment
>> with
>> > > > Kafka
>> > > > > make sense
>> > > > >
>> > > > > Personally I do think both of these things would be really
>> valuable,
>> > > and
>> > > > > would dramatically alter the trajectory of the project.
>> > > > >
>> > > > > My preference would be to see if people can mostly agree on a
>> > direction
>> > > > > rather than splintering things off. From my point of view the
>> ideal
>> > > > outcome
>> > > > > of all the options discussed would be to make Samza a closely
>> aligned
>> > > > > subproject, maintained in a separate repository and retaining the
>> > > > existing
>> > > > > committership but sharing as much else as possible (website,
>> etc). No
>> > > > idea
>> > > > > about how these things work, Jacob, you probably know more.
>> > > > >
>> > > > > No discussion amongst the Kafka folks has happened on this, but
>> > likely
>> > > we
>> > > > > should figure out what the Samza community actually wants first.
>> > > > >
>> > > > > I admit that this is a fairly radical departure from how things
>> are.
>> > > > >
>> > > > > If that doesn't fly, I think, yeah we could leave Samza as it is
>> and
>> > do
>> > > > the
>> > > > > more radical reboot inside Kafka. From my point of view that does
>> > leave
>> > > > > things in a somewhat confusing state since now there are two
>> stream
>> > > > > processing systems more or less coupled to Kafka in large part
>> made
>> > by
>> > > > the
>> > > > > same people. But, arguably that might be a cleaner way to make the
>> > > > cut-over
>> > > > > and perhaps less risky for Samza community since if it works
>> people
>> > can
>> > > > > switch and if it doesn't nothing will have changed. Dunno, how do
>> > > people
>> > > > > feel about this?
>> > > > >
>> > > > > -Jay
>> > > > >
>> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <[email protected]>
>> > > wrote:
>> > > > >
>> > > > > > >  This leads me to thinking that merging projects and
>> communities
>> > > > might
>> > > > > > be a good idea: with the union of experience from both
>> communities,
>> > > we
>> > > > > will
>> > > > > > probably build a better system that is better for users.
>> > > > > > Is this what's being proposed though? Merging the projects seems
>> > like
>> > > > > > a consequence of at most one of the three directions under
>> > > discussion:
>> > > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka
>> for
>> > > > > > configuration, etc. (to a greater or lesser extent to be
>> > determined)
>> > > > > > but the Samza community would not automatically merge withe
>> Kafka
>> > > > > > community (the Phoenix/HBase example is a good one here).
>> > > > > > 2) Samza Reboot: The Samza community continues to exist with a
>> > > limited
>> > > > > > project scope, but similarly would not need to be part of the
>> Kafka
>> > > > > > community (ie given committership) to progress.  Here, maybe the
>> > > Samza
>> > > > > > team would become a subproject of Kafka (the Board frowns on
>> > > > > > subprojects at the moment, so I'm not sure if that's even
>> > feasible),
>> > > > > > but that would not be required.
>> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the
>> > Kafka
>> > > > > > team builds its own streaming library, possibly off of Jay's
>> > > > > > prototype, which has not direct lineage to the Samza team.
>> There's
>> > > no
>> > > > > > reason for the Kafka team to bring in the Samza team.
>> > > > > >
>> > > > > > Is the Kafka community on board with this?
>> > > > > >
>> > > > > > To be clear, all three options under discussion are interesting,
>> > > > > > technically valid and likely healthy directions for the project.
>> > > > > > Also, they are not mutually exclusive.  The Samza community
>> could
>> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka community
>> went
>> > > > > > forward with 'Hey Samza!'  My points above are directed
>> entirely at
>> > > > > > the community aspect of these choices.
>> > > > > > -Jakob
>> > > > > >
>> > > > > > On 10 July 2015 at 09:10, Roger Hoover <[email protected]>
>> > > wrote:
>> > > > > > > That's great.  Thanks, Jay.
>> > > > > > >
>> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <[email protected]>
>> > > wrote:
>> > > > > > >
>> > > > > > >> Yeah totally agree. I think you have this issue even today,
>> > right?
>> > > > > I.e.
>> > > > > > if
>> > > > > > >> you need to make a simple config change and you're running in
>> > YARN
>> > > > > today
>> > > > > > >> you end up bouncing the job which then rebuilds state. I
>> think
>> > the
>> > > > fix
>> > > > > > is
>> > > > > > >> exactly what you described which is to have a long timeout on
>> > > > > partition
>> > > > > > >> movement for stateful jobs so that if a job is just getting
>> > > bounced,
>> > > > > and
>> > > > > > >> the cluster manager (or admin) is smart enough to restart it
>> on
>> > > the
>> > > > > same
>> > > > > > >> host when possible, it can optimistically reuse any existing
>> > state
>> > > > it
>> > > > > > finds
>> > > > > > >> on disk (if it is valid).
>> > > > > > >>
>> > > > > > >> So in this model the charter of the CM is to place processes
>> as
>> > > > > > stickily as
>> > > > > > >> possible and to restart or re-place failed processes. The
>> > charter
>> > > of
>> > > > > the
>> > > > > > >> partition management system is to control the assignment of
>> work
>> > > to
>> > > > > > these
>> > > > > > >> processes. The nice thing about this is that the work
>> > assignment,
>> > > > > > timeouts,
>> > > > > > >> behavior, configs, and code will all be the same across all
>> > > cluster
>> > > > > > >> managers.
>> > > > > > >>
>> > > > > > >> So I think that prototype would actually give you exactly
>> what
>> > you
>> > > > > want
>> > > > > > >> today for any cluster manager (or manual placement + restart
>> > > script)
>> > > > > > that
>> > > > > > >> was sticky in terms of host placement since there is already
>> a
>> > > > > > configurable
>> > > > > > >> partition movement timeout and task-by-task state reuse with
>> a
>> > > check
>> > > > > on
>> > > > > > >> state validity.
>> > > > > > >>
>> > > > > > >> -Jay
>> > > > > > >>
>> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <
>> > > > [email protected]
>> > > > > >
>> > > > > > >> wrote:
>> > > > > > >>
>> > > > > > >> > That would be great to let Kafka do as much heavy lifting
>> as
>> > > > > possible
>> > > > > > and
>> > > > > > >> > make it easier for other languages to implement Samza apis.
>> > > > > > >> >
>> > > > > > >> > One thing to watch out for is the interplay between Kafka's
>> > > group
>> > > > > > >> > management and the external scheduler/process manager's
>> fault
>> > > > > > tolerance.
>> > > > > > >> > If a container dies, the Kafka group membership protocol
>> will
>> > > try
>> > > > to
>> > > > > > >> assign
>> > > > > > >> > it's tasks to other containers while at the same time the
>> > > process
>> > > > > > manager
>> > > > > > >> > is trying to relaunch the container.  Without some
>> > consideration
>> > > > for
>> > > > > > this
>> > > > > > >> > (like a configurable amount of time to wait before Kafka
>> > alters
>> > > > the
>> > > > > > group
>> > > > > > >> > membership), there may be thrashing going on which is
>> > especially
>> > > > bad
>> > > > > > for
>> > > > > > >> > containers with large amounts of local state.
>> > > > > > >> >
>> > > > > > >> > Someone else pointed this out already but I thought it
>> might
>> > be
>> > > > > worth
>> > > > > > >> > calling out again.
>> > > > > > >> >
>> > > > > > >> > Cheers,
>> > > > > > >> >
>> > > > > > >> > Roger
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <
>> [email protected]>
>> > > > > wrote:
>> > > > > > >> >
>> > > > > > >> > > Hey Roger,
>> > > > > > >> > >
>> > > > > > >> > > I couldn't agree more. We spent a bunch of time talking
>> to
>> > > > people
>> > > > > > and
>> > > > > > >> > that
>> > > > > > >> > > is exactly the stuff we heard time and again. What makes
>> it
>> > > > hard,
>> > > > > of
>> > > > > > >> > > course, is that there is some tension between
>> compatibility
>> > > with
>> > > > > > what's
>> > > > > > >> > > there now and making things better for new users.
>> > > > > > >> > >
>> > > > > > >> > > I also strongly agree with the importance of
>> multi-language
>> > > > > > support. We
>> > > > > > >> > are
>> > > > > > >> > > talking now about Java, but for application development
>> use
>> > > > cases
>> > > > > > >> people
>> > > > > > >> > > want to work in whatever language they are using
>> elsewhere.
>> > I
>> > > > > think
>> > > > > > >> > moving
>> > > > > > >> > > to a model where Kafka itself does the group membership,
>> > > > lifecycle
>> > > > > > >> > control,
>> > > > > > >> > > and partition assignment has the advantage of putting all
>> > that
>> > > > > > complex
>> > > > > > >> > > stuff behind a clean api that the clients are already
>> going
>> > to
>> > > > be
>> > > > > > >> > > implementing for their consumer, so the added
>> functionality
>> > > for
>> > > > > > stream
>> > > > > > >> > > processing beyond a consumer becomes very minor.
>> > > > > > >> > >
>> > > > > > >> > > -Jay
>> > > > > > >> > >
>> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <
>> > > > > > [email protected]>
>> > > > > > >> > > wrote:
>> > > > > > >> > >
>> > > > > > >> > > > Metamorphosis...nice. :)
>> > > > > > >> > > >
>> > > > > > >> > > > This has been a great discussion.  As a user of Samza
>> > who's
>> > > > > > recently
>> > > > > > >> > > > integrated it into a relatively large organization, I
>> just
>> > > > want
>> > > > > to
>> > > > > > >> add
>> > > > > > >> > > > support to a few points already made.
>> > > > > > >> > > >
>> > > > > > >> > > > The biggest hurdles to adoption of Samza as it
>> currently
>> > > > exists
>> > > > > > that
>> > > > > > >> > I've
>> > > > > > >> > > > experienced are:
>> > > > > > >> > > > 1) YARN - YARN is overly complex in many environments
>> > where
>> > > > > Puppet
>> > > > > > >> > would
>> > > > > > >> > > do
>> > > > > > >> > > > just fine but it was the only mechanism to get fault
>> > > > tolerance.
>> > > > > > >> > > > 2) Configuration - I think I like the idea of
>> configuring
>> > > most
>> > > > > of
>> > > > > > the
>> > > > > > >> > job
>> > > > > > >> > > > in code rather than config files.  In general, I think
>> the
>> > > > goal
>> > > > > > >> should
>> > > > > > >> > be
>> > > > > > >> > > > to make it harder to make mistakes, especially of the
>> kind
>> > > > where
>> > > > > > the
>> > > > > > >> > code
>> > > > > > >> > > > expects something and the config doesn't match.  The
>> > current
>> > > > > > config
>> > > > > > >> is
>> > > > > > >> > > > quite intricate and error-prone.  For example, the
>> > > application
>> > > > > > logic
>> > > > > > >> > may
>> > > > > > >> > > > depend on bootstrapping a topic but rather than
>> asserting
>> > > that
>> > > > > in
>> > > > > > the
>> > > > > > >> > > code,
>> > > > > > >> > > > you have to rely on getting the config right.  Likewise
>> > with
>> > > > > > serdes,
>> > > > > > >> > the
>> > > > > > >> > > > Java representations produced by various serdes (JSON,
>> > Avro,
>> > > > > etc.)
>> > > > > > >> are
>> > > > > > >> > > not
>> > > > > > >> > > > equivalent so you cannot just reconfigure a serde
>> without
>> > > > > changing
>> > > > > > >> the
>> > > > > > >> > > > code.   It would be nice for jobs to be able to assert
>> > what
>> > > > they
>> > > > > > >> expect
>> > > > > > >> > > > from their input topics in terms of partitioning.
>> This is
>> > > > > > getting a
>> > > > > > >> > > little
>> > > > > > >> > > > off topic but I was even thinking about creating a
>> "Samza
>> > > > config
>> > > > > > >> > linter"
>> > > > > > >> > > > that would sanity check a set of configs.  Especially
>> in
>> > > > > > >> organizations
>> > > > > > >> > > > where config is managed by a different team than the
>> > > > application
>> > > > > > >> > > developer,
>> > > > > > >> > > > it's very hard to get avoid config mistakes.
>> > > > > > >> > > > 3) Java/Scala centric - for many teams (especially
>> > > DevOps-type
>> > > > > > >> folks),
>> > > > > > >> > > the
>> > > > > > >> > > > pain of the Java toolchain (maven, slow builds, weak
>> > command
>> > > > > line
>> > > > > > >> > > support,
>> > > > > > >> > > > configuration over convention) really inhibits
>> > productivity.
>> > > > As
>> > > > > > more
>> > > > > > >> > and
>> > > > > > >> > > > more high-quality clients become available for Kafka, I
>> > hope
>> > > > > > they'll
>> > > > > > >> > > follow
>> > > > > > >> > > > Samza's model.  Not sure how much it affects the
>> proposals
>> > > in
>> > > > > this
>> > > > > > >> > thread
>> > > > > > >> > > > but please consider other languages in the ecosystem as
>> > > well.
>> > > > > > From
>> > > > > > >> > what
>> > > > > > >> > > > I've heard, Spark has more Python users than
>> Java/Scala.
>> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > >
>> > > > > > >> >
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
>> > > > > > >> > > > and are working on a Yeoman generator
>> > > > > > >> > > > https://github.com/Quantiply/generator-rico for
>> > > Jython/Samza
>> > > > > > >> projects
>> > > > > > >> > to
>> > > > > > >> > > > alleviate some of the pain)
>> > > > > > >> > > >
>> > > > > > >> > > > I also want to underscore Jay's point about improving
>> the
>> > > user
>> > > > > > >> > > experience.
>> > > > > > >> > > > That's a very important factor for adoption.  I think
>> the
>> > > goal
>> > > > > > should
>> > > > > > >> > be
>> > > > > > >> > > to
>> > > > > > >> > > > make Samza as easy to get started with as something
>> like
>> > > > > Logstash.
>> > > > > > >> > > > Logstash is vastly inferior in terms of capabilities to
>> > > Samza
>> > > > > but
>> > > > > > >> it's
>> > > > > > >> > > easy
>> > > > > > >> > > > to get started and that makes a big difference.
>> > > > > > >> > > >
>> > > > > > >> > > > Cheers,
>> > > > > > >> > > >
>> > > > > > >> > > > Roger
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci
>> > > > Morales <
>> > > > > > >> > > > [email protected]> wrote:
>> > > > > > >> > > >
>> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka
>> Metamorphosis
>> > > is
>> > > > a
>> > > > > > clear
>> > > > > > >> > > > winner
>> > > > > > >> > > > > :)
>> > > > > > >> > > > >
>> > > > > > >> > > > > --
>> > > > > > >> > > > > Gianmarco
>> > > > > > >> > > > >
>> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci
>> Morales
>> > <
>> > > > > > >> > > [email protected]
>> > > > > > >> > > > >
>> > > > > > >> > > > > wrote:
>> > > > > > >> > > > >
>> > > > > > >> > > > > > Hi,
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > @Martin, thanks for you comments.
>> > > > > > >> > > > > > Maybe I'm missing some important point, but I think
>> > > > coupling
>> > > > > > the
>> > > > > > >> > > > releases
>> > > > > > >> > > > > > is actually a *good* thing.
>> > > > > > >> > > > > > To make an example, would it be better if the MR
>> and
>> > > HDFS
>> > > > > > >> > components
>> > > > > > >> > > of
>> > > > > > >> > > > > > Hadoop had different release schedules?
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > Actually, keeping the discussion in a single place
>> > would
>> > > > > make
>> > > > > > >> > > agreeing
>> > > > > > >> > > > on
>> > > > > > >> > > > > > releases (and backwards compatibility) much
>> easier, as
>> > > > > > everybody
>> > > > > > >> > > would
>> > > > > > >> > > > be
>> > > > > > >> > > > > > responsible for the whole codebase.
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > That said, I like the idea of absorbing samza-core
>> as
>> > a
>> > > > > > >> > sub-project,
>> > > > > > >> > > > and
>> > > > > > >> > > > > > leave the fancy stuff separate.
>> > > > > > >> > > > > > It probably gives 90% of the benefits we have been
>> > > > > discussing
>> > > > > > >> here.
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > Cheers,
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > --
>> > > > > > >> > > > > > Gianmarco
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <
>> > [email protected]
>> > > >
>> > > > > > wrote:
>> > > > > > >> > > > > >
>> > > > > > >> > > > > >> Hey Martin,
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> I agree coupling release schedules is a downside.
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> Definitely we can try to solve some of the
>> > integration
>> > > > > > problems
>> > > > > > >> in
>> > > > > > >> > > > > >> Confluent Platform or in other distributions. But
>> I
>> > > think
>> > > > > > this
>> > > > > > >> > ends
>> > > > > > >> > > up
>> > > > > > >> > > > > >> being really shallow. I guess I feel to really
>> get a
>> > > good
>> > > > > > user
>> > > > > > >> > > > > experience
>> > > > > > >> > > > > >> the two systems have to kind of feel like part of
>> the
>> > > > same
>> > > > > > thing
>> > > > > > >> > and
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> can't really add that in later--you can put both
>> in
>> > the
>> > > > > same
>> > > > > > >> > > > > downloadable
>> > > > > > >> > > > > >> tar file but it doesn't really give a very
>> cohesive
>> > > > > feeling.
>> > > > > > I
>> > > > > > >> > agree
>> > > > > > >> > > > > that
>> > > > > > >> > > > > >> ultimately any of the project stuff is as much
>> social
>> > > and
>> > > > > > naming
>> > > > > > >> > as
>> > > > > > >> > > > > >> anything else--theoretically two totally
>> independent
>> > > > > projects
>> > > > > > >> > could
>> > > > > > >> > > > work
>> > > > > > >> > > > > >> to
>> > > > > > >> > > > > >> tightly align. In practice this seems to be quite
>> > > > difficult
>> > > > > > >> > though.
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> For the frameworks--totally agree it would be
>> good to
>> > > > > > maintain
>> > > > > > >> the
>> > > > > > >> > > > > >> framework support with the project. In some cases
>> > there
>> > > > may
>> > > > > > not
>> > > > > > >> be
>> > > > > > >> > > too
>> > > > > > >> > > > > >> much
>> > > > > > >> > > > > >> there since the integration gets lighter but I
>> think
>> > > > > whatever
>> > > > > > >> > stubs
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> need should be included. So no I definitely wasn't
>> > > trying
>> > > > > to
>> > > > > > >> imply
>> > > > > > >> > > > > >> dropping
>> > > > > > >> > > > > >> support for these frameworks, just making the
>> > > integration
>> > > > > > >> lighter
>> > > > > > >> > by
>> > > > > > >> > > > > >> separating process management from partition
>> > > management.
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> You raise two good points we would have to figure
>> out
>> > > if
>> > > > we
>> > > > > > went
>> > > > > > >> > > down
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> alignment path:
>> > > > > > >> > > > > >> 1. With respect to the name, yeah I think the
>> first
>> > > > > question
>> > > > > > is
>> > > > > > >> > > > whether
>> > > > > > >> > > > > >> some "re-branding" would be worth it. If so then I
>> > > think
>> > > > we
>> > > > > > can
>> > > > > > >> > > have a
>> > > > > > >> > > > > big
>> > > > > > >> > > > > >> thread on the name. I'm definitely not set on
>> Kafka
>> > > > > > Streaming or
>> > > > > > >> > > Kafka
>> > > > > > >> > > > > >> Streams I was just using them to be kind of
>> > > > illustrative. I
>> > > > > > >> agree
>> > > > > > >> > > with
>> > > > > > >> > > > > >> your
>> > > > > > >> > > > > >> critique of these names, though I think people
>> would
>> > > get
>> > > > > the
>> > > > > > >> idea.
>> > > > > > >> > > > > >> 2. Yeah you also raise a good point about how to
>> > > "factor"
>> > > > > it.
>> > > > > > >> Here
>> > > > > > >> > > are
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> options I see (I could get enthusiastic about any
>> of
>> > > > them):
>> > > > > > >> > > > > >>    a. One repo for both Kafka and Samza
>> > > > > > >> > > > > >>    b. Two repos, retaining the current seperation
>> > > > > > >> > > > > >>    c. Two repos, the equivalent of samza-api and
>> > > > samza-core
>> > > > > > is
>> > > > > > >> > > > absorbed
>> > > > > > >> > > > > >> almost like a third client
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> Cheers,
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> -Jay
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <
>> > > > > > >> > > > [email protected]>
>> > > > > > >> > > > > >> wrote:
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few
>> > > follow-up
>> > > > > > >> > comments.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka or
>> > becoming
>> > > a
>> > > > > > >> > subproject:
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > reasons you mention are good. The risk I see is
>> > that
>> > > > > > release
>> > > > > > >> > > > schedules
>> > > > > > >> > > > > >> > become coupled to each other, which can slow
>> > everyone
>> > > > > down,
>> > > > > > >> and
>> > > > > > >> > > > large
>> > > > > > >> > > > > >> > projects with many contributors are harder to
>> > manage.
>> > > > > > (Jakob,
>> > > > > > >> > can
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> speak
>> > > > > > >> > > > > >> > from experience, having seen a wider range of
>> > Hadoop
>> > > > > > ecosystem
>> > > > > > >> > > > > >> projects?)
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > Some of the goals of a better unified developer
>> > > > > experience
>> > > > > > >> could
>> > > > > > >> > > > also
>> > > > > > >> > > > > be
>> > > > > > >> > > > > >> > solved by integrating Samza nicely into a Kafka
>> > > > > > distribution
>> > > > > > >> > (such
>> > > > > > >> > > > as
>> > > > > > >> > > > > >> > Confluent's). I'm not against merging projects
>> if
>> > we
>> > > > > decide
>> > > > > > >> > that's
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> way
>> > > > > > >> > > > > >> > to go, just pointing out the same goals can
>> perhaps
>> > > > also
>> > > > > be
>> > > > > > >> > > achieved
>> > > > > > >> > > > > in
>> > > > > > >> > > > > >> > other ways.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > - With regard to dropping the YARN dependency:
>> are
>> > > you
>> > > > > > >> proposing
>> > > > > > >> > > > that
>> > > > > > >> > > > > >> > Samza doesn't give any help to people wanting to
>> > run
>> > > on
>> > > > > > >> > > > > >> YARN/Mesos/AWS/etc?
>> > > > > > >> > > > > >> > So the docs would basically have a link to
>> Slider
>> > and
>> > > > > > nothing
>> > > > > > >> > > else?
>> > > > > > >> > > > Or
>> > > > > > >> > > > > >> > would we maintain integrations with a bunch of
>> > > popular
>> > > > > > >> > deployment
>> > > > > > >> > > > > >> methods
>> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to
>> make
>> > > > Samza
>> > > > > > work
>> > > > > > >> > with
>> > > > > > >> > > > > >> Slider)?
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > I absolutely think it's a good idea to have the
>> > "as a
>> > > > > > library"
>> > > > > > >> > and
>> > > > > > >> > > > > "as a
>> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for
>> people
>> > who
>> > > > > want
>> > > > > > >> them,
>> > > > > > >> > > > but I
>> > > > > > >> > > > > >> > think there should also be a low-friction path
>> for
>> > > > common
>> > > > > > "as
>> > > > > > >> a
>> > > > > > >> > > > > service"
>> > > > > > >> > > > > >> > deployment methods, for which we probably need
>> to
>> > > > > maintain
>> > > > > > >> > > > > integrations.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to
>> me,
>> > > > > because
>> > > > > > >> Kafka
>> > > > > > >> > > is
>> > > > > > >> > > > > all
>> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka
>> Transformers"
>> > > or
>> > > > > > "Kafka
>> > > > > > >> > > > Filters"
>> > > > > > >> > > > > >> > would be more apt?
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza
>> (stream
>> > > > > > >> transformation
>> > > > > > >> > > > with
>> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a
>> library"
>> > > bit)
>> > > > > > could
>> > > > > > >> > > become
>> > > > > > >> > > > > >> part of
>> > > > > > >> > > > > >> > Kafka, while higher-level tools such as
>> streaming
>> > SQL
>> > > > and
>> > > > > > >> > > > integrations
>> > > > > > >> > > > > >> with
>> > > > > > >> > > > > >> > deployment frameworks remain in a separate
>> project?
>> > > In
>> > > > > > other
>> > > > > > >> > > words,
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > would absorb the proven, stable core of Samza,
>> > which
>> > > > > would
>> > > > > > >> > become
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > "third Kafka client" mentioned early in this
>> > thread.
>> > > > The
>> > > > > > Samza
>> > > > > > >> > > > project
>> > > > > > >> > > > > >> > would then target that third Kafka client as its
>> > base
>> > > > > API,
>> > > > > > and
>> > > > > > >> > the
>> > > > > > >> > > > > >> project
>> > > > > > >> > > > > >> > would be freed up to explore more experimental
>> new
>> > > > > > horizons.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > Martin
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
>> > > > [email protected]>
>> > > > > > >> wrote:
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > > Hey Martin,
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually
>> > don't
>> > > > > think
>> > > > > > it
>> > > > > > >> > ties
>> > > > > > >> > > > our
>> > > > > > >> > > > > >> > hands
>> > > > > > >> > > > > >> > > at all, all it does is refactor things. The
>> > > division
>> > > > of
>> > > > > > >> > > > > >> responsibility is
>> > > > > > >> > > > > >> > > that Samza core is responsible for task
>> > lifecycle,
>> > > > > state,
>> > > > > > >> and
>> > > > > > >> > > > > >> partition
>> > > > > > >> > > > > >> > > management (using the Kafka co-ordinator) but
>> it
>> > is
>> > > > NOT
>> > > > > > >> > > > responsible
>> > > > > > >> > > > > >> for
>> > > > > > >> > > > > >> > > packaging, configuration deployment or
>> execution
>> > of
>> > > > > > >> processes.
>> > > > > > >> > > The
>> > > > > > >> > > > > >> > problem
>> > > > > > >> > > > > >> > > of packaging and starting these processes is
>> > > > > > >> > > > > >> > > framework/environment-specific. This leaves
>> > > > individual
>> > > > > > >> > > frameworks
>> > > > > > >> > > > to
>> > > > > > >> > > > > >> be
>> > > > > > >> > > > > >> > as
>> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you can get
>> > > simple
>> > > > > > >> stateless
>> > > > > > >> > > > > >> support in
>> > > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app
>> > > > > framework
>> > > > > > >> > > (Slider,
>> > > > > > >> > > > > >> > Marathon,
>> > > > > > >> > > > > >> > > etc). These are well known by people and have
>> > nice
>> > > > UIs
>> > > > > > and a
>> > > > > > >> > lot
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > > flexibility. I don't think they have node
>> > affinity
>> > > > as a
>> > > > > > >> built
>> > > > > > >> > in
>> > > > > > >> > > > > >> option
>> > > > > > >> > > > > >> > > (though I could be wrong). So if we want that
>> we
>> > > can
>> > > > > > either
>> > > > > > >> > wait
>> > > > > > >> > > > for
>> > > > > > >> > > > > >> them
>> > > > > > >> > > > > >> > > to add it or do a custom framework to add that
>> > > > feature
>> > > > > > (as
>> > > > > > >> > now).
>> > > > > > >> > > > > >> > Obviously
>> > > > > > >> > > > > >> > > if you manage things with old-school ops tools
>> > > > > > >> > (puppet/chef/etc)
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> get
>> > > > > > >> > > > > >> > > locality easily. The nice thing, though, is
>> that
>> > > all
>> > > > > the
>> > > > > > >> samza
>> > > > > > >> > > > > >> "business
>> > > > > > >> > > > > >> > > logic" around partition management and fault
>> > > > tolerance
>> > > > > > is in
>> > > > > > >> > > Samza
>> > > > > > >> > > > > >> core
>> > > > > > >> > > > > >> > so
>> > > > > > >> > > > > >> > > it is shared across frameworks and the
>> framework
>> > > > > specific
>> > > > > > >> bit
>> > > > > > >> > is
>> > > > > > >> > > > > just
>> > > > > > >> > > > > >> > > whether it is smart enough to try to get the
>> same
>> > > > host
>> > > > > > when
>> > > > > > >> a
>> > > > > > >> > > job
>> > > > > > >> > > > is
>> > > > > > >> > > > > >> > > restarted.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I
>> think
>> > > the
>> > > > > > goal
>> > > > > > >> > would
>> > > > > > >> > > > be
>> > > > > > >> > > > > >> (a)
>> > > > > > >> > > > > >> > > actually get better alignment in user
>> experience,
>> > > and
>> > > > > (b)
>> > > > > > >> > > express
>> > > > > > >> > > > > >> this in
>> > > > > > >> > > > > >> > > the naming and project branding. Specifically:
>> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the
>> > > > > > "transformation"
>> > > > > > >> api
>> > > > > > >> > > to
>> > > > > > >> > > > be
>> > > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be
>> able
>> > > to
>> > > > > > explain
>> > > > > > >> > > when
>> > > > > > >> > > > to
>> > > > > > >> > > > > >> use
>> > > > > > >> > > > > >> > > the consumer and when to use the stream
>> > processing
>> > > > > > >> > functionality
>> > > > > > >> > > > and
>> > > > > > >> > > > > >> lead
>> > > > > > >> > > > > >> > > people into that experience.
>> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2
>> (or
>> > > > > > whatever)
>> > > > > > >> > that
>> > > > > > >> > > > has
>> > > > > > >> > > > > >> both
>> > > > > > >> > > > > >> > > Kafka and the stream processing part and they
>> > > > actually
>> > > > > > work
>> > > > > > >> > > > > together.
>> > > > > > >> > > > > >> > > 3. Unify the programming experience so the
>> client
>> > > and
>> > > > > > Samza
>> > > > > > >> > api
>> > > > > > >> > > > > share
>> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > I think sub-projects keep separate committers
>> and
>> > > can
>> > > > > > have a
>> > > > > > >> > > > > separate
>> > > > > > >> > > > > >> > repo,
>> > > > > > >> > > > > >> > > but I'm actually not really sure (I can't
>> find a
>> > > > > > definition
>> > > > > > >> > of a
>> > > > > > >> > > > > >> > subproject
>> > > > > > >> > > > > >> > > in Apache).
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > Basically at a high-level you want the
>> experience
>> > > to
>> > > > > > "feel"
>> > > > > > >> > > like a
>> > > > > > >> > > > > >> single
>> > > > > > >> > > > > >> > > system, not to relatively independent things
>> that
>> > > are
>> > > > > > kind
>> > > > > > >> of
>> > > > > > >> > > > > >> awkwardly
>> > > > > > >> > > > > >> > > glued together.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > I think if we did that they having naming or
>> > > branding
>> > > > > > like
>> > > > > > >> > > "kafka
>> > > > > > >> > > > > >> > > streaming" or "kafka streams" or something
>> like
>> > > that
>> > > > > > would
>> > > > > > >> > > > actually
>> > > > > > >> > > > > >> do a
>> > > > > > >> > > > > >> > > good job of conveying what it is. I do that
>> this
>> > > > would
>> > > > > > help
>> > > > > > >> > > > adoption
>> > > > > > >> > > > > >> > quite
>> > > > > > >> > > > > >> > > a lot as it would correctly convey that using
>> > Kafka
>> > > > > > >> Streaming
>> > > > > > >> > > with
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > is
>> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka is
>> pretty
>> > > > > heavily
>> > > > > > >> > adopted
>> > > > > > >> > > > at
>> > > > > > >> > > > > >> this
>> > > > > > >> > > > > >> > > point.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > Fwiw we actually considered this model
>> originally
>> > > > when
>> > > > > > open
>> > > > > > >> > > > sourcing
>> > > > > > >> > > > > >> > Samza,
>> > > > > > >> > > > > >> > > however at that time Kafka was relatively
>> unknown
>> > > and
>> > > > > we
>> > > > > > >> > decided
>> > > > > > >> > > > not
>> > > > > > >> > > > > >> to
>> > > > > > >> > > > > >> > do
>> > > > > > >> > > > > >> > > it since we felt it would be limiting. From my
>> > > point
>> > > > of
>> > > > > > view
>> > > > > > >> > the
>> > > > > > >> > > > > three
>> > > > > > >> > > > > >> > > things have changed (1) Kafka is now really
>> > heavily
>> > > > > used
>> > > > > > for
>> > > > > > >> > > > stream
>> > > > > > >> > > > > >> > > processing, (2) we learned that abstracting
>> out
>> > the
>> > > > > > stream
>> > > > > > >> > well
>> > > > > > >> > > is
>> > > > > > >> > > > > >> > > basically impossible, (3) we learned it is
>> really
>> > > > hard
>> > > > > to
>> > > > > > >> keep
>> > > > > > >> > > the
>> > > > > > >> > > > > two
>> > > > > > >> > > > > >> > > things feeling like a single product.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > -Jay
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin
>> Kleppmann
>> > <
>> > > > > > >> > > > > >> [email protected]>
>> > > > > > >> > > > > >> > > wrote:
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > >> Hi all,
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Lots of good thoughts here.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> I agree with the general philosophy of tying
>> > Samza
>> > > > > more
>> > > > > > >> > firmly
>> > > > > > >> > > to
>> > > > > > >> > > > > >> Kafka.
>> > > > > > >> > > > > >> > >> After I spent a while looking at integrating
>> > other
>> > > > > > message
>> > > > > > >> > > > brokers
>> > > > > > >> > > > > >> (e.g.
>> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the
>> > > > conclusion
>> > > > > > that
>> > > > > > >> > > > > >> > SystemConsumer
>> > > > > > >> > > > > >> > >> tacitly assumes a model so much like Kafka's
>> > that
>> > > > > pretty
>> > > > > > >> much
>> > > > > > >> > > > > nobody
>> > > > > > >> > > > > >> but
>> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus is
>> > perhaps
>> > > an
>> > > > > > >> > exception,
>> > > > > > >> > > > but
>> > > > > > >> > > > > >> it
>> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.) Thus,
>> > > making
>> > > > > > Samza
>> > > > > > >> > > fully
>> > > > > > >> > > > > >> > dependent
>> > > > > > >> > > > > >> > >> on Kafka acknowledges that the
>> > system-independence
>> > > > was
>> > > > > > >> never
>> > > > > > >> > as
>> > > > > > >> > > > > real
>> > > > > > >> > > > > >> as
>> > > > > > >> > > > > >> > we
>> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of code
>> > reuse
>> > > > are
>> > > > > > >> real.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has
>> also
>> > > > always
>> > > > > > been
>> > > > > > >> > > > > >> appealing to
>> > > > > > >> > > > > >> > >> me, for various reasons already mentioned in
>> > this
>> > > > > > thread.
>> > > > > > >> > > > Although
>> > > > > > >> > > > > >> > making
>> > > > > > >> > > > > >> > >> Samza jobs deployable on anything
>> > > > (YARN/Mesos/AWS/etc)
>> > > > > > >> seems
>> > > > > > >> > > > > >> laudable,
>> > > > > > >> > > > > >> > I am
>> > > > > > >> > > > > >> > >> a little concerned that it will restrict us
>> to a
>> > > > > lowest
>> > > > > > >> > common
>> > > > > > >> > > > > >> > denominator.
>> > > > > > >> > > > > >> > >> For example, would host affinity (SAMZA-617)
>> > still
>> > > > be
>> > > > > > >> > possible?
>> > > > > > >> > > > For
>> > > > > > >> > > > > >> jobs
>> > > > > > >> > > > > >> > >> with large amounts of state, I think
>> SAMZA-617
>> > > would
>> > > > > be
>> > > > > > a
>> > > > > > >> big
>> > > > > > >> > > > boon,
>> > > > > > >> > > > > >> > since
>> > > > > > >> > > > > >> > >> restoring state off the changelog on every
>> > single
>> > > > > > restart
>> > > > > > >> is
>> > > > > > >> > > > > painful,
>> > > > > > >> > > > > >> > due
>> > > > > > >> > > > > >> > >> to long recovery times. It would be a shame
>> if
>> > the
>> > > > > > >> decoupling
>> > > > > > >> > > > from
>> > > > > > >> > > > > >> YARN
>> > > > > > >> > > > > >> > >> made host affinity impossible.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Jay, a question about the proposed API for
>> > > > > > instantiating a
>> > > > > > >> > job
>> > > > > > >> > > in
>> > > > > > >> > > > > >> code
>> > > > > > >> > > > > >> > >> (rather than a properties file): when
>> > submitting a
>> > > > job
>> > > > > > to a
>> > > > > > >> > > > > cluster,
>> > > > > > >> > > > > >> is
>> > > > > > >> > > > > >> > the
>> > > > > > >> > > > > >> > >> idea that the instantiation code runs on a
>> > client
>> > > > > > >> somewhere,
>> > > > > > >> > > > which
>> > > > > > >> > > > > >> then
>> > > > > > >> > > > > >> > >> pokes the necessary endpoints on
>> > > YARN/Mesos/AWS/etc?
>> > > > > Or
>> > > > > > >> does
>> > > > > > >> > > that
>> > > > > > >> > > > > >> code
>> > > > > > >> > > > > >> > run
>> > > > > > >> > > > > >> > >> on each container that is part of the job (in
>> > > which
>> > > > > > case,
>> > > > > > >> how
>> > > > > > >> > > > does
>> > > > > > >> > > > > >> the
>> > > > > > >> > > > > >> > job
>> > > > > > >> > > > > >> > >> submission to the cluster work)?
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel
>> right to
>> > > > make
>> > > > > a
>> > > > > > 1.0
>> > > > > > >> > > > release
>> > > > > > >> > > > > >> > with a
>> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete. So if
>> > this
>> > > > is
>> > > > > > going
>> > > > > > >> > to
>> > > > > > >> > > > > >> happen, I
>> > > > > > >> > > > > >> > >> think it would be more honest to stick with
>> 0.*
>> > > > > version
>> > > > > > >> > numbers
>> > > > > > >> > > > > until
>> > > > > > >> > > > > >> > the
>> > > > > > >> > > > > >> > >> library-ified Samza has been implemented, is
>> > > stable
>> > > > > and
>> > > > > > >> > widely
>> > > > > > >> > > > > used.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of
>> Kafka?
>> > > There
>> > > > > is
>> > > > > > >> > > precedent
>> > > > > > >> > > > > for
>> > > > > > >> > > > > >> > >> tight coupling between different Apache
>> projects
>> > > > (e.g.
>> > > > > > >> > Curator
>> > > > > > >> > > > and
>> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think
>> > > remaining
>> > > > > > >> separate
>> > > > > > >> > > > would
>> > > > > > >> > > > > >> be
>> > > > > > >> > > > > >> > ok.
>> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka,
>> there
>> > > is
>> > > > > > enough
>> > > > > > >> > > > > substance
>> > > > > > >> > > > > >> in
>> > > > > > >> > > > > >> > >> Samza that it warrants being a separate
>> project.
>> > > An
>> > > > > > >> argument
>> > > > > > >> > in
>> > > > > > >> > > > > >> favour
>> > > > > > >> > > > > >> > of
>> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a much
>> > > > stronger
>> > > > > > >> "brand
>> > > > > > >> > > > > >> presence"
>> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If
>> the
>> > > Kafka
>> > > > > > >> project
>> > > > > > >> > is
>> > > > > > >> > > > > >> willing
>> > > > > > >> > > > > >> > to
>> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of doing
>> > > > stateful
>> > > > > > >> stream
>> > > > > > >> > > > > >> > >> transformations, that would probably have
>> much
>> > the
>> > > > > same
>> > > > > > >> > effect
>> > > > > > >> > > as
>> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream
>> Processors"
>> > or
>> > > > > > suchlike.
>> > > > > > >> > > Close
>> > > > > > >> > > > > >> > >> collaboration between the two projects will
>> be
>> > > > needed
>> > > > > in
>> > > > > > >> any
>> > > > > > >> > > > case.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> From a project management perspective, I
>> guess
>> > the
>> > > > > "new
>> > > > > > >> > Samza"
>> > > > > > >> > > > > would
>> > > > > > >> > > > > >> > have
>> > > > > > >> > > > > >> > >> to be developed on a branch alongside ongoing
>> > > > > > maintenance
>> > > > > > >> of
>> > > > > > >> > > the
>> > > > > > >> > > > > >> current
>> > > > > > >> > > > > >> > >> line of development? I think it would be
>> > important
>> > > > to
>> > > > > > >> > continue
>> > > > > > >> > > > > >> > supporting
>> > > > > > >> > > > > >> > >> existing users, and provide a graceful
>> migration
>> > > > path
>> > > > > to
>> > > > > > >> the
>> > > > > > >> > > new
>> > > > > > >> > > > > >> > version.
>> > > > > > >> > > > > >> > >> Leaving the current versions unsupported and
>> > > forcing
>> > > > > > people
>> > > > > > >> > to
>> > > > > > >> > > > > >> rewrite
>> > > > > > >> > > > > >> > >> their jobs would send a bad signal.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Best,
>> > > > > > >> > > > > >> > >> Martin
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <
>> > > > [email protected]>
>> > > > > > >> wrote:
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >>> Hey Garry,
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy
>> to
>> > > chat
>> > > > > > more
>> > > > > > >> > about
>> > > > > > >> > > > > this
>> > > > > > >> > > > > >> if
>> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I
>> > started
>> > > > with
>> > > > > > the
>> > > > > > >> > idea
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> "what
>> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass
>> > ingestion
>> > > > > tool"
>> > > > > > but
>> > > > > > >> > > > > >> ultimately
>> > > > > > >> > > > > >> > we
>> > > > > > >> > > > > >> > >>> kind of came around to the idea that
>> ingestion
>> > > and
>> > > > > > >> > > > transformation
>> > > > > > >> > > > > >> had
>> > > > > > >> > > > > >> > >>> pretty different needs and coupling the two
>> > made
>> > > > > things
>> > > > > > >> > hard.
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> For what it's worth I think copycat (KIP-26)
>> > > > actually
>> > > > > > will
>> > > > > > >> > do
>> > > > > > >> > > > what
>> > > > > > >> > > > > >> you
>> > > > > > >> > > > > >> > >> are
>> > > > > > >> > > > > >> > >>> looking for.
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> With regard to your point about slider, I
>> don't
>> > > > > > >> necessarily
>> > > > > > >> > > > > >> disagree.
>> > > > > > >> > > > > >> > >> But I
>> > > > > > >> > > > > >> > >>> think getting good YARN support is quite
>> doable
>> > > > and I
>> > > > > > >> think
>> > > > > > >> > we
>> > > > > > >> > > > can
>> > > > > > >> > > > > >> make
>> > > > > > >> > > > > >> > >>> that work well. I think the issue this
>> proposal
>> > > > > solves
>> > > > > > is
>> > > > > > >> > that
>> > > > > > >> > > > > >> > >> technically
>> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple
>> cluster
>> > > > > > management
>> > > > > > >> > > systems
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > way
>> > > > > > >> > > > > >> > >>> things are now, you need to write an "app
>> > master"
>> > > > or
>> > > > > > >> > > "framework"
>> > > > > > >> > > > > for
>> > > > > > >> > > > > >> > each
>> > > > > > >> > > > > >> > >>> and they are all a little different so
>> testing
>> > is
>> > > > > > really
>> > > > > > >> > hard.
>> > > > > > >> > > > In
>> > > > > > >> > > > > >> the
>> > > > > > >> > > > > >> > >>> absence of this we have been stuck with just
>> > YARN
>> > > > > which
>> > > > > > >> has
>> > > > > > >> > > > > >> fantastic
>> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the org,
>> but
>> > > > zero
>> > > > > > >> > > penetration
>> > > > > > >> > > > > >> > >> elsewhere.
>> > > > > > >> > > > > >> > >>> Given the huge amount of work being put in
>> to
>> > > > slider,
>> > > > > > >> > > marathon,
>> > > > > > >> > > > > aws
>> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen related
>> > > > packaging
>> > > > > > >> > > > technologies
>> > > > > > >> > > > > >> > people
>> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various
>> > > > > cloud-specific
>> > > > > > >> > deploy
>> > > > > > >> > > > > >> tools,
>> > > > > > >> > > > > >> > >> etc)
>> > > > > > >> > > > > >> > >>> I really think it is important to get this
>> > right.
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> -Jay
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry
>> > Turkington
>> > > <
>> > > > > > >> > > > > >> > >>> [email protected]> wrote:
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>>> Hi all,
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> I think the question below re does Samza
>> > become
>> > > a
>> > > > > > >> > sub-project
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > >>>> highlights the broader point around
>> migration.
>> > > > Chris
>> > > > > > >> > mentions
>> > > > > > >> > > > > >> Samza's
>> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1 release
>> but
>> > I'm
>> > > > not
>> > > > > > sure
>> > > > > > >> > it
>> > > > > > >> > > > > feels
>> > > > > > >> > > > > >> > >> right to
>> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to
>> deprecate
>> > > > most
>> > > > > of
>> > > > > > >> it.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some guys
>> > who
>> > > > have
>> > > > > > >> > started
>> > > > > > >> > > > > >> working
>> > > > > > >> > > > > >> > >> with
>> > > > > > >> > > > > >> > >>>> Samza and building some new
>> > consumers/producers
>> > > > was
>> > > > > > next
>> > > > > > >> > up.
>> > > > > > >> > > > > Sounds
>> > > > > > >> > > > > >> > like
>> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to
>> go. I
>> > > need
>> > > > > to
>> > > > > > >> look
>> > > > > > >> > > into
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > KIP
>> > > > > > >> > > > > >> > >> in
>> > > > > > >> > > > > >> > >>>> more detail but for me the attractiveness
>> of
>> > > > adding
>> > > > > > new
>> > > > > > >> > Samza
>> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all they
>> > were
>> > > > > doing
>> > > > > > was
>> > > > > > >> > > > really
>> > > > > > >> > > > > >> > getting
>> > > > > > >> > > > > >> > >>>> data into and out of Kafka --  was to avoid
>> > > > having
>> > > > > to
>> > > > > > >> > worry
>> > > > > > >> > > > > about
>> > > > > > >> > > > > >> the
>> > > > > > >> > > > > >> > >>>> lifecycle management of external clients.
>> If
>> > > there
>> > > > > is
>> > > > > > a
>> > > > > > >> > > generic
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a new
>> > > > connector
>> > > > > > into
>> > > > > > >> > and
>> > > > > > >> > > > > have
>> > > > > > >> > > > > >> a
>> > > > > > >> > > > > >> > >> lot of
>> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and reliability
>> > done
>> > > > for
>> > > > > me
>> > > > > > >> then
>> > > > > > >> > > it
>> > > > > > >> > > > > >> gives
>> > > > > > >> > > > > >> > me
>> > > > > > >> > > > > >> > >> all
>> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers would.
>> If
>> > > not
>> > > > > > then it
>> > > > > > >> > > > > >> complicates
>> > > > > > >> > > > > >> > my
>> > > > > > >> > > > > >> > >>>> operational deployments.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Which is similar to my other question with
>> the
>> > > > > > proposal
>> > > > > > >> --
>> > > > > > >> > if
>> > > > > > >> > > > we
>> > > > > > >> > > > > >> > build a
>> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus the
>> > > > requisite
>> > > > > > >> shims
>> > > > > > >> > to
>> > > > > > >> > > > > >> > integrate
>> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former may
>> be a
>> > > lot
>> > > > > more
>> > > > > > >> work
>> > > > > > >> > > > than
>> > > > > > >> > > > > we
>> > > > > > >> > > > > >> > >> think.
>> > > > > > >> > > > > >> > >>>> We may make it much easier for a newcomer
>> to
>> > get
>> > > > > > >> something
>> > > > > > >> > > > > running
>> > > > > > >> > > > > >> but
>> > > > > > >> > > > > >> > >>>> having them step up and get a reliable
>> > > production
>> > > > > > >> > deployment
>> > > > > > >> > > > may
>> > > > > > >> > > > > >> still
>> > > > > > >> > > > > >> > >>>> dominate mailing list  traffic, if for
>> > different
>> > > > > > reasons
>> > > > > > >> > than
>> > > > > > >> > > > > >> today.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with
>> > > making
>> > > > > the
>> > > > > > >> Samza
>> > > > > > >> > > > > >> dependency
>> > > > > > >> > > > > >> > >> on
>> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely
>> see
>> > > the
>> > > > > > >> benefits
>> > > > > > >> > > in
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing
>> > > > > > >> > > > terminologies/abstractions
>> > > > > > >> > > > > >> that
>> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library
>> would
>> > > > likely
>> > > > > > be a
>> > > > > > >> > very
>> > > > > > >> > > > > nice
>> > > > > > >> > > > > >> > tool
>> > > > > > >> > > > > >> > >> to
>> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the
>> > > > concerns
>> > > > > > >> above
>> > > > > > >> > re
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > >>>> operational side.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Garry
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> -----Original Message-----
>> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales
>> [mailto:
>> > > > > > >> > [email protected]
>> > > > > > >> > > ]
>> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
>> > > > > > >> > > > > >> > >>>> To: [email protected]
>> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on
>> > Samza
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Very interesting thoughts.
>> > > > > > >> > > > > >> > >>>> From outside, I have always perceived Samza
>> > as a
>> > > > > > >> computing
>> > > > > > >> > > > layer
>> > > > > > >> > > > > >> over
>> > > > > > >> > > > > >> > >>>> Kafka.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is
>> > > "should
>> > > > > > Samza
>> > > > > > >> be
>> > > > > > >> > a
>> > > > > > >> > > > > >> > sub-project
>> > > > > > >> > > > > >> > >>>> of Kafka then?"
>> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a
>> separate
>> > > > > project
>> > > > > > >> > with a
>> > > > > > >> > > > > >> separate
>> > > > > > >> > > > > >> > >>>> governance?
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Cheers,
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> --
>> > > > > > >> > > > > >> > >>>> Gianmarco
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <
>> > > > > > [email protected]>
>> > > > > > >> > > > wrote:
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more
>> > > > tightly.
>> > > > > > >> > Because
>> > > > > > >> > > > > Samza
>> > > > > > >> > > > > >> de
>> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should
>> > leverage
>> > > > > what
>> > > > > > >> Kafka
>> > > > > > >> > > > has.
>> > > > > > >> > > > > At
>> > > > > > >> > > > > >> > the
>> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to reinvent
>> > what
>> > > > > Samza
>> > > > > > >> > > already
>> > > > > > >> > > > > >> has. I
>> > > > > > >> > > > > >> > >>>>> also like the idea of separating the
>> > ingestion
>> > > > and
>> > > > > > >> > > > > transformation.
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to
>> image
>> > > how
>> > > > > the
>> > > > > > >> Samza
>> > > > > > >> > > > will
>> > > > > > >> > > > > >> look
>> > > > > > >> > > > > >> > >>>> like.
>> > > > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little
>> > > difference
>> > > > > in
>> > > > > > >> terms
>> > > > > > >> > > of
>> > > > > > >> > > > > how
>> > > > > > >> > > > > >> > >>>>> Samza should look like.
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code
>> shows
>> > (A
>> > > > > > client of
>> > > > > > >> > > > Kakfa)
>> > > > > > >> > > > > ?
>> > > > > > >> > > > > >> And
>> > > > > > >> > > > > >> > >>>>> user's application code calls this client?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka
>> > (like
>> > > > > what
>> > > > > > the
>> > > > > > >> > > code
>> > > > > > >> > > > > >> shows),
>> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and
>> > > > > fault-tolerance?
>> > > > > > >> Are
>> > > > > > >> > > they
>> > > > > > >> > > > > >> taken
>> > > > > > >> > > > > >> > >>>>> care by the Kafka broker or other
>> mechanism,
>> > > such
>> > > > > as
>> > > > > > >> > "Samza
>> > > > > > >> > > > > >> worker"
>> > > > > > >> > > > > >> > >>>>> (just make up the name) ?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as
>> > > > auto-scaling,
>> > > > > > >> shared
>> > > > > > >> > > > > state,
>> > > > > > >> > > > > >> > >>>>> monitoring?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is this
>> > what
>> > > > > Chris
>> > > > > > >> > > > suggests?)
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from Kakfa
>> > and
>> > > > > > produce
>> > > > > > >> to
>> > > > > > >> > > it.
>> > > > > > >> > > > > >> Then it
>> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks like
>> > now,
>> > > > > > except it
>> > > > > > >> > > does
>> > > > > > >> > > > > not
>> > > > > > >> > > > > >> > rely
>> > > > > > >> > > > > >> > >>>>> on Yarn anymore.
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it
>> leverage
>> > > > Kafka's
>> > > > > > >> > metrics,
>> > > > > > >> > > > > logs,
>> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> Thanks,
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> Fang, Yan
>> > > > > > >> > > > > >> > >>>>> [email protected]
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang
>> > Wang <
>> > > > > > >> > > > > [email protected]
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > >>>> wrote:
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>>> Read through the code example and it
>> looks
>> > > good
>> > > > to
>> > > > > > me.
>> > > > > > >> A
>> > > > > > >> > > few
>> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment:
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable
>> runnable
>> > > like:
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh
>> > > --config-factory=...
>> > > > > > >> > > > > >> > >>>> --config-path=file://...
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> And this proposal advocate for deploying
>> > Samza
>> > > > > more
>> > > > > > as
>> > > > > > >> > > > embedded
>> > > > > > >> > > > > >> > >>>>>> libraries in user application code
>> (ignoring
>> > > the
>> > > > > > >> > > terminology
>> > > > > > >> > > > > >> since
>> > > > > > >> > > > > >> > >>>>>> it is not the
>> > > > > > >> > > > > >> > >>>>> same
>> > > > > > >> > > > > >> > >>>>>> as the prototype code):
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> StreamTask task = new
>> MyStreamTask(configs);
>> > > > > Thread
>> > > > > > >> > thread
>> > > > > > >> > > =
>> > > > > > >> > > > > new
>> > > > > > >> > > > > >> > >>>>>> Thread(task); thread.start();
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> I think both of these deployment modes
>> are
>> > > > > important
>> > > > > > >> for
>> > > > > > >> > > > > >> different
>> > > > > > >> > > > > >> > >>>>>> types
>> > > > > > >> > > > > >> > >>>>> of
>> > > > > > >> > > > > >> > >>>>>> users. That said, I think making Samza
>> > purely
>> > > > > > >> standalone
>> > > > > > >> > is
>> > > > > > >> > > > > still
>> > > > > > >> > > > > >> > >>>>>> sufficient for either runnable or library
>> > > modes.
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> Guozhang
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay
>> Kreps
>> > <
>> > > > > > >> > > > [email protected]>
>> > > > > > >> > > > > >> > wrote:
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code
>> example,
>> > it
>> > > > was
>> > > > > > >> > supposed
>> > > > > > >> > > > to
>> > > > > > >> > > > > >> look
>> > > > > > >> > > > > >> > >>>>>>> like
>> > > > > > >> > > > > >> > >>>>>>> this:
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties();
>> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers",
>> > > > "localhost:4242");
>> > > > > > >> > > > > >> StreamingConfig
>> > > > > > >> > > > > >> > >>>>>>> config = new StreamingConfig(props);
>> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
>> > > > "test-topic-2");
>> > > > > > >> > > > > >> > >>>>>>>
>> > > config.processor(ExampleStreamProcessor.class);
>> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
>> > StringSerializer(),
>> > > > new
>> > > > > > >> > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming
>> > > > container =
>> > > > > > new
>> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming(config); container.run();
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>> -Jay
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay
>> > Kreps <
>> > > > > > >> > > > [email protected]
>> > > > > > >> > > > > >
>> > > > > > >> > > > > >> > >>>> wrote:
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Hey guys,
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations
>> Chris
>> > > and
>> > > > I
>> > > > > > were
>> > > > > > >> > > having
>> > > > > > >> > > > > >> > >>>>>>>> around
>> > > > > > >> > > > > >> > >>>>>>> whether
>> > > > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a
>> kind
>> > > of
>> > > > > data
>> > > > > > >> > > > ingestion
>> > > > > > >> > > > > >> > >>>>> framework
>> > > > > > >> > > > > >> > >>>>>>> for
>> > > > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26
>> > > > > "copycat").
>> > > > > > >> This
>> > > > > > >> > > > kind
>> > > > > > >> > > > > of
>> > > > > > >> > > > > >> > >>>>>> combined
>> > > > > > >> > > > > >> > >>>>>>>> with complaints around config and YARN
>> and
>> > > the
>> > > > > > >> > discussion
>> > > > > > >> > > > > >> around
>> > > > > > >> > > > > >> > >>>>>>>> how
>> > > > > > >> > > > > >> > >>>>> to
>> > > > > > >> > > > > >> > >>>>>>>> best do a standalone mode.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given
>> that
>> > > > Samza
>> > > > > > was
>> > > > > > >> > > > basically
>> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what if
>> > you
>> > > > just
>> > > > > > >> > embraced
>> > > > > > >> > > > > that
>> > > > > > >> > > > > >> > >>>>>>>> and turned it
>> > > > > > >> > > > > >> > >>>>>> into
>> > > > > > >> > > > > >> > >>>>>>>> something less like a heavyweight
>> > framework
>> > > > and
>> > > > > > more
>> > > > > > >> > > like a
>> > > > > > >> > > > > >> > >>>>>>>> third
>> > > > > > >> > > > > >> > >>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>> client--a kind of "producing consumer"
>> > with
>> > > > > state
>> > > > > > >> > > > management
>> > > > > > >> > > > > >> > >>>>>> facilities.
>> > > > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a
>> complex
>> > > > stream
>> > > > > > >> > > processing
>> > > > > > >> > > > > >> > >>>>>>>> framework
>> > > > > > >> > > > > >> > >>>>>>> this
>> > > > > > >> > > > > >> > >>>>>>>> would actually be a very simple thing,
>> not
>> > > > much
>> > > > > > more
>> > > > > > >> > > > > >> complicated
>> > > > > > >> > > > > >> > >>>>>>>> to
>> > > > > > >> > > > > >> > >>>>> use
>> > > > > > >> > > > > >> > >>>>>>> or
>> > > > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris
>> > said
>> > > > we
>> > > > > > >> thought
>> > > > > > >> > > > about
>> > > > > > >> > > > > >> it
>> > > > > > >> > > > > >> > >>>>>>>> a
>> > > > > > >> > > > > >> > >>>>> lot
>> > > > > > >> > > > > >> > >>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream
>> > processing
>> > > > > > systems
>> > > > > > >> > were
>> > > > > > >> > > > > doing)
>> > > > > > >> > > > > >> > >>>>> seemed
>> > > > > > >> > > > > >> > >>>>>>> like
>> > > > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output
>> data
>> > to
>> > > > and
>> > > > > > from
>> > > > > > >> > the
>> > > > > > >> > > > > stream
>> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually looked
>> > into
>> > > > how
>> > > > > > that
>> > > > > > >> > > would
>> > > > > > >> > > > > >> > >>>>>>>> work,
>> > > > > > >> > > > > >> > >>>>> Samza
>> > > > > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion
>> > > framework
>> > > > > > for a
>> > > > > > >> > > bunch
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > >>>>> reasons.
>> > > > > > >> > > > > >> > >>>>>> To
>> > > > > > >> > > > > >> > >>>>>>>> really do that right you need a pretty
>> > > > different
>> > > > > > >> > internal
>> > > > > > >> > > > > data
>> > > > > > >> > > > > >> > >>>>>>>> model
>> > > > > > >> > > > > >> > >>>>>> and
>> > > > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split them
>> and
>> > > had
>> > > > > an
>> > > > > > api
>> > > > > > >> > for
>> > > > > > >> > > > > Kafka
>> > > > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26)
>> and a
>> > > > > separate
>> > > > > > >> api
>> > > > > > >> > > for
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > >>>>>>>> transformation (Samza).
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> This would also allow really embracing
>> the
>> > > > same
>> > > > > > >> > > terminology
>> > > > > > >> > > > > and
>> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about the
>> > current
>> > > > > > state is
>> > > > > > >> > > that
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>>>>>> two
>> > > > > > >> > > > > >> > >>>>>>> systems
>> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology
>> like
>> > > > > "stream"
>> > > > > > vs
>> > > > > > >> > > > "topic"
>> > > > > > >> > > > > >> and
>> > > > > > >> > > > > >> > >>>>>>> different
>> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems means you
>> > kind
>> > > > of
>> > > > > > have
>> > > > > > >> to
>> > > > > > >> > > > learn
>> > > > > > >> > > > > >> > >>>>>>>> Kafka's
>> > > > > > >> > > > > >> > >>>>>>> way,
>> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different
>> way,
>> > > > then
>> > > > > > kind
>> > > > > > >> of
>> > > > > > >> > > > > >> > >>>>>>>> understand
>> > > > > > >> > > > > >> > >>>>> how
>> > > > > > >> > > > > >> > >>>>>>> they
>> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having walked
>> a
>> > few
>> > > > > > people
>> > > > > > >> > > through
>> > > > > > >> > > > > >> this
>> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to
>> get.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of
>> time
>> > on
>> > > > > > >> airplanes I
>> > > > > > >> > > > > hacked
>> > > > > > >> > > > > >> > >>>>>>>> up an ernest but still somewhat
>> incomplete
>> > > > > > prototype
>> > > > > > >> of
>> > > > > > >> > > > what
>> > > > > > >> > > > > >> > >>>>>>>> this would
>> > > > > > >> > > > > >> > >>>>> look
>> > > > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously
>> dumped
>> > > into
>> > > > > > Kafka
>> > > > > > >> as
>> > > > > > >> > > it
>> > > > > > >> > > > > >> > >>>>>>>> required a
>> > > > > > >> > > > > >> > >>>>>> few
>> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is
>> the
>> > > code:
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > >
>> > > > > > >> >
>> > > > > >
>> > >
>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
>> > > > > > >> > > > > >> > >>>>> /apache/kafka/clients/streaming
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I just
>> > > > > liberally
>> > > > > > >> > renamed
>> > > > > > >> > > > > >> > >>>>>>>> everything
>> > > > > > >> > > > > >> > >>>>> to
>> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no
>> regard
>> > > for
>> > > > > > >> > > > compatibility.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> To use this would be something like
>> this:
>> > > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
>> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers",
>> > > > > "localhost:4242");
>> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new
>> > > > > > >> > > > > >> > >>>>> StreamingConfig(props);
>> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
>> > > > > > >> > > > > >> > >>>>>>>> "test-topic-2");
>> > > > > > >> > > > > >> config.processor(ExampleStreamProcessor.class);
>> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
>> > > > > > >> > > > > >> > >>>>>>>> StringSerializer(), new
>> > > StringDeserializer());
>> > > > > > >> > > > KafkaStreaming
>> > > > > > >> > > > > >> > >>>>>> container =
>> > > > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config);
>> > container.run();
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
>> > > > SamzaContainer;
>> > > > > > >> > > > > StreamProcessor
>> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the class
>> names
>> > > in
>> > > > a
>> > > > > > file
>> > > > > > >> > and
>> > > > > > >> > > > then
>> > > > > > >> > > > > >> > >>>>>>>> having
>> > > > > > >> > > > > >> > >>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just
>> > > > > instantiate
>> > > > > > the
>> > > > > > >> > > > > container
>> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced over
>> > > > however
>> > > > > > many
>> > > > > > >> > > > > instances
>> > > > > > >> > > > > >> > >>>>>>>> of
>> > > > > > >> > > > > >> > >>>>> this
>> > > > > > >> > > > > >> > >>>>>>> are
>> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an instance
>> > dies,
>> > > > new
>> > > > > > >> tasks
>> > > > > > >> > > are
>> > > > > > >> > > > > >> added
>> > > > > > >> > > > > >> > >>>>>>>> to
>> > > > > > >> > > > > >> > >>>>> the
>> > > > > > >> > > > > >> > >>>>>>>> existing containers without shutting
>> them
>> > > > down).
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for running
>> > this
>> > > > > stuff
>> > > > > > in
>> > > > > > >> > YARN
>> > > > > > >> > > > via
>> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS
>> using
>> > > some
>> > > > > of
>> > > > > > >> their
>> > > > > > >> > > > tools
>> > > > > > >> > > > > >> > >>>>>>>> but from the
>> > > > > > >> > > > > >> > >>>>>> point
>> > > > > > >> > > > > >> > >>>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these stream
>> > > > processing
>> > > > > > jobs
>> > > > > > >> > are
>> > > > > > >> > > > > just
>> > > > > > >> > > > > >> > >>>>>> stateless
>> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and
>> expand
>> > and
>> > > > > > contract
>> > > > > > >> > at
>> > > > > > >> > > > > will.
>> > > > > > >> > > > > >> > >>>>>>>> There
>> > > > > > >> > > > > >> > >>>>> is
>> > > > > > >> > > > > >> > >>>>>>> no
>> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code, it
>> > would
>> > > > get
>> > > > > > >> larger
>> > > > > > >> > > if
>> > > > > > >> > > > we
>> > > > > > >> > > > > >> > >>>>>>>>  productionized but not vastly larger.
>> We
>> > > > really
>> > > > > > do
>> > > > > > >> > get a
>> > > > > > >> > > > ton
>> > > > > > >> > > > > >> > >>>>>>>> of
>> > > > > > >> > > > > >> > >>>>>>> leverage
>> > > > > > >> > > > > >> > >>>>>>>>  out of Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>  2. Partition management is fully
>> > delegated
>> > > to
>> > > > > the
>> > > > > > >> new
>> > > > > > >> > > > > >> consumer.
>> > > > > > >> > > > > >> > >>>>> This
>> > > > > > >> > > > > >> > >>>>>>>>  is nice since now any partition
>> > management
>> > > > > > strategy
>> > > > > > >> > > > > available
>> > > > > > >> > > > > >> > >>>>>>>> to
>> > > > > > >> > > > > >> > >>>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>>  consumer is also available to Samza
>> (and
>> > > vice
>> > > > > > versa)
>> > > > > > >> > and
>> > > > > > >> > > > > with
>> > > > > > >> > > > > >> > >>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>> exact
>> > > > > > >> > > > > >> > >>>>>>>>  same configs.
>> > > > > > >> > > > > >> > >>>>>>>>  3. It supports state as well as state
>> > reuse
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is
>> > thought
>> > > > > > >> provoking.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> -Jay
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris
>> > > > > Riccomini <
>> > > > > > >> > > > > >> > >>>>>> [email protected]>
>> > > > > > >> > > > > >> > >>>>>>>> wrote:
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Hey all,
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza
>> > > > > engineers
>> > > > > > at
>> > > > > > >> > > > LinkedIn
>> > > > > > >> > > > > >> > >>>>>>>>> and
>> > > > > > >> > > > > >> > >>>>>>> Confluent
>> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few observations
>> > and
>> > > > > would
>> > > > > > >> like
>> > > > > > >> > to
>> > > > > > >> > > > > >> > >>>>>>>>> propose
>> > > > > > >> > > > > >> > >>>>> some
>> > > > > > >> > > > > >> > >>>>>>>>> changes.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I
>> want to
>> > > > call
>> > > > > > out
>> > > > > > >> > about
>> > > > > > >> > > > > >> > >>>>>>>>> Samza's
>> > > > > > >> > > > > >> > >>>>>> design,
>> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some changes.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic
>> > > > deployment
>> > > > > > >> system.
>> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
>> > > > > > >> > > > > >> > >>>>>>>>> * Samza's
>> SystemConsumer/SystemProducer
>> > and
>> > > > > > Kafka's
>> > > > > > >> > > > consumer
>> > > > > > >> > > > > >> > >>>>>>>>> APIs
>> > > > > > >> > > > > >> > >>>>> are
>> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same
>> > problems.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are related,
>> > but
>> > > > I'll
>> > > > > > >> > address
>> > > > > > >> > > > them
>> > > > > > >> > > > > >> in
>> > > > > > >> > > > > >> > >>>>> order.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Deployment
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a
>> > > > dynamic
>> > > > > > >> > > deployment
>> > > > > > >> > > > > >> > >>>>>>>>> scheduler
>> > > > > > >> > > > > >> > >>>>>> such
>> > > > > > >> > > > > >> > >>>>>>>>> as
>> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially
>> built
>> > > > > Samza,
>> > > > > > we
>> > > > > > >> > bet
>> > > > > > >> > > > that
>> > > > > > >> > > > > >> > >>>>>>>>> there
>> > > > > > >> > > > > >> > >>>>>> would
>> > > > > > >> > > > > >> > >>>>>>>>> be
>> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and
>> we
>> > > could
>> > > > > > >> support
>> > > > > > >> > > > them,
>> > > > > > >> > > > > >> and
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>> rest
>> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are
>> many
>> > > > > > >> variations.
>> > > > > > >> > > > > >> > >>>>>>>>> Furthermore,
>> > > > > > >> > > > > >> > >>>>>> many
>> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just start
>> their
>> > > > > > processors
>> > > > > > >> > like
>> > > > > > >> > > > > normal
>> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional
>> > > > deployment
>> > > > > > >> scripts
>> > > > > > >> > > > such
>> > > > > > >> > > > > as
>> > > > > > >> > > > > >> > >>>>>>>>> Fabric,
>> > > > > > >> > > > > >> > >>>>>> Chef,
>> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment
>> system
>> > > on
>> > > > > > users
>> > > > > > >> > makes
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really painful
>> for
>> > > > first
>> > > > > > time
>> > > > > > >> > > > users.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement
>> was
>> > > also
>> > > > a
>> > > > > > bit
>> > > > > > >> of
>> > > > > > >> > a
>> > > > > > >> > > > > >> > >>>>>>>>> mis-fire
>> > > > > > >> > > > > >> > >>>>>> because
>> > > > > > >> > > > > >> > >>>>>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding between
>> > the
>> > > > > > nature of
>> > > > > > >> > > batch
>> > > > > > >> > > > > >> jobs
>> > > > > > >> > > > > >> > >>>>>>>>> and
>> > > > > > >> > > > > >> > >>>>>>> stream
>> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made
>> > > conscious
>> > > > > > effort
>> > > > > > >> to
>> > > > > > >> > > > favor
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>> Hadoop
>> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things,
>> since
>> > it
>> > > > > worked
>> > > > > > >> and
>> > > > > > >> > > was
>> > > > > > >> > > > > well
>> > > > > > >> > > > > >> > >>>>>>> understood.
>> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that
>> batch
>> > > jobs
>> > > > > > have a
>> > > > > > >> > > > definite
>> > > > > > >> > > > > >> > >>>>>> beginning,
>> > > > > > >> > > > > >> > >>>>>>>>> and
>> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs don't
>> > > > > (usually).
>> > > > > > >> This
>> > > > > > >> > > > leads
>> > > > > > >> > > > > to
>> > > > > > >> > > > > >> > >>>>>>>>> a
>> > > > > > >> > > > > >> > >>>>> much
>> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for stream
>> > > > > processors.
>> > > > > > >> You
>> > > > > > >> > > > > >> basically
>> > > > > > >> > > > > >> > >>>>>>>>> just
>> > > > > > >> > > > > >> > >>>>>>> need
>> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the
>> processor,
>> > and
>> > > > > start
>> > > > > > >> it.
>> > > > > > >> > > The
>> > > > > > >> > > > > way
>> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no
>> > > concept
>> > > > > of
>> > > > > > a
>> > > > > > >> > > cluster
>> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always
>> > > > > > >> > > > > >> > >>>>>> add
>> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with
>> coupling
>> > > > Samza
>> > > > > > with
>> > > > > > >> a
>> > > > > > >> > > > > >> scheduler
>> > > > > > >> > > > > >> > >>>>>>>>> is
>> > > > > > >> > > > > >> > >>>>>> that
>> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to
>> handle
>> > > > > > deployment.
>> > > > > > >> > > This
>> > > > > > >> > > > > >> pulls
>> > > > > > >> > > > > >> > >>>>>>>>> in a
>> > > > > > >> > > > > >> > >>>>>>> bunch
>> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration
>> > > distribution
>> > > > > > (config
>> > > > > > >> > > > > stream),
>> > > > > > >> > > > > >> > >>>>>>>>> shell
>> > > > > > >> > > > > >> > >>>>>>> scrips
>> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging
>> > (all
>> > > > the
>> > > > > > .tgz
>> > > > > > >> > > > stuff),
>> > > > > > >> > > > > >> etc.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic
>> > > > deployment
>> > > > > > was
>> > > > > > >> to
>> > > > > > >> > > > > support
>> > > > > > >> > > > > >> > >>>>>>>>> data locality. If you want to have
>> > > locality,
>> > > > > you
>> > > > > > >> need
>> > > > > > >> > to
>> > > > > > >> > > > put
>> > > > > > >> > > > > >> > >>>>>>>>> your
>> > > > > > >> > > > > >> > >>>>>> processors
>> > > > > > >> > > > > >> > >>>>>>>>> close to the data they're processing.
>> > Upon
>> > > > > > further
>> > > > > > >> > > > > >> > >>>>>>>>> investigation,
>> > > > > > >> > > > > >> > >>>>>>> though,
>> > > > > > >> > > > > >> > >>>>>>>>> this feature is not that beneficial.
>> > There
>> > > is
>> > > > > > some
>> > > > > > >> > good
>> > > > > > >> > > > > >> > >>>>>>>>> discussion
>> > > > > > >> > > > > >> > >>>>>> about
>> > > > > > >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335.
>> > Again,
>> > > we
>> > > > > > took
>> > > > > > >> the
>> > > > > > >> > > > > >> > >>>>>>>>> Map/Reduce
>> > > > > > >> > > > > >> > >>>>>> path,
>> > > > > > >> > > > > >> > >>>>>>>>> but
>> > > > > > >> > > > > >> > >>>>>>>>> there are some fundamental differences
>> > > > between
>> > > > > > HDFS
>> > > > > > >> > and
>> > > > > > >> > > > > Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>> HDFS
>> > > > > > >> > > > > >> > >>>>>> has
>> > > > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions.
>> This
>> > > > leads
>> > > > > to
>> > > > > > >> less
>> > > > > > >> > > > > >> > >>>>>>>>> optimization potential with stream
>> > > processors
>> > > > > on
>> > > > > > top
>> > > > > > >> > of
>> > > > > > >> > > > > Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch.
>> > > Samza
>> > > > > > doesn't
>> > > > > > >> > > have
>> > > > > > >> > > > > any
>> > > > > > >> > > > > >> > >>>>>>>>> built
>> > > > > > >> > > > > >> > >>>>> in
>> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it
>> > depends
>> > > on
>> > > > > the
>> > > > > > >> > > dynamic
>> > > > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to handle
>> > > > restarts
>> > > > > > >> when a
>> > > > > > >> > > > > >> > >>>>>>>>> processor dies. This has
>> > > > > > >> > > > > >> > >>>>>>> made
>> > > > > > >> > > > > >> > >>>>>>>>> it very difficult to write a
>> standalone
>> > > Samza
>> > > > > > >> > container
>> > > > > > >> > > > > >> > >>>> (SAMZA-516).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Pluggability
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good,
>> but I
>> > > > think
>> > > > > > that
>> > > > > > >> > > we've
>> > > > > > >> > > > > >> gone
>> > > > > > >> > > > > >> > >>>>>>>>> too
>> > > > > > >> > > > > >> > >>>>>> far
>> > > > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems
>> > > > (SystemConsumer,
>> > > > > > >> > > > > SystemProducer,
>> > > > > > >> > > > > >> > >>>> etc).
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about
>> > every
>> > > > > > >> component
>> > > > > > >> > > > > >> > >>>>> (MessageChooser,
>> > > > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper,
>> > > ConfigRewriter,
>> > > > > > etc).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've
>> > forgotten,
>> > > as
>> > > > > > well.
>> > > > > > >> > Some
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > >>>>>>>>> these
>> > > > > > >> > > > > >> > >>>>> are
>> > > > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not to
>> be.
>> > > This
>> > > > > all
>> > > > > > >> comes
>> > > > > > >> > > at
>> > > > > > >> > > > a
>> > > > > > >> > > > > >> cost:
>> > > > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is making
>> it
>> > > > harder
>> > > > > > for
>> > > > > > >> > our
>> > > > > > >> > > > > users
>> > > > > > >> > > > > >> > >>>>>>>>> to
>> > > > > > >> > > > > >> > >>>>> pick
>> > > > > > >> > > > > >> > >>>>>> up
>> > > > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It also
>> > makes
>> > > > it
>> > > > > > >> > difficult
>> > > > > > >> > > > for
>> > > > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about what
>> the
>> > > > > > >> > > characteristics
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > >>>>>>>>> the container (since the
>> characteristics
>> > > > change
>> > > > > > >> > > depending
>> > > > > > >> > > > on
>> > > > > > >> > > > > >> > >>>>>>>>> which plugins are use).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most
>> > > visible
>> > > > > in
>> > > > > > the
>> > > > > > >> > > > System
>> > > > > > >> > > > > >> APIs.
>> > > > > > >> > > > > >> > >>>>> What
>> > > > > > >> > > > > >> > >>>>>>>>> Samza really requires to be
>> functional is
>> > > > Kafka
>> > > > > > as
>> > > > > > >> its
>> > > > > > >> > > > > >> > >>>>>>>>> transport
>> > > > > > >> > > > > >> > >>>>>> layer.
>> > > > > > >> > > > > >> > >>>>>>>>> But
>> > > > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use
>> cases
>> > > into
>> > > > > one
>> > > > > > >> API:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports both
>> of
>> > > these
>> > > > > use
>> > > > > > >> > cases.
>> > > > > > >> > > > The
>> > > > > > >> > > > > >> > >>>>>>>>> problem
>> > > > > > >> > > > > >> > >>>>>> is,
>> > > > > > >> > > > > >> > >>>>>>>>> we
>> > > > > > >> > > > > >> > >>>>>>>>> actually want different features for
>> each
>> > > use
>> > > > > > case.
>> > > > > > >> By
>> > > > > > >> > > > > >> papering
>> > > > > > >> > > > > >> > >>>>>>>>> over
>> > > > > > >> > > > > >> > >>>>>>> these
>> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a single
>> > API,
>> > > > > we've
>> > > > > > >> > > > introduced
>> > > > > > >> > > > > a
>> > > > > > >> > > > > >> > >>>>>>>>> ton of
>> > > > > > >> > > > > >> > >>>>>>> leaky
>> > > > > > >> > > > > >> > >>>>>>>>> abstractions.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in
>> (2)
>> > > is
>> > > > to
>> > > > > > have
>> > > > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for
>> > offsets
>> > > > > (like
>> > > > > > >> > Kafka).
>> > > > > > >> > > > > This
>> > > > > > >> > > > > >> > >>>>>>>>> would be at odds
>> > > > > > >> > > > > >> > >>>>> with
>> > > > > > >> > > > > >> > >>>>>>> (1),
>> > > > > > >> > > > > >> > >>>>>>>>> though, since different systems have
>> > > > different
>> > > > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
>> > > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the
>> mailing
>> > > list
>> > > > > and
>> > > > > > >> the
>> > > > > > >> > > SQL
>> > > > > > >> > > > > >> JIRAs
>> > > > > > >> > > > > >> > >>>>> about
>> > > > > > >> > > > > >> > >>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>>> need for this.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for
>> > > replayability.
>> > > > > > Kafka
>> > > > > > >> > > allows
>> > > > > > >> > > > us
>> > > > > > >> > > > > >> to
>> > > > > > >> > > > > >> > >>>>> rewind
>> > > > > > >> > > > > >> > >>>>>>>>> when
>> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other systems
>> > > don't.
>> > > > In
>> > > > > > some
>> > > > > > >> > > > cases,
>> > > > > > >> > > > > >> > >>>>>>>>> systems
>> > > > > > >> > > > > >> > >>>>>>> return
>> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g.
>> > > > > > >> WikipediaSystemConsumer)
>> > > > > > >> > > > > because
>> > > > > > >> > > > > >> > >>>>>>>>> they
>> > > > > > >> > > > > >> > >>>>>> have
>> > > > > > >> > > > > >> > >>>>>>> no
>> > > > > > >> > > > > >> > >>>>>>>>> offsets.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka
>> > > > supports
>> > > > > > >> > > > > partitioning,
>> > > > > > >> > > > > >> > >>>>>>>>> but
>> > > > > > >> > > > > >> > >>>>> many
>> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by
>> having a
>> > > > single
>> > > > > > >> > > partition
>> > > > > > >> > > > > for
>> > > > > > >> > > > > >> > >>>>>>>>> those systems. Still, other systems
>> model
>> > > > > > >> partitioning
>> > > > > > >> > > > > >> > >>>> differently (e.g.
>> > > > > > >> > > > > >> > >>>>>>>>> Kinesis).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a
>> mess.
>> > > > > > Creating
>> > > > > > >> > > streams
>> > > > > > >> > > > > in
>> > > > > > >> > > > > >> a
>> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost
>> impossible.
>> > > As
>> > > > is
>> > > > > > >> > modeling
>> > > > > > >> > > > > >> > >>>>>>>>> metadata
>> > > > > > >> > > > > >> > >>>>> for
>> > > > > > >> > > > > >> > >>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor,
>> partitions,
>> > > > > location,
>> > > > > > >> > etc).
>> > > > > > >> > > > The
>> > > > > > >> > > > > >> > >>>>>>>>> list
>> > > > > > >> > > > > >> > >>>>> goes
>> > > > > > >> > > > > >> > >>>>>>> on.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Duplicate work
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing
>> Samza,
>> > > > > Kafka's
>> > > > > > >> > > consumer
>> > > > > > >> > > > > and
>> > > > > > >> > > > > >> > >>>>> producer
>> > > > > > >> > > > > >> > >>>>>>>>> APIs
>> > > > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On
>> the
>> > > > > > >> > consumer-side,
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> > >>>>>>>>> had two
>> > > > > > >> > > > > >> > >>>>>>>>> options: use the high level consumer,
>> or
>> > > the
>> > > > > > simple
>> > > > > > >> > > > > consumer.
>> > > > > > >> > > > > >> > >>>>>>>>> The
>> > > > > > >> > > > > >> > >>>>>>> problem
>> > > > > > >> > > > > >> > >>>>>>>>> with the high-level consumer was that
>> it
>> > > > > > controlled
>> > > > > > >> > your
>> > > > > > >> > > > > >> > >>>>>>>>> offsets, partition assignments, and
>> the
>> > > order
>> > > > > in
>> > > > > > >> which
>> > > > > > >> > > you
>> > > > > > >> > > > > >> > >>>>>>>>> received messages. The
>> > > > > > >> > > > > >> > >>>>> problem
>> > > > > > >> > > > > >> > >>>>>>>>> with
>> > > > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not
>> > > simple.
>> > > > > It's
>> > > > > > >> > basic.
>> > > > > > >> > > > You
>> > > > > > >> > > > > >> > >>>>>>>>> end up
>> > > > > > >> > > > > >> > >>>>>>> having
>> > > > > > >> > > > > >> > >>>>>>>>> to handle a lot of really low-level
>> stuff
>> > > > that
>> > > > > > you
>> > > > > > >> > > > > shouldn't.
>> > > > > > >> > > > > >> > >>>>>>>>> We
>> > > > > > >> > > > > >> > >>>>>> spent a
>> > > > > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's
>> > > > KafkaSystemConsumer
>> > > > > > very
>> > > > > > >> > > > robust.
>> > > > > > >> > > > > >> It
>> > > > > > >> > > > > >> > >>>>>>>>> also allows us to support some cool
>> > > features:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and
>> > > > > > prioritization.
>> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition
>> assignment
>> > > to
>> > > > > > support
>> > > > > > >> > > > joins,
>> > > > > > >> > > > > >> > >>>>>>>>> global
>> > > > > > >> > > > > >> > >>>>>> state
>> > > > > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)), etc.
>> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset
>> > checkpointing.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is
>> > that
>> > > > > these
>> > > > > > >> > > features
>> > > > > > >> > > > > >> > >>>>>>>>> should
>> > > > > > >> > > > > >> > >>>>>>> actually
>> > > > > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers
>> > (not
>> > > > just
>> > > > > > >> Samza
>> > > > > > >> > > > stream
>> > > > > > >> > > > > >> > >>>>>> processors)
>> > > > > > >> > > > > >> > >>>>>>>>> end up wanting to do things like joins
>> > and
>> > > > > > partition
>> > > > > > >> > > > > >> > >>>>>>>>> assignment. The
>> > > > > > >> > > > > >> > >>>>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>>> community has come to the same
>> > conclusion.
>> > > > > > They're
>> > > > > > >> > > adding
>> > > > > > >> > > > a
>> > > > > > >> > > > > >> ton
>> > > > > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka
>> consumer
>> > > > > > >> > > implementation.
>> > > > > > >> > > > > To a
>> > > > > > >> > > > > >> > >>>>>>>>> large extent,
>> > > > > > >> > > > > >> > >>>>> it's
>> > > > > > >> > > > > >> > >>>>>>>>> duplicate work to what we've already
>> done
>> > > in
>> > > > > > Samza.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking
>> a
>> > > very
>> > > > > > similar
>> > > > > > >> > > > > approach
>> > > > > > >> > > > > >> > >>>>>>>>> to
>> > > > > > >> > > > > >> > >>>>>> Samza's
>> > > > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation
>> for
>> > > > > > handling
>> > > > > > >> > > offset
>> > > > > > >> > > > > >> > >>>>>> checkpointing.
>> > > > > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset
>> management
>> > > > > feature
>> > > > > > >> > stores
>> > > > > > >> > > > > >> offset
>> > > > > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows
>> you to
>> > > > fetch
>> > > > > > them
>> > > > > > >> > > from
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>>>>>>> broker.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste,
>> since
>> > we
>> > > > > could
>> > > > > > >> have
>> > > > > > >> > > > shared
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>> work
>> > > > > > >> > > > > >> > >>>>>> if
>> > > > > > >> > > > > >> > >>>>>>>>> it
>> > > > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the
>> get-go.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Vision
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather
>> radical
>> > > > > > proposal.
>> > > > > > >> > Samza
>> > > > > > >> > > > is
>> > > > > > >> > > > > >> > >>>>> relatively
>> > > > > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to
>> say
>> > > that
>> > > > > > we're
>> > > > > > >> > > near a
>> > > > > > >> > > > > 1.0
>> > > > > > >> > > > > >> > >>>>>> release.
>> > > > > > >> > > > > >> > >>>>>>>>> I'd
>> > > > > > >> > > > > >> > >>>>>>>>> like to propose that we take what
>> we've
>> > > > > learned,
>> > > > > > and
>> > > > > > >> > > begin
>> > > > > > >> > > > > >> > >>>>>>>>> thinking
>> > > > > > >> > > > > >> > >>>>>>> about
>> > > > > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we
>> change if
>> > > we
>> > > > > were
>> > > > > > >> > > starting
>> > > > > > >> > > > > >> from
>> > > > > > >> > > > > >> > >>>>>> scratch?
>> > > > > > >> > > > > >> > >>>>>>>>> My
>> > > > > > >> > > > > >> > >>>>>>>>> proposal is to:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only*
>> way
>> > to
>> > > > run
>> > > > > > Samza
>> > > > > > >> > > > > >> > >>>>>>>>> processors, and eliminate all direct
>> > > > > dependences
>> > > > > > on
>> > > > > > >> > > YARN,
>> > > > > > >> > > > > >> Mesos,
>> > > > > > >> > > > > >> > >>>> etc.
>> > > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support
>> only
>> > > > Kafka
>> > > > > > as
>> > > > > > >> the
>> > > > > > >> > > > > stream
>> > > > > > >> > > > > >> > >>>>>> processing
>> > > > > > >> > > > > >> > >>>>>>>>> layer.
>> > > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging,
>> > > > > > >> serialization,
>> > > > > > >> > > and
>> > > > > > >> > > > > >> > >>>>>>>>> config
>> > > > > > >> > > > > >> > >>>>>>> systems,
>> > > > > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that
>> I
>> > > > > outlined
>> > > > > > >> > above.
>> > > > > > >> > > It
>> > > > > > >> > > > > >> > >>>>>>>>> should
>> > > > > > >> > > > > >> > >>>>> also
>> > > > > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty
>> > > > dramatically.
>> > > > > > >> > > Supporting
>> > > > > > >> > > > > >> only
>> > > > > > >> > > > > >> > >>>>>>>>> a standalone container will allow
>> Samza
>> > to
>> > > be
>> > > > > > >> executed
>> > > > > > >> > > on
>> > > > > > >> > > > > YARN
>> > > > > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using
>> > > > Marathon/Aurora),
>> > > > > or
>> > > > > > >> most
>> > > > > > >> > > > other
>> > > > > > >> > > > > >> > >>>>>>>>> in-house
>> > > > > > >> > > > > >> > >>>>>>> deployment
>> > > > > > >> > > > > >> > >>>>>>>>> systems. This should make life a lot
>> > easier
>> > > > for
>> > > > > > new
>> > > > > > >> > > users.
>> > > > > > >> > > > > >> > >>>>>>>>> Imagine
>> > > > > > >> > > > > >> > >>>>>>> having
>> > > > > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN.
>> > The
>> > > > drop
>> > > > > > in
>> > > > > > >> > > mailing
>> > > > > > >> > > > > >> list
>> > > > > > >> > > > > >> > >>>>>> traffic
>> > > > > > >> > > > > >> > >>>>>>>>> will be pretty dramatic.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long
>> overdue to
>> > > me.
>> > > > > The
>> > > > > > >> > > reality
>> > > > > > >> > > > > is,
>> > > > > > >> > > > > >> > >>>>> everyone
>> > > > > > >> > > > > >> > >>>>>>>>> that
>> > > > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with
>> Kafka.
>> > We
>> > > > > > basically
>> > > > > > >> > > > require
>> > > > > > >> > > > > >> it
>> > > > > > >> > > > > >> > >>>>>> already
>> > > > > > >> > > > > >> > >>>>>>> in
>> > > > > > >> > > > > >> > >>>>>>>>> order for most features to work. Those
>> > that
>> > > > are
>> > > > > > >> using
>> > > > > > >> > > > other
>> > > > > > >> > > > > >> > >>>>>>>>> systems
>> > > > > > >> > > > > >> > >>>>>> are
>> > > > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into
>> Kafka
>> > > (1),
>> > > > > and
>> > > > > > >> then
>> > > > > > >> > > > they
>> > > > > > >> > > > > do
>> > > > > > >> > > > > >> > >>>>>>>>> the processing on top. There is
>> already
>> > > > > > discussion (
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > >
>> > > > > > >> >
>> > > > > >
>> > >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
>> > > > > > >> > > > > >> > >>>>> 767
>> > > > > > >> > > > > >> > >>>>>>>>> )
>> > > > > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka
>> > > > extremely
>> > > > > > >> easy.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with
>> > Kafka,
>> > > > we
>> > > > > > can
>> > > > > > >> > > > leverage
>> > > > > > >> > > > > a
>> > > > > > >> > > > > >> > >>>>>>>>> ton of
>> > > > > > >> > > > > >> > >>>>>>> their
>> > > > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to
>> maintain
>> > > our
>> > > > > own
>> > > > > > >> > config,
>> > > > > > >> > > > > >> > >>>>>>>>> metrics,
>> > > > > > >> > > > > >> > >>>>> etc.
>> > > > > > >> > > > > >> > >>>>>>> We
>> > > > > > >> > > > > >> > >>>>>>>>> can all share the same libraries, and
>> > make
>> > > > them
>> > > > > > >> > better.
>> > > > > > >> > > > This
>> > > > > > >> > > > > >> > >>>>>>>>> will
>> > > > > > >> > > > > >> >
>> ...
>>
>> [Message clipped]
>
>
>

Re: Thoughts and obesrvations on Samza

Reply via email to