Hi, Julian and Martin,

Good point on community-merging vs project-merging and good summary!

For Julian's point #2, I think he was referring to support for
integrating with a cluster job execution framework, like YARN/Mesos/AWS, and
to the question of who (i.e. which community) and which project (i.e. which
codebase) would support this integration layer. My personal preference is
that this should be a sub-project or separate project (code-wise) on top of
Samza, supported by the Samza community (or at least with a good overlap
with the Samza community). Personally, I view it as a distributed job
execution framework for stream processing, just like YARN+Slider for
MapReduce jobs, if that makes sense.


On Thu, Jul 9, 2015 at 10:14 AM, Martin Kleppmann <mar...@kleppmann.com>
wrote:

> Thanks Julian for calling out the principle of community over code, which
> is super important. If it was just a matter of code, the Kafka project
> could simply pull in the Samza code (or write a new stream processor)
> without asking permission -- but they wouldn't get the Samza community.
> Thus, I think the community aspect is the most important part of this
> discussion. If we're talking about merging projects, it's really about
> merging communities.
>
> I had a chat with a friend who is a Lucene/Solr committer: those were also
> originally two separate projects, which merged into one. He said the merge
> was not always easy, but probably a net win for both projects and
> communities overall. In their community people tend to specialise on either
> the Lucene part or the Solr part, but that's ok -- it's still a cohesive
> community nevertheless, and it benefits from close collaboration due to
> having everyone in the same project. Releases didn't slow down; in fact,
> they perhaps got faster due to less cross-project coordination overhead. So
> that allayed my concerns about a big project becoming slow.
>
> Besides community and code/architecture, another consideration is our user
> base (including those who are not on this mailing list). What is good for
> our users? I've thought about this more over the last few days:
>
> - Reducing users' confusion is good. If someone is adopting Kafka, they
> will also need some way of processing their data in Kafka. At the moment,
> the Kafka docs give you consumer APIs but nothing more. Having to choose a
> separate stream processing framework is a burden on users, especially if
> that framework uses terminology that is inconsistent with Kafka. If we make
> Samza a part of Kafka and unify the terminology, it would become a coherent
> part of the documentation, and be much less confusing for users.
>
> - Making it easy for users to get started is good. Simplifying the API and
> configuration is part of it. Making YARN optional is also good. It would
> also help to be part of the same package that people download, and part of
> the same documentation. (Simplifying API/config and decoupling from YARN
> can be done as a separate project; becoming part of the same package would
> require merging projects.)
>
> - Supporting users' choice of programming language is good. I used to work
> with Ruby, and in the Ruby community there are plenty of people with an
> irrational hatred of the JVM. I imagine other language communities are
> likely similar. If Samza becomes a fairly thin client library to Kafka
> (using partition assignment etc provided by the Kafka brokers), then it
> becomes much more feasible to implement the same interface in other
> languages too, giving true multi-language support.
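The "thin client library" shape Martin describes might look something like the following sketch (Python for brevity; every name here is hypothetical, not an actual Samza or Kafka API, and the broker-side group coordination is mocked as a plain list of partitions):

```python
# Hypothetical sketch of "Samza as a library": the framework shrinks to a
# task interface plus a run loop, with group membership and partition
# assignment delegated to the Kafka brokers (mocked here as a plain dict).

class StreamTask:
    """User code: transform one incoming message, emit zero or more results."""
    def process(self, message):
        raise NotImplementedError

class UppercaseTask(StreamTask):
    def process(self, message):
        yield message.upper()

def run_task(task, assigned_partitions, poll):
    """Minimal run loop: poll the assigned partitions and hand each message
    to the task. In a real client, `assigned_partitions` would come from
    the broker-side group coordinator rather than being passed in."""
    out = []
    for partition in assigned_partitions:
        for message in poll(partition):
            out.extend(task.process(message))
    return out

# Stand-in for a Kafka topic with two partitions.
topic = {0: ["a", "b"], 1: ["c"]}
results = run_task(UppercaseTask(), [0, 1], lambda p: topic[p])
```

In this shape, implementing the same interface in Ruby, Python, etc. reduces to porting the small task/run-loop surface, since the hard coordination work lives in the brokers.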
>
> Having thought about this, I am coming to the conclusion that a stream
> processor that is part of the Kafka project would be good for users, and
> thus a more successful project. However, the people with experience in
> stream processing systems are in the Samza community. This leads me to
> thinking that merging projects and communities might be a good idea: with
> the union of experience from both communities, we will probably build a
> better system that is better for users.
>
> Jakob advocated maintaining support for input sources other than Kafka.
> While I can totally see the need for a framework that does this, I think
> the need is pretty well satisfied by Storm, which already has spouts for
> Kafka, Kestrel, JMS, AMQP, Redis and beanstalkd (and perhaps more). I don't
> see much value in Samza attempting to catch up here, especially if Copycat
> will provide connectors to many systems by different means. On the other
> hand, my failed attempts to implement SystemConsumers for Kinesis and
> Postgres make me think that a stream processor that supports many different
> inputs is limited to a lowest-common-denominator model; if Samza supports
> only Kafka, I think it could support Kafka better than any other framework
> (by doing one thing and doing it well).
>
> Julian: not sure I understand your point 2 about departing from the vision
> of distributed processing. A library-ified Samza would still allow
> distributed processing, and (with a small amount of glue) could still be
> deployed to YARN or other clusters.
>
> So, in conclusion, I'm starting to agree with the approach that Jay has
> been advocating in this thread.
>
> Martin
>
>
> On 9 Jul 2015, at 15:32, Julian Hyde <jh...@apache.org> wrote:
>
> > Wow, what a great discussion. A brave discussion, since no project
> > wants to reduce its scope. And important, because "right-sizing"
> > technology components can help them win in the long run.
> >
> > I have a couple of let-me-play-devil's-advocate questions.
> >
> > 1. Community over code
> >
> > Let's look at this in terms of the Apache Way. The Apache Way
> > advocates "community over code", and as Jakob points out, the Samza
> > community is distinct from the Kafka community. It seems that we are
> > talking here about Samza-the-code.
> >
> > According to the Apache Way, what Samza-the-project should be doing is
> > what Samza-the-community is good at. Samza-the-code-as-it-is-today can
> > move to Kafka, stay in Samza, or be deleted if it has been superseded.
> >
> > Architectural discussions are important to have, and the Apache Way
> > gets in the way of good architecture sometimes. When we're thinking
> > about moving code, let's also think about the community of people
> > working on the code.
> >
> > Apache Phoenix is a good analogy. Phoenix is technically very closely
> > tied to HBase, but a distinct community, with different skill-sets.
> > (HBase, like Kafka, is "hard core", and not for everyone.) They have
> > also been good at re-examining their project scope and re-scoping
> > where necessary.
> >
> > 2. Architecture
> >
> > This proposal retreats from the grand vision of "distributed stream
> > management system" where not only storage is distributed but also
> > processing. There is no architectural piece that says "I need 10 JVMs
> > to process this CPU intensive standing query and I currently only have
> > 6." What projects, current or envisioned, would fit that gap? Is that
> > work a good fit for the Samza community?
> >
> > Julian
> >
> >
> >
> > On Wed, Jul 8, 2015 at 10:47 PM, Jordan Shaw <jor...@pubnub.com> wrote:
> >> I'm all for any optimizations that can be made to the Yarn workflow.
> >>
> >> I actually agree with Jakob in regard to the producers/consumers. I have
> >> spent some time writing consumers and producers for other transport
> >> abstractions, and overall I feel the current API abstractions in Samza are
> >> pretty good. There are some things that are sort of anomalous and catered
> >> more toward the Kafka model, but they are easy enough to work around, and
> >> I've been able to make other producers and consumers work that are nowhere
> >> near the same paradigm as Kafka.
> >>
> >> To Jay's point: although Kafka is great and does the streaming data
> >> paradigm very well, there is really no reason why a different transport
> >> application, implemented properly, wouldn't be able to stream data with
> >> the same effectiveness as Kafka, and that transport may suit the user's
> >> use case better or be more cost-effective than Kafka. For example, we had
> >> to decide if Kafka was worth the extra cost of running a ZooKeeper
> >> cluster, and if the scaling through partitioning was worth the operational
> >> overhead vs having a mesh network over ZeroMQ. After deciding that our use
> >> case would fit with Kafka fine, there were other challenges, like
> >> understanding how AWS EC2 SSDs behaved (AWS amortizes all disk I/O into
> >> random I/O, which is bad for Kafka).
> >>
> >> Thus, I would lean to the side of transport flexibility for a framework
> >> like Samza over binding to a transport medium like Kafka.
> >>
> >>
> >> On Wed, Jul 8, 2015 at 1:39 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >>
> >>> Good summary Jakob.
> >>>
> >>> WRT the general-purpose vs Kafka-specific question, I actually see it
> >>> slightly differently. Consider how Storm works as an example: there is a
> >>> data source (spout), which could be Kafka, a database, etc., and then
> >>> there is a transport (a Netty TCP thing, IIUC). Storm allows you to
> >>> process data from any source, but when it comes from a source they
> >>> always funnel it through their transport to get to the tasks/bolts. It
> >>> is natural to think of Kafka as the spout, but I think the better
> >>> analogy is actually that Kafka is the transport.
> >>>
> >>> It is really hard to make the transport truly pluggable because this is
> >>> what the tasks interact with, and you need to have guarantees about
> >>> delivery (and reprocessing), partitioning, atomicity of output,
> >>> ordering, etc. so your stream processing can get the right answer. From
> >>> my point of view, what this proposal says is that Kafka would be
> >>> non-pluggable as the *transport*.
> >>>
> >>> So in this proposal data would still come into and out of Kafka from a
> >>> wide variety of sources, but by requiring Kafka as the transport, the
> >>> interaction with the tasks will always look the same (a persistent,
> >>> partitioned log). So going back to the Storm analogy, it is something
> >>> like:
> >>>  Spout interface = Copycat
> >>>  Bolt interface = Samza
> >>>
> >>> This does obviously make Samza dependent on Kafka, but it doesn't mean
> >>> you wouldn't be processing data from all kinds of sources -- indeed,
> >>> that is the whole purpose. It just means that each of these data streams
> >>> would be available as a multi-subscriber Kafka topic to other systems,
> >>> applications, etc., not just for your job.
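Jay's "Kafka is the transport" contract could be illustrated with a toy sketch (an in-memory dict stands in for Kafka topics; topic names and functions are invented for illustration): every transformation reads from one topic and publishes to another, so intermediate streams stay multi-subscriber.

```python
# Toy stand-in for Kafka: a dict of topic name -> list of records. The
# contract is that every transformation's input *and* output is a topic,
# so derived streams are visible to any other consumer, not just this job.

topics = {"page-views": ["/home", "/about", "/home"]}

def transform(in_topic, out_topic, fn):
    """Read a whole (mock) topic, apply fn per record, publish results."""
    topics[out_topic] = [fn(record) for record in topics[in_topic]]

transform("page-views", "page-view-lengths", len)

# A completely separate job (or non-Samza system) can now independently
# consume the derived topic:
lengths = list(topics["page-view-lengths"])
```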
> >>>
> >>> If you think about how things are now, Samza already depends on a
> >>> partitioned, persistent, offset-addressable log with log
> >>> compaction... which, unsurprisingly, is exactly what Kafka provides --
> >>> so I don't think this is really a new dependency.
> >>>
> >>> Philosophically I think this makes sense too. To make a bunch of
> >>> programs fit together you have to standardize something. In this
> >>> proposal, what you are standardizing around is really Kafka's protocol
> >>> for streaming data and your data format. The transformations that
> >>> connect these streams can be done via Samza, Storm, Spark, standalone
> >>> Java or Python programs, etc., but the ultimate output and contract to
> >>> the rest of the organization/world will be the resulting Kafka topic.
> >>> Philosophically I think this kind of data- and protocol-based contract
> >>> is the right way to go, rather than saying that the contract is a
> >>> particular Java API and the stream/data is what is pluggable.
> >>>
> >>> -Jay
> >>>
> >>>
> >>>
> >>> On Wed, Jul 8, 2015 at 11:03 AM, Jakob Homan <jgho...@gmail.com> wrote:
> >>>
> >>>> Rewinding back to the beginning of this topic, there are effectively
> >>>> three proposals on the table:
> >>>>
> >>>> 1) Chris' ideas for a direction towards a 2.0 release with an emphasis
> >>>> on API and configuration simplification.  These ideas are based on lots
> >>>> of lessons learned from the 0.x branch and are worthy of a 2.0 label
> >>>> and breaking backwards compatibility.  I'm not sure I agree with all of
> >>>> them, but they're definitely worth pursuing.
> >>>>
> >>>> 2) Chris' alternative proposal, which goes beyond his first and is
> >>>> essentially a reboot of Samza to a more limited, entirely
> >>>> Kafka-focused approach.  Samza would cease being a general-purpose
> >>>> stream processing framework akin to (and an alternative to), say,
> >>>> Apache Storm, and would instead become a standalone complement to the
> >>>> Kafka project.
> >>>>
> >>>> 3) Jay's proposal, which goes even further, and suggests that the
> >>>> Kafka community would be better served by adding stream processing as
> >>>> a module to Kafka.  This is a perfectly valid approach, but since it's
> >>>> entirely confined to the Kafka project, doesn't really involve Samza.
> >>>> If the Kafka team were to go this route, there would be no obligation
> >>>> on the Samza team to shut down, disband, etc.
> >>>>
> >>>> This last bit is important because Samza and Kafka, while closely
> >>>> linked, are distinct communities.  The intersection of committers on
> >>>> both Kafka and Samza is three people out of a combined 18 committers
> >>>> across both projects.   Samza is a distinct community that shares
> >>>> quite a few users with Kafka, but is able to chart its own course.
> >>>>
> >>>> My own view is that Samza has had an amazing year and is taking off at
> >>>> a rapid rate.  It was only proposed for Incubator two years ago and is
> >>>> still very young. The original team at LinkedIn has left that company
> >>>> but the project has continued to grow via contributions both from
> >>>> LinkedIn and from without.  We've recently seen a significant uptick
> >>>> in discussion and bug reports.
> >>>>
> >>>> The API, deployment and configuration changes Chris suggests are good
> >>>> ideas, but I think there is still serious value in having a
> >>>> stand-alone general stream processing framework that supports input
> >>>> sources other than Kafka.  We've already had contributions for adding
> >>>> producer support to ElasticSearch and HDFS.  As more users come on
> >>>> board, I would expect them to contribute more consumers and producers.
> >>>>
> >>>> It's a bit of a chicken-and-egg problem; since the original team
> >>>> didn't have cycles to prioritize support for non-Kafka systems
> >>>> (kinesis, eventhub, twitter, flume, zeromq, etc.), Samza was less
> >>>> compelling than other stream processing frameworks that did have
> >>>> support and was therefore not used in those situations.  I'd love to
> >>>> see those added and the SystemConsumer/Producer APIs improved to
> >>>> fluently support them as well as Kafka.
> >>>>
> >>>> Martin had a question regarding the tight coupling between Hadoop HDFS
> >>>> and MapReduce (and YARN and Common).  This has been a problem for
> >>>> years and there have been several aborted attempts to split the
> >>>> projects out.  Each time there turned out to be a strong need for
> >>>> cross-cutting collaboration and so the effort was dropped.  Absent the
> >>>> third option above (Kafka adding stream support to itself directly), I
> >>>> would imagine something similar would play out here.
> >>>>
> >>>> We should get a feeling for which of the three proposals the Samza
> >>>> community is behind, technical details of each notwithstanding.  This
> >>>> would include not just the committers/PMC members, but also the users,
> >>>> contributors and lurkers.
> >>>>
> >>>> -Jakob
> >>>>
> >>>> On 8 July 2015 at 07:41, Ben Kirwin <b...@kirw.in> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Interesting stuff! Jumping in a bit late, but here goes...
> >>>>>
> >>>>> I'd definitely be excited about a slimmed-down and more Kafka-specific
> >>>>> Samza -- you don't seem to lose much functionality that people
> >>>>> actually use, and the gains in simplicity / code sharing seem
> >>>>> potentially very large. (I've spent a bunch of time peeling back those
> >>>>> layers of abstraction to get e.g. more control over message send
> >>>>> order, and working directly against Kafka's APIs would have been much
> >>>>> easier.) I also like the approach of letting Kafka code do the heavy
> >>>>> lifting and letting stream processing systems build on those -- good,
> >>>>> reusable implementations would be great for the whole
> >>>>> stream-processing ecosystem, and Samza in particular.
> >>>>>
> >>>>> On the other hand, I do hope that using Kafka's group membership /
> >>>>> partition assignment / etc. stays optional. As far as I can tell,
> >>>>> ~every major stream processing system that uses Kafka has chosen (or
> >>>>> switched to) 'static' partitioning, where each logical task consumes a
> >>>>> fixed set of partitions. When 'dynamically deploying' (a la Storm /
> >>>>> Mesos / YARN), the underlying system is already doing failure
> >>>>> detection and transferring work between hosts when machines go down,
> >>>>> so using Kafka's implementation is redundant at best -- and at worst,
> >>>>> the interaction between the two systems can make outages worse.
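The 'static' partitioning Ben describes reduces to a deterministic assignment function. A minimal sketch (hypothetical, not any framework's actual API):

```python
# Sketch of 'static' partitioning: each logical task owns a fixed,
# deterministic set of partitions, so no broker-side group rebalancing
# (and no interaction with the deployer's own failure handling) is needed.

def static_assignment(task_index, num_tasks, num_partitions):
    """Round-robin: task i always owns partitions i, i + num_tasks, ..."""
    return [p for p in range(num_partitions) if p % num_tasks == task_index]
```

With 2 tasks over 8 partitions, task 0 always owns partitions 0, 2, 4, 6; a task restarted by YARN/Mesos resumes exactly that set, with no group rebalance involved.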
> >>>>>
> >>>>> And thanks to Chris / Jay for getting this ball rolling. Exciting
> >>>>> times...
> >>>>>
> >>>>> On Tue, Jul 7, 2015 at 2:35 PM, Jay Kreps <j...@confluent.io> wrote:
> >>>>>> Hey Roger,
> >>>>>>
> >>>>>> I couldn't agree more. We spent a bunch of time talking to people,
> >>>>>> and that is exactly the stuff we heard time and again. What makes it
> >>>>>> hard, of course, is that there is some tension between compatibility
> >>>>>> with what's there now and making things better for new users.
> >>>>>>
> >>>>>> I also strongly agree with the importance of multi-language support.
> >>>>>> We are talking now about Java, but for application development use
> >>>>>> cases people want to work in whatever language they are using
> >>>>>> elsewhere. I think moving to a model where Kafka itself does the
> >>>>>> group membership, lifecycle control, and partition assignment has
> >>>>>> the advantage of putting all that complex stuff behind a clean API
> >>>>>> that the clients are already going to be implementing for their
> >>>>>> consumer, so the added functionality for stream processing beyond a
> >>>>>> consumer becomes very minor.
> >>>>>>
> >>>>>> -Jay
> >>>>>>
> >>>>>> On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <roger.hoo...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Metamorphosis...nice. :)
> >>>>>>>
> >>>>>>> This has been a great discussion.  As a user of Samza who's recently
> >>>>>>> integrated it into a relatively large organization, I just want to
> >>>>>>> add support to a few points already made.
> >>>>>>>
> >>>>>>> The biggest hurdles to adoption of Samza as it currently exists that
> >>>>>>> I've experienced are:
> >>>>>>> 1) YARN - YARN is overly complex in many environments where Puppet
> >>>>>>> would do just fine, but it was the only mechanism to get fault
> >>>>>>> tolerance.
> >>>>>>> 2) Configuration - I think I like the idea of configuring most of
> >>>>>>> the job in code rather than config files.  In general, I think the
> >>>>>>> goal should be to make it harder to make mistakes, especially of the
> >>>>>>> kind where the code expects something and the config doesn't match.
> >>>>>>> The current config is quite intricate and error-prone.  For example,
> >>>>>>> the application logic may depend on bootstrapping a topic, but
> >>>>>>> rather than asserting that in the code, you have to rely on getting
> >>>>>>> the config right.  Likewise with serdes: the Java representations
> >>>>>>> produced by various serdes (JSON, Avro, etc.) are not equivalent, so
> >>>>>>> you cannot just reconfigure a serde without changing the code.  It
> >>>>>>> would be nice for jobs to be able to assert what they expect from
> >>>>>>> their input topics in terms of partitioning.  This is getting a
> >>>>>>> little off topic, but I was even thinking about creating a "Samza
> >>>>>>> config linter" that would sanity-check a set of configs.  Especially
> >>>>>>> in organizations where config is managed by a different team than
> >>>>>>> the application developer, it's very hard to avoid config mistakes.
> >>>>>>> 3) Java/Scala-centric - for many teams (especially DevOps-type
> >>>>>>> folks), the pain of the Java toolchain (Maven, slow builds, weak
> >>>>>>> command line support, configuration over convention) really inhibits
> >>>>>>> productivity.  As more and more high-quality clients become
> >>>>>>> available for Kafka, I hope they'll follow Samza's model.  Not sure
> >>>>>>> how much it affects the proposals in this thread, but please
> >>>>>>> consider other languages in the ecosystem as well.  From what I've
> >>>>>>> heard, Spark has more Python users than Java/Scala.
> >>>>>>> (FYI, we added a Jython wrapper for the Samza API,
> >>>>>>> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> >>>>>>> and are working on a Yeoman generator,
> >>>>>>> https://github.com/Quantiply/generator-rico, for Jython/Samza
> >>>>>>> projects to alleviate some of the pain)
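Roger's idea of asserting expectations in code rather than relying on config could look roughly like this toy "config linter" (config keys and helper names are made up for illustration, not real Samza properties):

```python
# Toy config linter: the job declares what it expects from its inputs
# (e.g. that a topic is bootstrapped, or its partition count), and a
# linter-style check catches config/code mismatches before deployment.
# The "streams.<topic>.<key>" property names are invented for this sketch.

expectations = {
    "profiles": {"bootstrap": True, "partitions": 8},
}

def lint(config):
    """Return a list of human-readable mismatches between the job's
    declared expectations and the supplied config."""
    errors = []
    for topic, expect in expectations.items():
        for key, wanted in expect.items():
            actual = config.get(f"streams.{topic}.{key}")
            if actual != wanted:
                errors.append(
                    f"streams.{topic}.{key}: expected {wanted!r}, got {actual!r}"
                )
    return errors

bad_config = {"streams.profiles.bootstrap": False, "streams.profiles.partitions": 8}
problems = lint(bad_config)
```

The same check could run in CI, which would help in the organizations Roger mentions where config is owned by a different team than the application code.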
> >>>>>>>
> >>>>>>> I also want to underscore Jay's point about improving the user
> >>>>>>> experience.  That's a very important factor for adoption.  I think
> >>>>>>> the goal should be to make Samza as easy to get started with as
> >>>>>>> something like Logstash.  Logstash is vastly inferior to Samza in
> >>>>>>> terms of capabilities, but it's easy to get started with, and that
> >>>>>>> makes a big difference.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Roger
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales <
> >>>>>>> g...@apache.org> wrote:
> >>>>>>>
> >>>>>>>> Forgot to add. On the naming issues, Kafka Metamorphosis is a
> >>>>>>>> clear winner :)
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Gianmarco
> >>>>>>>>
> >>>>>>>> On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <g...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> @Martin, thanks for your comments.
> >>>>>>>>> Maybe I'm missing some important point, but I think coupling the
> >>>>>>>>> releases is actually a *good* thing.
> >>>>>>>>> To make an example: would it be better if the MR and HDFS
> >>>>>>>>> components of Hadoop had different release schedules?
> >>>>>>>>>
> >>>>>>>>> Actually, keeping the discussion in a single place would make
> >>>>>>>>> agreeing on releases (and backwards compatibility) much easier, as
> >>>>>>>>> everybody would be responsible for the whole codebase.
> >>>>>>>>>
> >>>>>>>>> That said, I like the idea of absorbing samza-core as a
> >>>>>>>>> sub-project and leaving the fancy stuff separate.
> >>>>>>>>> It probably gives 90% of the benefits we have been discussing here.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Gianmarco
> >>>>>>>>>
> >>>>>>>>> On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey Martin,
> >>>>>>>>>>
> >>>>>>>>>> I agree coupling release schedules is a downside.
> >>>>>>>>>>
> >>>>>>>>>> Definitely we can try to solve some of the integration problems
> >>>>>>>>>> in Confluent Platform or in other distributions. But I think this
> >>>>>>>>>> ends up being really shallow. I guess I feel that to really get a
> >>>>>>>>>> good user experience, the two systems have to kind of feel like
> >>>>>>>>>> part of the same thing, and you can't really add that in later --
> >>>>>>>>>> you can put both in the same downloadable tar file, but it
> >>>>>>>>>> doesn't really give a very cohesive feeling. I agree that
> >>>>>>>>>> ultimately any of the project stuff is as much social and naming
> >>>>>>>>>> as anything else -- theoretically, two totally independent
> >>>>>>>>>> projects could work to tightly align. In practice this seems to
> >>>>>>>>>> be quite difficult, though.
> >>>>>>>>>>
> >>>>>>>>>> For the frameworks -- totally agree it would be good to maintain
> >>>>>>>>>> the framework support within the project. In some cases there may
> >>>>>>>>>> not be too much there, since the integration gets lighter, but I
> >>>>>>>>>> think whatever stubs you need should be included. So no, I
> >>>>>>>>>> definitely wasn't trying to imply dropping support for these
> >>>>>>>>>> frameworks, just making the integration lighter by separating
> >>>>>>>>>> process management from partition management.
> >>>>>>>>>>
> >>>>>>>>>> You raise two good points we would have to figure out if we went
> >>>>>>>>>> down the alignment path:
> >>>>>>>>>> 1. With respect to the name, yeah, I think the first question is
> >>>>>>>>>> whether some "re-branding" would be worth it. If so, then I think
> >>>>>>>>>> we can have a big thread on the name. I'm definitely not set on
> >>>>>>>>>> Kafka Streaming or Kafka Streams; I was just using them to be
> >>>>>>>>>> kind of illustrative. I agree with your critique of these names,
> >>>>>>>>>> though I think people would get the idea.
> >>>>>>>>>> 2. Yeah, you also raise a good point about how to "factor" it.
> >>>>>>>>>> Here are the options I see (I could get enthusiastic about any of
> >>>>>>>>>> them):
> >>>>>>>>>>   a. One repo for both Kafka and Samza
> >>>>>>>>>>   b. Two repos, retaining the current separation
> >>>>>>>>>>   c. Two repos, with the equivalent of samza-api and samza-core
> >>>>>>>>>> absorbed almost like a third client
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> -Jay
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <mar...@kleppmann.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Ok, thanks for the clarifications. Just a few follow-up
> >>>>>>>>>>> comments.
> >>>>>>>>>>>
> >>>>>>>>>>> - I see the appeal of merging with Kafka or becoming a
> >>>>>>>>>>> subproject: the reasons you mention are good. The risk I see is
> >>>>>>>>>>> that release schedules become coupled to each other, which can
> >>>>>>>>>>> slow everyone down, and large projects with many contributors
> >>>>>>>>>>> are harder to manage. (Jakob, can you speak from experience,
> >>>>>>>>>>> having seen a wider range of Hadoop ecosystem projects?)
> >>>>>>>>>>>
> >>>>>>>>>>> Some of the goals of a better unified developer experience could
> >>>>>>>>>>> also be solved by integrating Samza nicely into a Kafka
> >>>>>>>>>>> distribution (such as Confluent's). I'm not against merging
> >>>>>>>>>>> projects if we decide that's the way to go, just pointing out
> >>>>>>>>>>> the same goals can perhaps also be achieved in other ways.
> >>>>>>>>>>>
> >>>>>>>>>>> - With regard to dropping the YARN dependency: are you proposing
> >>>>>>>>>>> that Samza doesn't give any help to people wanting to run on
> >>>>>>>>>>> YARN/Mesos/AWS/etc.? So the docs would basically have a link to
> >>>>>>>>>>> Slider and nothing else? Or would we maintain integrations with
> >>>>>>>>>>> a bunch of popular deployment methods (e.g. the necessary glue
> >>>>>>>>>>> and shell scripts to make Samza work with Slider)?
> >>>>>>>>>>>
> >>>>>>>>>>> I absolutely think it's a good idea to have the "as a library"
> >>>>>>>>>>> and "as a process" (using Yi's taxonomy) options for people who
> >>>>>>>>>>> want them, but I think there should also be a low-friction path
> >>>>>>>>>>> for common "as a service" deployment methods, for which we
> >>>>>>>>>>> probably need to maintain integrations.
> >>>>>>>>>>>
> >>>>>>>>>>> - Project naming: "Kafka Streams" seems odd to me, because Kafka
> >>>>>>>>>>> is all about streams already. Perhaps "Kafka Transformers" or
> >>>>>>>>>>> "Kafka Filters" would be more apt?
> >>>>>>>>>>>
> >>>>>>>>>>> One suggestion: perhaps the core of Samza (stream transformation
> >>>>>>>>>>> with state management -- i.e. the "Samza as a library" bit)
> >>>>>>>>>>> could become part of Kafka, while higher-level tools such as
> >>>>>>>>>>> streaming SQL and integrations with deployment frameworks remain
> >>>>>>>>>>> in a separate project? In other words, Kafka would absorb the
> >>>>>>>>>>> proven, stable core of Samza, which would become the "third
> >>>>>>>>>>> Kafka client" mentioned early in this thread. The Samza project
> >>>>>>>>>>> would then target that third Kafka client as its base API, and
> >>>>>>>>>>> the project would be freed up to explore more experimental new
> >>>>>>>>>>> horizons.
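The "proven, stable core" Martin refers to -- stateful stream transformation with local state -- might be sketched like this (a plain dict stands in for what would be a changelog-backed local store in Samza; all names are invented for illustration):

```python
# Rough sketch of stateful per-message transformation with a local state
# store. In Samza the store would be changelog-backed (e.g. RocksDB plus a
# compacted Kafka topic) so state survives failover; here it is just a dict.

class CountByKey:
    def __init__(self):
        self.store = {}  # local state; restored from a changelog on failover

    def process(self, key):
        """Update the running count for this key and emit it downstream."""
        self.store[key] = self.store.get(key, 0) + 1
        return (key, self.store[key])

task = CountByKey()
out = [task.process(k) for k in ["a", "b", "a"]]
```

It is roughly this small, state-plus-transform surface that would travel to Kafka as the "third client", while everything fancier stays outside.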
> >>>>>>>>>>>
> >>>>>>>>>>> Martin
> >>>>>>>>>>>
> >>>>>>>>>>> On 6 Jul 2015, at 18:51, Jay Kreps <jay.kr...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hey Martin,
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the YARN/Mesos/etc. decoupling, I actually don't think it
> >>>>>>>>>>>> ties our hands at all; all it does is refactor things. The
> >>>>>>>>>>>> division of responsibility is that Samza core is responsible
> >>>>>>>>>>>> for task lifecycle, state, and partition management (using the
> >>>>>>>>>>>> Kafka co-ordinator), but it is NOT responsible for packaging,
> >>>>>>>>>>>> configuration, deployment, or execution of processes. The
> >>>>>>>>>>>> problem of packaging and starting these processes is
> >>>>>>>>>>>> framework/environment-specific. This leaves individual
> >>>>>>>>>>>> frameworks to be as fancy or vanilla as they like. So you can
> >>>>>>>>>>>> get simple stateless support in YARN, Mesos, etc. using their
> >>>>>>>>>>>> off-the-shelf app framework (Slider, Marathon, etc.). These are
> >>>>>>>>>>>> well known by people and have nice UIs and a lot of
> >>>>>>>>>>>> flexibility. I don't think they have node affinity as a
> >>>>>>>>>>>> built-in option (though I could be wrong). So if we want that,
> >>>>>>>>>>>> we can either wait for them to add it or do a custom framework
> >>>>>>>>>>>> to add that feature (as now). Obviously if you manage things
> >>>>>>>>>>>> with old-school ops tools (Puppet/Chef/etc.) you get locality
> >>>>>>>>>>>> easily. The nice thing, though, is that all the Samza "business
> >>>>>>>>>>>> logic" around partition management and fault tolerance is in
> >>>>>>>>>>>> Samza core, so it is shared across frameworks, and the
> >>>>>>>>>>>> framework-specific bit is just whether it is smart enough to
> >>>>>>>>>>>> try to get the same host when a job is restarted.
> >>>>>>>>>>>>
> >>>>>>>>>>>> With respect to the Kafka-alignment, yeah I think the goal
> >>>> would
> >>>>>>> be
> >>>>>>>>>> (a)
> >>>>>>>>>>>> actually get better alignment in user experience, and (b)
> >>>> express
> >>>>>>>>>> this in
> >>>>>>>>>>>> the naming and project branding. Specifically:
> >>>>>>>>>>>> 1. Website/docs, it would be nice for the "transformation"
> >>>> api to
> >>>>>>> be
> >>>>>>>>>>>> discoverable in the main Kafka docs--i.e. be able to explain
> >>>> when
> >>>>>>> to
> >>>>>>>>>> use
> >>>>>>>>>>>> the consumer and when to use the stream processing
> >>>> functionality
> >>>>>>> and
> >>>>>>>>>> lead
> >>>>>>>>>>>> people into that experience.
> >>>>>>>>>>>> 2. Align releases so if you get Kafkza 1.4.2 (or whatever)
> >>>> that
> >>>>>>> has
> >>>>>>>>>> both
> >>>>>>>>>>>> Kafka and the stream processing part and they actually work
> >>>>>>>> together.
> >>>>>>>>>>>> 3. Unify the programming experience so the client and Samza
> >>>> api
> >>>>>>>> share
> >>>>>>>>>>>> config/monitoring/naming/packaging/etc.
> >>>>>>>>>>>>
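A minimal sketch of what point 3 could mean in practice. This is purely illustrative: `StreamingConfig` here is a hypothetical stand-in, not an actual Kafka or Samza API. The point is just that the transformation layer would read the exact same `Properties` keys the Kafka clients already use, so there is only one config namespace to learn.

```java
import java.util.Properties;

// Hypothetical sketch only: a streaming config that simply wraps the same
// Properties object the Kafka producer/consumer clients are configured with,
// so the "transformation" layer introduces no second config namespace.
public class UnifiedConfigSketch {

    // Stand-in for what a shared StreamingConfig wrapper might look like.
    static class StreamingConfig {
        private final Properties props;
        StreamingConfig(Properties props) { this.props = props; }
        String get(String key) { return props.getProperty(key); }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Same keys the Kafka clients already use -- configured once.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("client.id", "my-transform-job");

        StreamingConfig streaming = new StreamingConfig(props);
        System.out.println(streaming.get("bootstrap.servers"));
    }
}
```

The same would go for metrics and monitoring: one reporting path rather than two parallel ones to wire up.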
> >>>>>>>>>>>> I think sub-projects keep separate committers and can have a
> >>>>>>>> separate
> >>>>>>>>>>> repo,
> >>>>>>>>>>>> but I'm actually not really sure (I can't find a definition
> >>>> of a
> >>>>>>>>>>> subproject
> >>>>>>>>>>>> in Apache).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Basically at a high-level you want the experience to "feel"
> >>>> like a
> >>>>>>>>>> single
> >>>>>>>>>>>> system, not two relatively independent things that are kind of
> >>>>>>>>>>>> awkwardly glued together.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think if we did that then having naming or branding like
> >>>>>>>>>>>> "kafka streaming" or "kafka streams" or something like that would
> >>>>>>>>>>>> actually do a good job of conveying what it is. I do think this
> >>>>>>>>>>>> would help adoption quite
> >>>>>>>>>>>> a lot as it would correctly convey that using Kafka
> >>> Streaming
> >>>> with
> >>>>>>>>>> Kafka
> >>>>>>>>>>> is
> >>>>>>>>>>>> a fairly seamless experience and Kafka is pretty heavily
> >>>> adopted
> >>>>>>> at
> >>>>>>>>>> this
> >>>>>>>>>>>> point.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Fwiw we actually considered this model originally when open
> >>>>>>> sourcing
> >>>>>>>>>>> Samza,
> >>>>>>>>>>>> however at that time Kafka was relatively unknown and we
> >>>> decided
> >>>>>>> not
> >>>>>>>>>> to
> >>>>>>>>>>> do
> >>>>>>>>>>>> it since we felt it would be limiting. From my point of view,
> >>>>>>>>>>>> three things have changed: (1) Kafka is now really heavily used for
> >>>>>>> stream
> >>>>>>>>>>>> processing, (2) we learned that abstracting out the stream
> >>>> well is
> >>>>>>>>>>>> basically impossible, (3) we learned it is really hard to
> >>>> keep the
> >>>>>>>> two
> >>>>>>>>>>>> things feeling like a single product.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Jay
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <
> >>>>>>>>>> mar...@kleppmann.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Lots of good thoughts here.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree with the general philosophy of tying Samza more
> >>>> firmly to
> >>>>>>>>>> Kafka.
> >>>>>>>>>>>>> After I spent a while looking at integrating other message
> >>>>>>> brokers
> >>>>>>>>>> (e.g.
> >>>>>>>>>>>>> Kinesis) with SystemConsumer, I came to the conclusion that
> >>>>>>>>>>> SystemConsumer
> >>>>>>>>>>>>> tacitly assumes a model so much like Kafka's that pretty
> >>> much
> >>>>>>>> nobody
> >>>>>>>>>> but
> >>>>>>>>>>>>> Kafka actually implements it. (Databus is perhaps an
> >>>> exception,
> >>>>>>> but
> >>>>>>>>>> it
> >>>>>>>>>>>>> isn't widely used outside of LinkedIn.) Thus, making Samza
> >>>> fully
> >>>>>>>>>>> dependent
> >>>>>>>>>>>>> on Kafka acknowledges that the system-independence was
> >>> never
> >>>> as
> >>>>>>>> real
> >>>>>>>>>> as
> >>>>>>>>>>> we
> >>>>>>>>>>>>> perhaps made it out to be. The gains of code reuse are
> >>> real.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The idea of decoupling Samza from YARN has also always been
> >>>>>>>>>> appealing to
> >>>>>>>>>>>>> me, for various reasons already mentioned in this thread.
> >>>>>>> Although
> >>>>>>>>>>> making
> >>>>>>>>>>>>> Samza jobs deployable on anything (YARN/Mesos/AWS/etc)
> >>> seems
> >>>>>>>>>> laudable,
> >>>>>>>>>>> I am
> >>>>>>>>>>>>> a little concerned that it will restrict us to a lowest
> >>>> common
> >>>>>>>>>>> denominator.
> >>>>>>>>>>>>> For example, would host affinity (SAMZA-617) still be
> >>>> possible?
> >>>>>>> For
> >>>>>>>>>> jobs
> >>>>>>>>>>>>> with large amounts of state, I think SAMZA-617 would be a
> >>> big
> >>>>>>> boon,
> >>>>>>>>>>> since
> >>>>>>>>>>>>> restoring state off the changelog on every single restart
> >>> is
> >>>>>>>> painful,
> >>>>>>>>>>> due
> >>>>>>>>>>>>> to long recovery times. It would be a shame if the
> >>> decoupling
> >>>>>>> from
> >>>>>>>>>> YARN
> >>>>>>>>>>>>> made host affinity impossible.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Jay, a question about the proposed API for instantiating a
> >>>> job in
> >>>>>>>>>> code
> >>>>>>>>>>>>> (rather than a properties file): when submitting a job to a
> >>>>>>>> cluster,
> >>>>>>>>>> is
> >>>>>>>>>>> the
> >>>>>>>>>>>>> idea that the instantiation code runs on a client
> >>> somewhere,
> >>>>>>> which
> >>>>>>>>>> then
> >>>>>>>>>>>>> pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or
> >>> does
> >>>> that
> >>>>>>>>>> code
> >>>>>>>>>>> run
> >>>>>>>>>>>>> on each container that is part of the job (in which case,
> >>> how
> >>>>>>> does
> >>>>>>>>>> the
> >>>>>>>>>>> job
> >>>>>>>>>>>>> submission to the cluster work)?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree with Garry that it doesn't feel right to make a 1.0
> >>>>>>> release
> >>>>>>>>>>> with a
> >>>>>>>>>>>>> plan for it to be immediately obsolete. So if this is going
> >>>> to
> >>>>>>>>>> happen, I
> >>>>>>>>>>>>> think it would be more honest to stick with 0.* version
> >>>> numbers
> >>>>>>>> until
> >>>>>>>>>>> the
> >>>>>>>>>>>>> library-ified Samza has been implemented, is stable and
> >>>> widely
> >>>>>>>> used.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Should the new Samza be a subproject of Kafka? There is
> >>>> precedent
> >>>>>>>> for
> >>>>>>>>>>>>> tight coupling between different Apache projects (e.g.
> >>>> Curator
> >>>>>>> and
> >>>>>>>>>>>>> Zookeeper, or Slider and YARN), so I think remaining
> >>> separate
> >>>>>>> would
> >>>>>>>>>> be
> >>>>>>>>>>> ok.
> >>>>>>>>>>>>> Even if Samza is fully dependent on Kafka, there is enough
> >>>>>>>> substance
> >>>>>>>>>> in
> >>>>>>>>>>>>> Samza that it warrants being a separate project. An
> >>> argument
> >>>> in
> >>>>>>>>>> favour
> >>>>>>>>>>> of
> >>>>>>>>>>>>> merging would be if we think Kafka has a much stronger
> >>> "brand
> >>>>>>>>>> presence"
> >>>>>>>>>>>>> than Samza; I'm ambivalent on that one. If the Kafka
> >>> project
> >>>> is
> >>>>>>>>>> willing
> >>>>>>>>>>> to
> >>>>>>>>>>>>> endorse Samza as the "official" way of doing stateful
> >>> stream
> >>>>>>>>>>>>> transformations, that would probably have much the same
> >>>> effect as
> >>>>>>>>>>>>> re-branding Samza as "Kafka Stream Processors" or suchlike.
> >>>> Close
> >>>>>>>>>>>>> collaboration between the two projects will be needed in
> >>> any
> >>>>>>> case.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From a project management perspective, I guess the "new
> >>>> Samza"
> >>>>>>>> would
> >>>>>>>>>>> have
> >>>>>>>>>>>>> to be developed on a branch alongside ongoing maintenance
> >>> of
> >>>> the
> >>>>>>>>>> current
> >>>>>>>>>>>>> line of development? I think it would be important to
> >>>> continue
> >>>>>>>>>>> supporting
> >>>>>>>>>>>>> existing users, and provide a graceful migration path to
> >>> the
> >>>> new
> >>>>>>>>>>> version.
> >>>>>>>>>>>>> Leaving the current versions unsupported and forcing people
> >>>> to
> >>>>>>>>>> rewrite
> >>>>>>>>>>>>> their jobs would send a bad signal.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Martin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io>
> >>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hey Garry,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Yeah that's super frustrating. I'd be happy to chat more
> >>>> about
> >>>>>>>> this
> >>>>>>>>>> if
> >>>>>>>>>>>>>> you'd be interested. I think Chris and I started with the
> >>>> idea
> >>>>>>> of
> >>>>>>>>>> "what
> >>>>>>>>>>>>>> would it take to make Samza a kick-ass ingestion tool" but
> >>>>>>>>>> ultimately
> >>>>>>>>>>> we
> >>>>>>>>>>>>>> kind of came around to the idea that ingestion and
> >>>>>>> transformation
> >>>>>>>>>> had
> >>>>>>>>>>>>>> pretty different needs and coupling the two made things
> >>>> hard.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For what it's worth I think copycat (KIP-26) actually will
> >>>> do
> >>>>>>> what
> >>>>>>>>>> you
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>>> looking for.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With regard to your point about slider, I don't
> >>> necessarily
> >>>>>>>>>> disagree.
> >>>>>>>>>>>>> But I
> >>>>>>>>>>>>>> think getting good YARN support is quite doable and I
> >>> think
> >>>> we
> >>>>>>> can
> >>>>>>>>>> make
> >>>>>>>>>>>>>> that work well. I think the issue this proposal solves is
> >>>> that
> >>>>>>>>>>>>> technically
> >>>>>>>>>>>>>> it is pretty hard to support multiple cluster management
> >>>> systems
> >>>>>>>> the
> >>>>>>>>>>> way
> >>>>>>>>>>>>>> things are now, you need to write an "app master" or
> >>>> "framework"
> >>>>>>>> for
> >>>>>>>>>>> each
> >>>>>>>>>>>>>> and they are all a little different so testing is really
> >>>> hard.
> >>>>>>> In
> >>>>>>>>>> the
> >>>>>>>>>>>>>> absence of this we have been stuck with just YARN which
> >>> has
> >>>>>>>>>> fantastic
> >>>>>>>>>>>>>> penetration in the Hadoopy part of the org, but zero
> >>>> penetration
> >>>>>>>>>>>>> elsewhere.
> >>>>>>>>>>>>>> Given the huge amount of work being put in to slider,
> >>>> marathon,
> >>>>>>>> aws
> >>>>>>>>>>>>>> tooling, not to mention the umpteen related packaging
> >>>>>>> technologies
> >>>>>>>>>>> people
> >>>>>>>>>>>>>> want to use (Docker, Kubernetes, various cloud-specific
> >>>> deploy
> >>>>>>>>>> tools,
> >>>>>>>>>>>>> etc)
> >>>>>>>>>>>>>> I really think it is important to get this right.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Jay
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <
> >>>>>>>>>>>>>> g.turking...@improvedigital.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think the question below re does Samza become a
> >>>> sub-project
> >>>>>>> of
> >>>>>>>>>> Kafka
> >>>>>>>>>>>>>>> highlights the broader point around migration. Chris
> >>>> mentions
> >>>>>>>>>> Samza's
> >>>>>>>>>>>>>>> maturity is heading towards a v1 release but I'm not sure
> >>>> it
> >>>>>>>> feels
> >>>>>>>>>>>>> right to
> >>>>>>>>>>>>>>> launch a v1 then immediately plan to deprecate most of
> >>> it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> From a selfish perspective I have some guys who have
> >>>> started
> >>>>>>>>>> working
> >>>>>>>>>>>>> with
> >>>>>>>>>>>>>>> Samza and building some new consumers/producers was next
> >>>> up.
> >>>>>>>> Sounds
> >>>>>>>>>>> like
> >>>>>>>>>>>>>>> that is absolutely not the direction to go. I need to
> >>> look
> >>>> into
> >>>>>>>> the
> >>>>>>>>>>> KIP
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>>> more detail but for me the attractiveness of adding new
> >>>> Samza
> >>>>>>>>>>>>>>> consumer/producers -- even if yes all they were doing was
> >>>>>>> really
> >>>>>>>>>>> getting
> >>>>>>>>>>>>>>> data into and out of Kafka --  was to avoid  having to
> >>>> worry
> >>>>>>>> about
> >>>>>>>>>> the
> >>>>>>>>>>>>>>> lifecycle management of external clients. If there is a
> >>>> generic
> >>>>>>>>>> Kafka
> >>>>>>>>>>>>>>> ingress/egress layer that I can plug a new connector into
> >>>> and
> >>>>>>>> have
> >>>>>>>>>> a
> >>>>>>>>>>>>> lot of
> >>>>>>>>>>>>>>> the heavy lifting re scale and reliability done for me
> >>>> then it
> >>>>>>>>>> gives
> >>>>>>>>>>> me
> >>>>>>>>>>>>> all
> >>>>>>>>>>>>>>> the benefits that pushing new consumers/producers would. If not then it
> >>>>>>>>>> complicates
> >>>>>>>>>>> my
> >>>>>>>>>>>>>>> operational deployments.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Which is similar to my other question with the proposal
> >>> --
> >>>> if
> >>>>>>> we
> >>>>>>>>>>> build a
> >>>>>>>>>>>>>>> fully available/stand-alone Samza plus the requisite
> >>> shims
> >>>> to
> >>>>>>>>>>> integrate
> >>>>>>>>>>>>>>> with Slider etc I suspect the former may be a lot more
> >>> work
> >>>>>>> than
> >>>>>>>> we
> >>>>>>>>>>>>> think.
> >>>>>>>>>>>>>>> We may make it much easier for a newcomer to get
> >>> something
> >>>>>>>> running
> >>>>>>>>>> but
> >>>>>>>>>>>>>>> having them step up and get a reliable production
> >>>> deployment
> >>>>>>> may
> >>>>>>>>>> still
> >>>>>>>>>>>>>>> dominate mailing list  traffic, if for different reasons
> >>>> than
> >>>>>>>>>> today.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Don't get me wrong -- I'm comfortable with making the
> >>> Samza
> >>>>>>>>>> dependency
> >>>>>>>>>>>>> on
> >>>>>>>>>>>>>>> Kafka much more explicit and I absolutely see the
> >>>> benefits  in
> >>>>>>>> the
> >>>>>>>>>>>>>>> reduction of duplication and clashing
> >>>>>>> terminologies/abstractions
> >>>>>>>>>> that
> >>>>>>>>>>>>>>> Chris/Jay describe. Samza as a library would likely be a
> >>>> very
> >>>>>>>> nice
> >>>>>>>>>>> tool
> >>>>>>>>>>>>> to
> >>>>>>>>>>>>>>> add to the Kafka ecosystem. I just have the concerns
> >>> above
> >>>> re
> >>>>>>> the
> >>>>>>>>>>>>>>> operational side.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Garry
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>>>> From: Gianmarco De Francisci Morales [mailto:
> >>>> g...@apache.org]
> >>>>>>>>>>>>>>> Sent: 02 July 2015 12:56
> >>>>>>>>>>>>>>> To: dev@samza.apache.org
> >>>>>>>>>>>>>>> Subject: Re: Thoughts and obesrvations on Samza
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Very interesting thoughts.
> >>>>>>>>>>>>>>> From outside, I have always perceived Samza as a
> >>> computing
> >>>>>>> layer
> >>>>>>>>>> over
> >>>>>>>>>>>>>>> Kafka.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The question, maybe a bit provocative, is "should Samza
> >>> be
> >>>> a
> >>>>>>>>>>> sub-project
> >>>>>>>>>>>>>>> of Kafka then?"
> >>>>>>>>>>>>>>> Or does it make sense to keep it as a separate project
> >>>> with a
> >>>>>>>>>> separate
> >>>>>>>>>>>>>>> governance?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Gianmarco
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Overall, I agree to couple with Kafka more tightly.
> >>>> Because
> >>>>>>>> Samza
> >>>>>>>>>> de
> >>>>>>>>>>>>>>>> facto is based on Kafka, and it should leverage what
> >>> Kafka
> >>>>>>> has.
> >>>>>>>> At
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>> same time, Kafka does not need to reinvent what Samza
> >>>> already
> >>>>>>>>>> has. I
> >>>>>>>>>>>>>>>> also like the idea of separating the ingestion and
> >>>>>>>> transformation.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> But it is a little difficult for me to imagine how the Samza
> >>> Samza
> >>>>>>> will
> >>>>>>>>>> look
> >>>>>>>>>>>>>>> like.
> >>>>>>>>>>>>>>>> And I feel Chris and Jay have a little difference in
> >>>> terms of
> >>>>>>>> how
> >>>>>>>>>>>>>>>> Samza should look like.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> *** Will it look like what Jay's code shows (A client of
> >>>>>>> Kafka)
> >>>>>>>> ?
> >>>>>>>>>> And
> >>>>>>>>>>>>>>>> user's application code calls this client?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. If we make Samza be a library of Kafka (like what the
> >>>> code
> >>>>>>>>>> shows),
> >>>>>>>>>>>>>>>> how do we implement auto-balance and fault-tolerance?
> >>> Are
> >>>> they
> >>>>>>>>>> taken
> >>>>>>>>>>>>>>>> care by the Kafka broker or other mechanism, such as
> >>>> "Samza
> >>>>>>>>>> worker"
> >>>>>>>>>>>>>>>> (just make up the name) ?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2. What about other features, such as auto-scaling,
> >>> shared
> >>>>>>>> state,
> >>>>>>>>>>>>>>>> monitoring?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> *** If we have Samza standalone, (is this what Chris
> >>>>>>> suggests?)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 1. we still need to ingest data from Kafka and produce to
> >>> to
> >>>> it.
> >>>>>>>>>> Then it
> >>>>>>>>>>>>>>>> becomes the same as what Samza looks like now, except it
> >>>> does
> >>>>>>>> not
> >>>>>>>>>>> rely
> >>>>>>>>>>>>>>>> on Yarn anymore.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2. if it is standalone, how can it leverage Kafka's
> >>>> metrics,
> >>>>>>>> logs,
> >>>>>>>>>>>>>>>> etc? Use Kafka code as the dependency?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Fang, Yan
> >>>>>>>>>>>>>>>> yanfang...@gmail.com
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <
> >>>>>>>> wangg...@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Read through the code example and it looks good to me.
> >>> A
> >>>> few
> >>>>>>>>>>>>>>>>> thoughts regarding deployment:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Today Samza deploys as executable runnable like:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> deploy/samza/bin/run-job.sh --config-factory=...
> >>>>>>>>>>>>>>> --config-path=file://...
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> And this proposal advocate for deploying Samza more as
> >>>>>>> embedded
> >>>>>>>>>>>>>>>>> libraries in user application code (ignoring the
> >>>> terminology
> >>>>>>>>>> since
> >>>>>>>>>>>>>>>>> it is not the
> >>>>>>>>>>>>>>>> same
> >>>>>>>>>>>>>>>>> as the prototype code):
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> StreamTask task = new MyStreamTask(configs);
> >>>>>>>>>>>>>>>>> Thread thread = new Thread(task);
> >>>>>>>>>>>>>>>>> thread.start();
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think both of these deployment modes are important
> >>> for
> >>>>>>>>>> different
> >>>>>>>>>>>>>>>>> types
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> users. That said, I think making Samza purely
> >>> standalone
> >>>> is
> >>>>>>>> still
> >>>>>>>>>>>>>>>>> sufficient for either runnable or library modes.
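The two modes Guozhang describes can be sketched in a few lines (illustrative stand-ins only, not the prototype's classes): the task is the same; only who owns the entry point differs.

```java
// Illustrative sketch of the two deployment modes: the task class is the
// same; the application either embeds it on a thread it owns (library mode)
// or a shell script launches a main() wrapper (runnable mode). Names made up.
public class TwoModesSketch {

    // Trivial stand-in for a stream task.
    static class MyTask implements Runnable {
        volatile boolean done = false;
        public void run() { done = true; } // a real task would loop over messages
    }

    // Library mode: the application embeds the task on a thread it owns.
    static boolean embedAndWait() {
        MyTask task = new MyTask();
        Thread thread = new Thread(task);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return task.done;
    }

    // Runnable mode: a run-job.sh style script would just invoke this main().
    public static void main(String[] args) {
        System.out.println(embedAndWait());
    }
}
```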
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <
> >>>>>>> j...@confluent.io>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Looks like gmail mangled the code example, it was
> >>>> supposed
> >>>>>>> to
> >>>>>>>>>> look
> >>>>>>>>>>>>>>>>>> like
> >>>>>>>>>>>>>>>>>> this:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Properties props = new Properties();
> >>>>>>>>>>>>>>>>>> props.put("bootstrap.servers", "localhost:4242");
> >>>>>>>>>>>>>>>>>> StreamingConfig config = new StreamingConfig(props);
> >>>>>>>>>>>>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> >>>>>>>>>>>>>>>>>> config.processor(ExampleStreamProcessor.class);
> >>>>>>>>>>>>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> >>>>>>>>>>>>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> >>>>>>>>>>>>>>>>>> container.run();
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> -Jay
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <
> >>>>>>> j...@confluent.io
> >>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This came out of some conversations Chris and I were
> >>>> having
> >>>>>>>>>>>>>>>>>>> around
> >>>>>>>>>>>>>>>>>> whether
> >>>>>>>>>>>>>>>>>>> it would make sense to use Samza as a kind of data
> >>>>>>> ingestion
> >>>>>>>>>>>>>>>> framework
> >>>>>>>>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> Kafka (which ultimately lead to KIP-26 "copycat").
> >>> This
> >>>>>>> kind
> >>>>>>>> of
> >>>>>>>>>>>>>>>>> combined
> >>>>>>>>>>>>>>>>>>> with complaints around config and YARN and the
> >>>> discussion
> >>>>>>>>>> around
> >>>>>>>>>>>>>>>>>>> how
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> best do a standalone mode.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> So the thought experiment was, given that Samza was
> >>>>>>> basically
> >>>>>>>>>>>>>>>>>>> already totally Kafka specific, what if you just
> >>>> embraced
> >>>>>>>> that
> >>>>>>>>>>>>>>>>>>> and turned it
> >>>>>>>>>>>>>>>>> into
> >>>>>>>>>>>>>>>>>>> something less like a heavyweight framework and more
> >>>> like a
> >>>>>>>>>>>>>>>>>>> third
> >>>>>>>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>>> client--a kind of "producing consumer" with state
> >>>>>>> management
> >>>>>>>>>>>>>>>>> facilities.
> >>>>>>>>>>>>>>>>>>> Basically a library. Instead of a complex stream
> >>>> processing
> >>>>>>>>>>>>>>>>>>> framework
> >>>>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> would actually be a very simple thing, not much more
> >>>>>>>>>> complicated
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> use
> >>>>>>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>>>> operate than a Kafka consumer. As Chris said, when we thought
> >>>>>>>>>>>>>>>>>>> about it, a lot of what Samza (and the other stream processing
> >>>>>>>>>>>>>>>>>>> systems) were doing seemed like kind of a hangover from MapReduce.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Of course you need to ingest/output data to and from
> >>>> the
> >>>>>>>> stream
> >>>>>>>>>>>>>>>>>>> processing. But when we actually looked into how that
> >>>> would
> >>>>>>>>>>>>>>>>>>> work,
> >>>>>>>>>>>>>>>> Samza
> >>>>>>>>>>>>>>>>>>> isn't really an ideal data ingestion framework for a
> >>>> bunch
> >>>>>>> of
> >>>>>>>>>>>>>>>> reasons.
> >>>>>>>>>>>>>>>>> To
> >>>>>>>>>>>>>>>>>>> really do that right you need a pretty different
> >>>> internal
> >>>>>>>> data
> >>>>>>>>>>>>>>>>>>> model
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> set of apis. So what if you split them and had an api
> >>>> for
> >>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>>> ingress/egress (copycat AKA KIP-26) and a separate
> >>> api
> >>>> for
> >>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>>> transformation (Samza).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> This would also allow really embracing the same
> >>>> terminology
> >>>>>>>> and
> >>>>>>>>>>>>>>>>>>> conventions. One complaint about the current state is
> >>>> that
> >>>>>>>> the
> >>>>>>>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>>> systems
> >>>>>>>>>>>>>>>>>>> kind of feel bolted on. Terminology like "stream" vs
> >>>>>>> "topic"
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> different
> >>>>>>>>>>>>>>>>>>> config and monitoring systems means you kind of have
> >>> to
> >>>>>>> learn
> >>>>>>>>>>>>>>>>>>> Kafka's
> >>>>>>>>>>>>>>>>>> way,
> >>>>>>>>>>>>>>>>>>> then learn Samza's slightly different way, then kind
> >>> of
> >>>>>>>>>>>>>>>>>>> understand
> >>>>>>>>>>>>>>>> how
> >>>>>>>>>>>>>>>>>> they
> >>>>>>>>>>>>>>>>>>> map to each other, which having walked a few people
> >>>> through
> >>>>>>>>>> this
> >>>>>>>>>>>>>>>>>>> is surprisingly tricky for folks to get.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Since I have been spending a lot of time on
> >>> airplanes I
> >>>>>>>> hacked
> >>>>>>>>>>>>>>>>>>> up an earnest but still somewhat incomplete prototype
> >>> of
> >>>>>>> what
> >>>>>>>>>>>>>>>>>>> this would
> >>>>>>>>>>>>>>>> look
> >>>>>>>>>>>>>>>>>>> like. This is just unceremoniously dumped into Kafka
> >>>> as it
> >>>>>>>>>>>>>>>>>>> required a
> >>>>>>>>>>>>>>>>> few
> >>>>>>>>>>>>>>>>>>> changes to the new consumer. Here is the code:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>
> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
> >>>>>>>>>>>>>>>> /apache/kafka/clients/streaming
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> For the purpose of the prototype I just liberally
> >>>> renamed
> >>>>>>>>>>>>>>>>>>> everything
> >>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> try to align it with Kafka with no regard for
> >>>>>>> compatibility.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> To use this would be something like this:
> >>>>>>>>>>>>>>>>>>> Properties props = new Properties();
> >>>>>>>>>>>>>>>>>>> props.put("bootstrap.servers", "localhost:4242");
> >>>>>>>>>>>>>>>>>>> StreamingConfig config = new StreamingConfig(props);
> >>>>>>>>>>>>>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> >>>>>>>>>>>>>>>>>>> config.processor(ExampleStreamProcessor.class);
> >>>>>>>>>>>>>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> >>>>>>>>>>>>>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> >>>>>>>>>>>>>>>>>>> container.run();
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> KafkaStreaming is basically the SamzaContainer;
> >>>>>>>> StreamProcessor
> >>>>>>>>>>>>>>>>>>> is basically StreamTask.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> So rather than putting all the class names in a file
> >>>> and
> >>>>>>> then
> >>>>>>>>>>>>>>>>>>> having
> >>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> job assembled by reflection, you just instantiate the
> >>>>>>>> container
> >>>>>>>>>>>>>>>>>>> programmatically. Work is balanced over however many
> >>>>>>>> instances
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>> this
> >>>>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>> alive at any time (i.e. if an instance dies, new
> >>> tasks
> >>>> are
> >>>>>>>>>> added
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> existing containers without shutting them down).
> >>>>>>>>>>>>>>>>>>>
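The "work is balanced over however many instances are alive" behavior comes from Kafka's consumer group rebalancing in the prototype. Its effect can be sketched with a toy round-robin assignment (illustrative only, not the actual rebalancing protocol):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the effect of rebalancing: partitions are spread round-robin
// over whichever instances are currently alive. The real mechanism is the
// Kafka consumer group protocol; this just illustrates the outcome.
public class AssignmentSketch {

    static Map<String, List<Integer>> assign(List<String> instances, int partitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String instance : instances) out.put(instance, new ArrayList<>());
        for (int p = 0; p < partitions; p++)
            out.get(instances.get(p % instances.size())).add(p);
        return out;
    }

    public static void main(String[] args) {
        // Two live instances share eight partitions...
        System.out.println(assign(Arrays.asList("a", "b"), 8));
        // ...one dies, and its partitions move to the survivor -- no
        // container restart, no custom scheduler involved.
        System.out.println(assign(Arrays.asList("a"), 8));
    }
}
```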
> >>>>>>>>>>>>>>>>>>> We would provide some glue for running this stuff in
> >>>> YARN
> >>>>>>> via
> >>>>>>>>>>>>>>>>>>> Slider, Mesos via Marathon, and AWS using some of
> >>> their
> >>>>>>> tools
> >>>>>>>>>>>>>>>>>>> but from the
> >>>>>>>>>>>>>>>>> point
> >>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>> view of these frameworks these stream processing jobs
> >>>> are
> >>>>>>>> just
> >>>>>>>>>>>>>>>>> stateless
> >>>>>>>>>>>>>>>>>>> services that can come and go and expand and contract
> >>>> at
> >>>>>>>> will.
> >>>>>>>>>>>>>>>>>>> There
> >>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>> no
> >>>>>>>>>>>>>>>>>>> more custom scheduler.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Here are some relevant details:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. It is only ~1300 lines of code, it would get
> >>>> larger if
> >>>>>>> we
> >>>>>>>>>>>>>>>>>>> productionized but not vastly larger. We really do
> >>>> get a
> >>>>>>> ton
> >>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>> leverage
> >>>>>>>>>>>>>>>>>>> out of Kafka.
> >>>>>>>>>>>>>>>>>>> 2. Partition management is fully delegated to the
> >>> new
> >>>>>>>>>> consumer.
> >>>>>>>>>>>>>>>> This
> >>>>>>>>>>>>>>>>>>> is nice since now any partition management strategy
> >>>>>>>> available
> >>>>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>>> consumer is also available to Samza (and vice versa)
> >>>> and
> >>>>>>>> with
> >>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> exact
> >>>>>>>>>>>>>>>>>>> same configs.
> >>>>>>>>>>>>>>>>>>> 3. It supports state as well as state reuse
> >>>>>>>>>>>>>>>>>>>
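What point 3's "supports state" buys you can be illustrated with a toy counting task (again, not the prototype's API; in the real thing the local store would be backed by a changelog topic so it survives restarts):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of stateful processing: the task keeps a local store
// rather than calling out to an external database per message. In the real
// system the store would be replicated to a changelog for fault tolerance.
public class StatefulCountSketch {

    private final Map<String, Long> store = new HashMap<>(); // local state

    // Process one message key; returns the updated running count.
    long process(String key) {
        long count = store.getOrDefault(key, 0L) + 1;
        store.put(key, count);
        return count;
    }

    public static void main(String[] args) {
        StatefulCountSketch task = new StatefulCountSketch();
        for (String key : Arrays.asList("page-a", "page-b", "page-a"))
            System.out.println(key + " -> " + task.process(key));
    }
}
```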
> >>>>>>>>>>>>>>>>>>> Anyhow take a look, hopefully it is thought
> >>> provoking.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> -Jay
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <
> >>>>>>>>>>>>>>>>> criccom...@apache.org>
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hey all,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I have had some discussions with Samza engineers at
> >>>>>>> LinkedIn
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> Confluent
> >>>>>>>>>>>>>>>>>>>> and we came up with a few observations and would
> >>> like
> >>>> to
> >>>>>>>>>>>>>>>>>>>> propose
> >>>>>>>>>>>>>>>> some
> >>>>>>>>>>>>>>>>>>>> changes.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> We've observed some things that I want to call out
> >>>> about
> >>>>>>>>>>>>>>>>>>>> Samza's
> >>>>>>>>>>>>>>>>> design,
> >>>>>>>>>>>>>>>>>>>> and I'd like to propose some changes.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> * Samza is dependent upon a dynamic deployment
> >>> system.
> >>>>>>>>>>>>>>>>>>>> * Samza is too pluggable.
> >>>>>>>>>>>>>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's
> >>>>>>> consumer
> >>>>>>>>>>>>>>>>>>>> APIs
> >>>>>>>>>>>>>>>> are
> >>>>>>>>>>>>>>>>>>>> trying to solve a lot of the same problems.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> All three of these issues are related, but I'll
> >>>> address
> >>>>>>> them
> >>>>>>>>>> in
> >>>>>>>>>>>>>>>> order.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Deployment
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Samza strongly depends on the use of a dynamic
> >>>> deployment
> >>>>>>>>>>>>>>>>>>>> scheduler
> >>>>>>>>>>>>>>>>> such
> >>>>>>>>>>>>>>>>>>>> as
> >>>>>>>>>>>>>>>>>>>> YARN, Mesos, etc. When we initially built Samza, we
> >>>> bet
> >>>>>>> that
> >>>>>>>>>>>>>>>>>>>> there
> >>>>>>>>>>>>>>>>> would
> >>>>>>>>>>>>>>>>>>>> be
> >>>>>>>>>>>>>>>>>>>> one or two winners in this area, and we could
> >>> support
> >>>>>>> them,
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> rest
> >>>>>>>>>>>>>>>>>>>> would go away. In reality, there are many
> >>> variations.
> >>>>>>>>>>>>>>>>>>>> Furthermore,
> >>>>>>>>>>>>>>>>> many
> >>>>>>>>>>>>>>>>>>>> people still prefer to just start their processors
> >>>> like
> >>>>>>>> normal
> >>>>>>>>>>>>>>>>>>>> Java processes, and use traditional deployment
> >>> scripts
> >>>>>>> such
> >>>>>>>> as
> >>>>>>>>>>>>>>>>>>>> Fabric,
> >>>>>>>>>>>>>>>>> Chef,
> >>>>>>>>>>>>>>>>>>>> Ansible, etc. Forcing a deployment system on users
> >>>> makes
> >>>>>>> the
> >>>>>>>>>>>>>>>>>>>> Samza start-up process really painful for first time
> >>>>>>> users.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Dynamic deployment as a requirement was also a bit
> >>> of
> >>>> a
> >>>>>>>>>>>>>>>>>>>> mis-fire
> >>>>>>>>>>>>>>>>> because
> >>>>>>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>>>>> a fundamental misunderstanding between the nature of
> >>>> batch
> >>>>>>>>>> jobs
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>> stream
> >>>>>>>>>>>>>>>>>>>> processing jobs. Early on, we made conscious effort
> >>> to
> >>>>>>> favor
> >>>>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> Hadoop
> >>>>>>>>>>>>>>>>>>>> (Map/Reduce) way of doing things, since it worked
> >>> and
> >>>> was
> >>>>>>>> well
> >>>>>>>>>>>>>>>>>> understood.
> >>>>>>>>>>>>>>>>>>>> One thing that we missed was that batch jobs have a
> >>>>>>> definite
> >>>>>>>>>>>>>>>>> beginning,
> >>>>>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> end, and stream processing jobs don't (usually).
> >>> This
> >>>>>>> leads
> >>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>> much
> >>>>>>>>>>>>>>>>>>>> simpler scheduling problem for stream processors.
> >>> You
> >>>>>>>>>> basically
> >>>>>>>>>>>>>>>>>>>> just
> >>>>>>>>>>>>>>>>>> need
> >>>>>>>>>>>>>>>>>>>> to find a place to start the processor, and start
> >>> it.
> >>>> The
> >>>>>>>> way
> >>>>>>>>>>>>>>>>>>>> we run grids, at LinkedIn, there's no concept of a
> >>>> cluster
> >>>>>>>>>>>>>>>>>>>> being "full". We always
> >>>>>>>>>>>>>>>>> add
> >>>>>>>>>>>>>>>>>>>> more machines. The problem with coupling Samza with
> >>> a
> >>>>>>>>>> scheduler
> >>>>>>>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>>>>>> Samza (as a framework) now has to handle deployment.
> >>>> This
> >>>>>>>>>> pulls
> >>>>>>>>>>>>>>>>>>>> in a
> >>>>>>>>>>>>>>>>>> bunch
> >>>>>>>>>>>>>>>>>>>> of things such as configuration distribution (config
> >>>>>>>> stream),
> >>>>>>>>>>>>>>>>>>>> shell
> >>>>>>>>>>>>>>>>>> scripts
> >>>>>>>>>>>>>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all the .tgz
> >>>>>>> stuff),
> >>>>>>>>>> etc.
> >>>>>>>>>>>>>>>>>>>>
> Another reason for requiring dynamic deployment was to support data
> locality: if you want locality, you need to put your processors close to
> the data they're processing. Upon further investigation, though, this
> feature is not that beneficial. There is some good discussion of its
> problems on SAMZA-335. Again, we took the Map/Reduce path, but there are
> some fundamental differences between HDFS and Kafka: HDFS has blocks,
> while Kafka has partitions. This leaves less optimization potential for
> stream processors on top of Kafka.
>
> This feature is also used as a crutch. Samza doesn't have any built-in
> fault-tolerance logic. Instead, it depends on the dynamic deployment
> scheduling system to handle restarts when a processor dies. This has made
> it very difficult to write a standalone Samza container (SAMZA-516).
>
> Pluggability
>
> In some cases pluggability is good, but I think we've gone too far with
> it. Currently, Samza has:
>
> * Pluggable config.
> * Pluggable metrics.
> * Pluggable deployment systems.
> * Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
> * Pluggable serdes.
> * Pluggable storage engines.
> * Pluggable strategies for just about every component (MessageChooser,
>   SystemStreamPartitionGrouper, ConfigRewriter, etc).
>
> There's probably more that I've forgotten, as well. Some of these are
> useful, but some have proven not to be. This all comes at a cost:
> complexity. This complexity makes it harder for our users to pick up and
> use Samza out of the box. It also makes it difficult for Samza developers
> to reason about the characteristics of the container, since the
> characteristics change depending on which plugins are used.
>
> The issues with pluggability are most visible in the System APIs. What
> Samza really requires to be functional is Kafka as its transport layer.
> But we've conflated two unrelated use cases into one API:
>
> 1. Get data into/out of Kafka.
> 2. Process the data in Kafka.
>
> The current System API supports both of these use cases. The problem is,
> we actually want different features for each use case. By papering over
> these two use cases and providing a single API, we've introduced a ton of
> leaky abstractions.
>
> For example, what we'd really like in (2) is to have monotonically
> increasing longs for offsets (like Kafka). This would be at odds with (1),
> though, since different systems have different SCNs/Offsets/UUIDs/vectors.
> There was discussion both on the mailing list and the SQL JIRAs about the
> need for this.
>
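To make the offset mismatch concrete, here is a minimal sketch (all class and method names are invented for illustration, not Samza APIs): with Kafka-style monotonically increasing longs, the framework can compute ordering, lag, and the next position; with an opaque system-specific offset, it can only store and replay the value verbatim.

```java
// Hypothetical sketch -- not Samza code. Shows why use case (2) wants
// Kafka-style long offsets, and what is lost with opaque offsets.
public class OffsetSketch {

    // Kafka-style offsets: ordering, lag, and "next position" are trivial.
    static long nextOffset(long offset) {
        return offset + 1;
    }

    static long lag(long logEndOffset, long consumedOffset) {
        return logEndOffset - consumedOffset;
    }

    // Opaque offset from an arbitrary system (SCN, UUID, vector, ...):
    // the framework can checkpoint and restore it, but cannot order,
    // compare, or advance it generically.
    static String checkpoint(String opaqueOffset) {
        return opaqueOffset;
    }

    public static void main(String[] args) {
        System.out.println(nextOffset(41));           // 42
        System.out.println(lag(100, 97));             // 3
        System.out.println(checkpoint("scn:0x7f3a")); // stored verbatim
    }
}
```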
> The same thing holds true for replayability. Kafka allows us to rewind
> when we have a failure. Many other systems don't. In some cases, systems
> return null for their offsets (e.g. WikipediaSystemConsumer) because they
> have no offsets.
>
> Partitioning is another example. Kafka supports partitioning, but many
> systems don't. We model this by having a single partition for those
> systems. Still other systems model partitioning differently (e.g.
> Kinesis).
>
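As a rough illustration of that single-partition modeling (names invented, not the actual SystemConsumer code): a partitioned system spreads keys across N partitions, while an unpartitioned feed is declared to have exactly one partition, so every message lands on partition 0 and all parallelism collapses onto one task.

```java
// Hypothetical sketch -- not Samza code. Models how a system without
// native partitioning is squeezed into a partitioned-stream model.
public class PartitionModelSketch {

    // Kafka-style: hash the message key across the real partitions.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Non-partitioned system (e.g. a raw feed): declare one partition,
    // so every key maps to partition 0.
    static int unpartitioned(String key) {
        return partitionFor(key, 1); // always 0
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("user-42", 8)); // some value in [0, 8)
        System.out.println(unpartitioned("anything"));  // 0
    }
}
```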
> The SystemAdmin interface is also a mess. Creating streams in a
> system-agnostic way is almost impossible, as is modeling metadata for the
> system (replication factor, partitions, location, etc). The list goes on.
>
> Duplicate work
>
> At the time that we began writing Samza, Kafka's consumer and producer
> APIs had a relatively weak feature set. On the consumer side, you had two
> options: use the high-level consumer, or the simple consumer. The problem
> with the high-level consumer was that it controlled your offsets,
> partition assignments, and the order in which you received messages. The
> problem with the simple consumer is that it's not simple; it's basic. You
> end up having to handle a lot of really low-level stuff that you
> shouldn't. We spent a lot of time making Samza's KafkaSystemConsumer very
> robust. It also allows us to support some cool features:
>
> * Per-partition message ordering and prioritization.
> * Tight control over partition assignment to support joins, global state
>   (if we want to implement it :)), etc.
> * Tight control over offset checkpointing.
>
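The join point above can be sketched as follows (invented names, in the spirit of SystemStreamPartitionGrouper rather than its real signature): a join only works if partition i of every input topic is assigned to the same task, so that records with matching keys meet in one place.

```java
// Hypothetical sketch -- not the real SystemStreamPartitionGrouper API.
// Groups partition i of every input topic into task i, the assignment a
// co-partitioned join requires.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JoinAssignmentSketch {

    static Map<Integer, List<String>> groupByPartition(List<String> topics,
                                                       int numPartitions) {
        Map<Integer, List<String>> tasks = new TreeMap<>();
        for (int p = 0; p < numPartitions; p++) {
            List<String> assigned = new ArrayList<>();
            for (String topic : topics) {
                assigned.add(topic + "-" + p); // partition p of every topic
            }
            tasks.put(p, assigned); // task p sees all of them together
        }
        return tasks;
    }

    public static void main(String[] args) {
        System.out.println(groupByPartition(
            Arrays.asList("clicks", "impressions"), 2));
        // {0=[clicks-0, impressions-0], 1=[clicks-1, impressions-1]}
    }
}
```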
> What we didn't realize at the time is that these features should actually
> be in Kafka. A lot of Kafka consumers (not just Samza stream processors)
> end up wanting to do things like joins and partition assignment. The
> Kafka community has come to the same conclusion: they're adding a ton of
> upgrades into their new Kafka consumer implementation. To a large extent,
> it duplicates work we've already done in Samza.
>
> On top of this, Kafka ended up taking a very similar approach to Samza's
> KafkaCheckpointManager implementation for handling offset checkpointing.
> Like Samza, Kafka's new offset management feature stores offset
> checkpoints in a topic, and allows you to fetch them from the broker.
>
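The checkpoint-topic idea can be sketched like this (invented names; not the KafkaCheckpointManager implementation): checkpoints are appended to a log keyed by (group, topic, partition), and recovery replays the log so that the last write per key wins -- the same semantics Kafka's log compaction provides.

```java
// Hypothetical sketch -- not Samza or Kafka code. Checkpoints live in an
// append-only log; replaying it and keeping the last value per key
// recovers the committed offsets.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointTopicSketch {

    // Replay the checkpoint log: last write per key wins.
    static Map<String, Long> replay(List<String[]> log) {
        Map<String, Long> committed = new HashMap<>();
        for (String[] record : log) {
            committed.put(record[0], Long.parseLong(record[1]));
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String[]> checkpointLog = Arrays.asList(
            new String[]{"job1:clicks:0", "100"},
            new String[]{"job1:clicks:1", "87"},
            new String[]{"job1:clicks:0", "250"}); // supersedes the first

        Map<String, Long> committed = replay(checkpointLog);
        System.out.println(committed.get("job1:clicks:0")); // 250
        System.out.println(committed.get("job1:clicks:1")); // 87
    }
}
```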
> A lot of this seems like a waste, since we could have shared the work if
> it had been done in Kafka from the get-go.
>
> Vision
>
> All of this leads me to a rather radical proposal. Samza is relatively
> stable at this point; I'd venture to say that we're near a 1.0 release.
> I'd like to propose that we take what we've learned, and begin thinking
> about Samza beyond 1.0. What would we change if we were starting from
> scratch? My proposal is to:
>
> 1. Make Samza standalone the *only* way to run Samza processors, and
>    eliminate all direct dependencies on YARN, Mesos, etc.
> 2. Make a definitive call to support only Kafka as the stream processing
>    layer.
> 3. Eliminate Samza's metrics, logging, serialization, and config systems,
>    and simply use Kafka's instead.
>
> This would fix all of the issues that I outlined above. It should also
> shrink the Samza code base pretty dramatically. Supporting only a
> standalone container will allow Samza to be executed on YARN (using
> Slider), Mesos (using Marathon/Aurora), or most other in-house deployment
> systems. This should make life a lot easier for new users. Imagine having
> the hello-samza tutorial without YARN. The drop in mailing list traffic
> will be pretty dramatic.
>
> Coupling with Kafka seems long overdue to me. The reality is, everyone
> that I'm aware of is using Samza with Kafka. We basically require it
> already in order for most features to work. Those that are using other
> systems are generally using them for ingest into Kafka (1), and then they
> do the processing on top. There is already discussion in Kafka
> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767)
> about making ingesting into Kafka extremely easy.
>
> Once we make the call to couple with Kafka, we can leverage a ton of
> their ecosystem. We no longer have to maintain our own config, metrics,
> etc. We can all share the same libraries, and make them better. This will
> also allow us to share the consumer/producer APIs, and will let us
> leverage their offset management and partition management, rather than
> having our own. All of the coordinator stream code would go away, as
> would most of the YARN AppMaster code. We'd probably have to push some
> partition management features into the Kafka broker, but they're already
> moving in that direction with the new consumer API. The features we have
> for partition assignment aren't unique to Samza, and seem like they
> should be in Kafka anyway. There will always be some niche usages that
> require extra care, and hence full control over partition assignments,
> much like the Kafka low-level consumer API; these would continue to be
> supported.
>
> These items will be good for the Samza community. They'll make Samza
> easier to use, and make it easier for developers to add new features.
>
> Obviously this is a fairly large (and somewhat backwards-incompatible)
> change. If we choose to go this route, it's important that we openly
> communicate how we're going to provide a migration path from the existing
> APIs to the new ones (if we make incompatible changes). I think at a
> minimum, we'd probably need to provide a wrapper to allow existing
> StreamTask implementations to continue running on the new container. It's
> also important that we openly communicate about timing, and the stages of
> the migration.
>
> If you made it this far, I'm sure you have opinions. :) Please send your
> thoughts and feedback.
>
> Cheers,
> Chris
>
> --
> -- Guozhang
>
> --
> Jordan Shaw
> Full Stack Software Engineer
> PubNub Inc
> 1045 17th St
> San Francisco, CA 94107