Julian, does this mean you AREN'T in favor of our new Kafka-based email reader? How else will Kafka users read email :-)
Seriously, could not agree more. -Jay On Fri, Jul 10, 2015 at 2:03 PM, Julian Hyde <jh...@apache.org> wrote: > I broadly support it, with one big proviso. > > One of the attractive things about Kafka has been its minimalism -- > the fact that it solves one part of the problem, simply, and very > well. It is very important that it continues to do that, and that > people continue to perceive it that way. Make Kafka into a stack, if > you must, but make sure that the architectural layers remain clear. > > Making perfectly good software components into a muddled stack is one > of the pitfalls of success, and in particular a pitfall of being a > software company with a sales team looking for a bigger pieces to sell > and customers looking for "simpler" large components to install. Avoid > at all costs! > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> wrote: > > Yeah I agree with this summary. I think there are kind of two questions > > here: > > 1. Technically does alignment/reliance on Kafka make sense > > 2. Branding wise (naming, website, concepts, etc) does alignment with > Kafka > > make sense > > > > Personally I do think both of these things would be really valuable, and > > would dramatically alter the trajectory of the project. > > > > My preference would be to see if people can mostly agree on a direction > > rather than splintering things off. From my point of view the ideal > outcome > > of all the options discussed would be to make Samza a closely aligned > > subproject, maintained in a separate repository and retaining the > existing > > committership but sharing as much else as possible (website, etc). No > idea > > about how these things work, Jacob, you probably know more. > > > > No discussion amongst the Kafka folks has happened on this, but likely we > > should figure out what the Samza community actually wants first. > > > > I admit that this is a fairly radical departure from how things are. > > > > If that doesn't fly, I think, yeah we could leave Samza as it is and do > the > > more radical reboot inside Kafka. From my point of view that does leave > > things in a somewhat confusing state since now there are two stream > > processing systems more or less coupled to Kafka in large part made by > the > > same people. But, arguably that might be a cleaner way to make the > cut-over > > and perhaps less risky for Samza community since if it works people can > > switch and if it doesn't nothing will have changed. Dunno, how do people > > feel about this? > > > > -Jay > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com> wrote: > > > >> > This leads me to thinking that merging projects and communities might > >> be a good idea: with the union of experience from both communities, we > will > >> probably build a better system that is better for users. > >> Is this what's being proposed though? Merging the projects seems like > >> a consequence of at most one of the three directions under discussion: > >> 1) Samza 2.0: The Samza community relies more heavily on Kafka for > >> configuration, etc. (to a greater or lesser extent to be determined) > >> but the Samza community would not automatically merge withe Kafka > >> community (the Phoenix/HBase example is a good one here). > >> 2) Samza Reboot: The Samza community continues to exist with a limited > >> project scope, but similarly would not need to be part of the Kafka > >> community (ie given committership) to progress. Here, maybe the Samza > >> team would become a subproject of Kafka (the Board frowns on > >> subprojects at the moment, so I'm not sure if that's even feasible), > >> but that would not be required. > >> 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka > >> team builds its own streaming library, possibly off of Jay's > >> prototype, which has not direct lineage to the Samza team. There's no > >> reason for the Kafka team to bring in the Samza team. > >> > >> Is the Kafka community on board with this? > >> > >> To be clear, all three options under discussion are interesting, > >> technically valid and likely healthy directions for the project. > >> Also, they are not mutually exclusive. The Samza community could > >> decide to pursue, say, 'Samza 2.0', while the Kafka community went > >> forward with 'Hey Samza!' My points above are directed entirely at > >> the community aspect of these choices. > >> -Jakob > >> > >> On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com> wrote: > >> > That's great. Thanks, Jay. > >> > > >> > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io> wrote: > >> > > >> >> Yeah totally agree. I think you have this issue even today, right? > I.e. > >> if > >> >> you need to make a simple config change and you're running in YARN > today > >> >> you end up bouncing the job which then rebuilds state. I think the > fix > >> is > >> >> exactly what you described which is to have a long timeout on > partition > >> >> movement for stateful jobs so that if a job is just getting bounced, > and > >> >> the cluster manager (or admin) is smart enough to restart it on the > same > >> >> host when possible, it can optimistically reuse any existing state it > >> finds > >> >> on disk (if it is valid). > >> >> > >> >> So in this model the charter of the CM is to place processes as > >> stickily as > >> >> possible and to restart or re-place failed processes. The charter of > the > >> >> partition management system is to control the assignment of work to > >> these > >> >> processes. The nice thing about this is that the work assignment, > >> timeouts, > >> >> behavior, configs, and code will all be the same across all cluster > >> >> managers. > >> >> > >> >> So I think that prototype would actually give you exactly what you > want > >> >> today for any cluster manager (or manual placement + restart script) > >> that > >> >> was sticky in terms of host placement since there is already a > >> configurable > >> >> partition movement timeout and task-by-task state reuse with a check > on > >> >> state validity. > >> >> > >> >> -Jay > >> >> > >> >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover < > roger.hoo...@gmail.com> > >> >> wrote: > >> >> > >> >> > That would be great to let Kafka do as much heavy lifting as > possible > >> and > >> >> > make it easier for other languages to implement Samza apis. > >> >> > > >> >> > One thing to watch out for is the interplay between Kafka's group > >> >> > management and the external scheduler/process manager's fault > >> tolerance. > >> >> > If a container dies, the Kafka group membership protocol will try > to > >> >> assign > >> >> > it's tasks to other containers while at the same time the process > >> manager > >> >> > is trying to relaunch the container. Without some consideration > for > >> this > >> >> > (like a configurable amount of time to wait before Kafka alters the > >> group > >> >> > membership), there may be thrashing going on which is especially > bad > >> for > >> >> > containers with large amounts of local state. > >> >> > > >> >> > Someone else pointed this out already but I thought it might be > worth > >> >> > calling out again. > >> >> > > >> >> > Cheers, > >> >> > > >> >> > Roger > >> >> > > >> >> > > >> >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io> > wrote: > >> >> > > >> >> > > Hey Roger, > >> >> > > > >> >> > > I couldn't agree more. We spent a bunch of time talking to people > >> and > >> >> > that > >> >> > > is exactly the stuff we heard time and again. What makes it > hard, of > >> >> > > course, is that there is some tension between compatibility with > >> what's > >> >> > > there now and making things better for new users. > >> >> > > > >> >> > > I also strongly agree with the importance of multi-language > >> support. We > >> >> > are > >> >> > > talking now about Java, but for application development use cases > >> >> people > >> >> > > want to work in whatever language they are using elsewhere. I > think > >> >> > moving > >> >> > > to a model where Kafka itself does the group membership, > lifecycle > >> >> > control, > >> >> > > and partition assignment has the advantage of putting all that > >> complex > >> >> > > stuff behind a clean api that the clients are already going to be > >> >> > > implementing for their consumer, so the added functionality for > >> stream > >> >> > > processing beyond a consumer becomes very minor. > >> >> > > > >> >> > > -Jay > >> >> > > > >> >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover < > >> roger.hoo...@gmail.com> > >> >> > > wrote: > >> >> > > > >> >> > > > Metamorphosis...nice. :) > >> >> > > > > >> >> > > > This has been a great discussion. As a user of Samza who's > >> recently > >> >> > > > integrated it into a relatively large organization, I just > want to > >> >> add > >> >> > > > support to a few points already made. > >> >> > > > > >> >> > > > The biggest hurdles to adoption of Samza as it currently exists > >> that > >> >> > I've > >> >> > > > experienced are: > >> >> > > > 1) YARN - YARN is overly complex in many environments where > Puppet > >> >> > would > >> >> > > do > >> >> > > > just fine but it was the only mechanism to get fault tolerance. > >> >> > > > 2) Configuration - I think I like the idea of configuring most > of > >> the > >> >> > job > >> >> > > > in code rather than config files. In general, I think the goal > >> >> should > >> >> > be > >> >> > > > to make it harder to make mistakes, especially of the kind > where > >> the > >> >> > code > >> >> > > > expects something and the config doesn't match. The current > >> config > >> >> is > >> >> > > > quite intricate and error-prone. For example, the application > >> logic > >> >> > may > >> >> > > > depend on bootstrapping a topic but rather than asserting that > in > >> the > >> >> > > code, > >> >> > > > you have to rely on getting the config right. Likewise with > >> serdes, > >> >> > the > >> >> > > > Java representations produced by various serdes (JSON, Avro, > etc.) > >> >> are > >> >> > > not > >> >> > > > equivalent so you cannot just reconfigure a serde without > changing > >> >> the > >> >> > > > code. It would be nice for jobs to be able to assert what > they > >> >> expect > >> >> > > > from their input topics in terms of partitioning. This is > >> getting a > >> >> > > little > >> >> > > > off topic but I was even thinking about creating a "Samza > config > >> >> > linter" > >> >> > > > that would sanity check a set of configs. Especially in > >> >> organizations > >> >> > > > where config is managed by a different team than the > application > >> >> > > developer, > >> >> > > > it's very hard to get avoid config mistakes. > >> >> > > > 3) Java/Scala centric - for many teams (especially DevOps-type > >> >> folks), > >> >> > > the > >> >> > > > pain of the Java toolchain (maven, slow builds, weak command > line > >> >> > > support, > >> >> > > > configuration over convention) really inhibits productivity. > As > >> more > >> >> > and > >> >> > > > more high-quality clients become available for Kafka, I hope > >> they'll > >> >> > > follow > >> >> > > > Samza's model. Not sure how much it affects the proposals in > this > >> >> > thread > >> >> > > > but please consider other languages in the ecosystem as well. > >> From > >> >> > what > >> >> > > > I've heard, Spark has more Python users than Java/Scala. > >> >> > > > (FYI, we added a Jython wrapper for the Samza API > >> >> > > > > >> >> > > > > >> >> > > > >> >> > > >> >> > >> > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza > >> >> > > > and are working on a Yeoman generator > >> >> > > > https://github.com/Quantiply/generator-rico for Jython/Samza > >> >> projects > >> >> > to > >> >> > > > alleviate some of the pain) > >> >> > > > > >> >> > > > I also want to underscore Jay's point about improving the user > >> >> > > experience. > >> >> > > > That's a very important factor for adoption. I think the goal > >> should > >> >> > be > >> >> > > to > >> >> > > > make Samza as easy to get started with as something like > Logstash. > >> >> > > > Logstash is vastly inferior in terms of capabilities to Samza > but > >> >> it's > >> >> > > easy > >> >> > > > to get started and that makes a big difference. > >> >> > > > > >> >> > > > Cheers, > >> >> > > > > >> >> > > > Roger > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales > < > >> >> > > > g...@apache.org> wrote: > >> >> > > > > >> >> > > > > Forgot to add. On the naming issues, Kafka Metamorphosis is a > >> clear > >> >> > > > winner > >> >> > > > > :) > >> >> > > > > > >> >> > > > > -- > >> >> > > > > Gianmarco > >> >> > > > > > >> >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales < > >> >> > > g...@apache.org > >> >> > > > > > >> >> > > > > wrote: > >> >> > > > > > >> >> > > > > > Hi, > >> >> > > > > > > >> >> > > > > > @Martin, thanks for you comments. > >> >> > > > > > Maybe I'm missing some important point, but I think > coupling > >> the > >> >> > > > releases > >> >> > > > > > is actually a *good* thing. > >> >> > > > > > To make an example, would it be better if the MR and HDFS > >> >> > components > >> >> > > of > >> >> > > > > > Hadoop had different release schedules? > >> >> > > > > > > >> >> > > > > > Actually, keeping the discussion in a single place would > make > >> >> > > agreeing > >> >> > > > on > >> >> > > > > > releases (and backwards compatibility) much easier, as > >> everybody > >> >> > > would > >> >> > > > be > >> >> > > > > > responsible for the whole codebase. > >> >> > > > > > > >> >> > > > > > That said, I like the idea of absorbing samza-core as a > >> >> > sub-project, > >> >> > > > and > >> >> > > > > > leave the fancy stuff separate. > >> >> > > > > > It probably gives 90% of the benefits we have been > discussing > >> >> here. > >> >> > > > > > > >> >> > > > > > Cheers, > >> >> > > > > > > >> >> > > > > > -- > >> >> > > > > > Gianmarco > >> >> > > > > > > >> >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> > >> wrote: > >> >> > > > > > > >> >> > > > > >> Hey Martin, > >> >> > > > > >> > >> >> > > > > >> I agree coupling release schedules is a downside. > >> >> > > > > >> > >> >> > > > > >> Definitely we can try to solve some of the integration > >> problems > >> >> in > >> >> > > > > >> Confluent Platform or in other distributions. But I think > >> this > >> >> > ends > >> >> > > up > >> >> > > > > >> being really shallow. I guess I feel to really get a good > >> user > >> >> > > > > experience > >> >> > > > > >> the two systems have to kind of feel like part of the same > >> thing > >> >> > and > >> >> > > > you > >> >> > > > > >> can't really add that in later--you can put both in the > same > >> >> > > > > downloadable > >> >> > > > > >> tar file but it doesn't really give a very cohesive > feeling. > >> I > >> >> > agree > >> >> > > > > that > >> >> > > > > >> ultimately any of the project stuff is as much social and > >> naming > >> >> > as > >> >> > > > > >> anything else--theoretically two totally independent > projects > >> >> > could > >> >> > > > work > >> >> > > > > >> to > >> >> > > > > >> tightly align. In practice this seems to be quite > difficult > >> >> > though. > >> >> > > > > >> > >> >> > > > > >> For the frameworks--totally agree it would be good to > >> maintain > >> >> the > >> >> > > > > >> framework support with the project. In some cases there > may > >> not > >> >> be > >> >> > > too > >> >> > > > > >> much > >> >> > > > > >> there since the integration gets lighter but I think > whatever > >> >> > stubs > >> >> > > > you > >> >> > > > > >> need should be included. So no I definitely wasn't trying > to > >> >> imply > >> >> > > > > >> dropping > >> >> > > > > >> support for these frameworks, just making the integration > >> >> lighter > >> >> > by > >> >> > > > > >> separating process management from partition management. > >> >> > > > > >> > >> >> > > > > >> You raise two good points we would have to figure out if > we > >> went > >> >> > > down > >> >> > > > > the > >> >> > > > > >> alignment path: > >> >> > > > > >> 1. With respect to the name, yeah I think the first > question > >> is > >> >> > > > whether > >> >> > > > > >> some "re-branding" would be worth it. If so then I think > we > >> can > >> >> > > have a > >> >> > > > > big > >> >> > > > > >> thread on the name. I'm definitely not set on Kafka > >> Streaming or > >> >> > > Kafka > >> >> > > > > >> Streams I was just using them to be kind of illustrative. > I > >> >> agree > >> >> > > with > >> >> > > > > >> your > >> >> > > > > >> critique of these names, though I think people would get > the > >> >> idea. > >> >> > > > > >> 2. Yeah you also raise a good point about how to "factor" > it. > >> >> Here > >> >> > > are > >> >> > > > > the > >> >> > > > > >> options I see (I could get enthusiastic about any of > them): > >> >> > > > > >> a. One repo for both Kafka and Samza > >> >> > > > > >> b. Two repos, retaining the current seperation > >> >> > > > > >> c. Two repos, the equivalent of samza-api and > samza-core > >> is > >> >> > > > absorbed > >> >> > > > > >> almost like a third client > >> >> > > > > >> > >> >> > > > > >> Cheers, > >> >> > > > > >> > >> >> > > > > >> -Jay > >> >> > > > > >> > >> >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann < > >> >> > > > mar...@kleppmann.com> > >> >> > > > > >> wrote: > >> >> > > > > >> > >> >> > > > > >> > Ok, thanks for the clarifications. Just a few follow-up > >> >> > comments. > >> >> > > > > >> > > >> >> > > > > >> > - I see the appeal of merging with Kafka or becoming a > >> >> > subproject: > >> >> > > > the > >> >> > > > > >> > reasons you mention are good. The risk I see is that > >> release > >> >> > > > schedules > >> >> > > > > >> > become coupled to each other, which can slow everyone > down, > >> >> and > >> >> > > > large > >> >> > > > > >> > projects with many contributors are harder to manage. > >> (Jakob, > >> >> > can > >> >> > > > you > >> >> > > > > >> speak > >> >> > > > > >> > from experience, having seen a wider range of Hadoop > >> ecosystem > >> >> > > > > >> projects?) > >> >> > > > > >> > > >> >> > > > > >> > Some of the goals of a better unified developer > experience > >> >> could > >> >> > > > also > >> >> > > > > be > >> >> > > > > >> > solved by integrating Samza nicely into a Kafka > >> distribution > >> >> > (such > >> >> > > > as > >> >> > > > > >> > Confluent's). I'm not against merging projects if we > decide > >> >> > that's > >> >> > > > the > >> >> > > > > >> way > >> >> > > > > >> > to go, just pointing out the same goals can perhaps > also be > >> >> > > achieved > >> >> > > > > in > >> >> > > > > >> > other ways. > >> >> > > > > >> > > >> >> > > > > >> > - With regard to dropping the YARN dependency: are you > >> >> proposing > >> >> > > > that > >> >> > > > > >> > Samza doesn't give any help to people wanting to run on > >> >> > > > > >> YARN/Mesos/AWS/etc? > >> >> > > > > >> > So the docs would basically have a link to Slider and > >> nothing > >> >> > > else? > >> >> > > > Or > >> >> > > > > >> > would we maintain integrations with a bunch of popular > >> >> > deployment > >> >> > > > > >> methods > >> >> > > > > >> > (e.g. the necessary glue and shell scripts to make Samza > >> work > >> >> > with > >> >> > > > > >> Slider)? > >> >> > > > > >> > > >> >> > > > > >> > I absolutely think it's a good idea to have the "as a > >> library" > >> >> > and > >> >> > > > > "as a > >> >> > > > > >> > process" (using Yi's taxonomy) options for people who > want > >> >> them, > >> >> > > > but I > >> >> > > > > >> > think there should also be a low-friction path for > common > >> "as > >> >> a > >> >> > > > > service" > >> >> > > > > >> > deployment methods, for which we probably need to > maintain > >> >> > > > > integrations. > >> >> > > > > >> > > >> >> > > > > >> > - Project naming: "Kafka Streams" seems odd to me, > because > >> >> Kafka > >> >> > > is > >> >> > > > > all > >> >> > > > > >> > about streams already. Perhaps "Kafka Transformers" or > >> "Kafka > >> >> > > > Filters" > >> >> > > > > >> > would be more apt? > >> >> > > > > >> > > >> >> > > > > >> > One suggestion: perhaps the core of Samza (stream > >> >> transformation > >> >> > > > with > >> >> > > > > >> > state management -- i.e. the "Samza as a library" bit) > >> could > >> >> > > become > >> >> > > > > >> part of > >> >> > > > > >> > Kafka, while higher-level tools such as streaming SQL > and > >> >> > > > integrations > >> >> > > > > >> with > >> >> > > > > >> > deployment frameworks remain in a separate project? In > >> other > >> >> > > words, > >> >> > > > > >> Kafka > >> >> > > > > >> > would absorb the proven, stable core of Samza, which > would > >> >> > become > >> >> > > > the > >> >> > > > > >> > "third Kafka client" mentioned early in this thread. The > >> Samza > >> >> > > > project > >> >> > > > > >> > would then target that third Kafka client as its base > API, > >> and > >> >> > the > >> >> > > > > >> project > >> >> > > > > >> > would be freed up to explore more experimental new > >> horizons. > >> >> > > > > >> > > >> >> > > > > >> > Martin > >> >> > > > > >> > > >> >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <jay.kr...@gmail.com > > > >> >> wrote: > >> >> > > > > >> > > >> >> > > > > >> > > Hey Martin, > >> >> > > > > >> > > > >> >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually don't > think > >> it > >> >> > ties > >> >> > > > our > >> >> > > > > >> > hands > >> >> > > > > >> > > at all, all it does is refactor things. The division > of > >> >> > > > > >> responsibility is > >> >> > > > > >> > > that Samza core is responsible for task lifecycle, > state, > >> >> and > >> >> > > > > >> partition > >> >> > > > > >> > > management (using the Kafka co-ordinator) but it is > NOT > >> >> > > > responsible > >> >> > > > > >> for > >> >> > > > > >> > > packaging, configuration deployment or execution of > >> >> processes. > >> >> > > The > >> >> > > > > >> > problem > >> >> > > > > >> > > of packaging and starting these processes is > >> >> > > > > >> > > framework/environment-specific. This leaves individual > >> >> > > frameworks > >> >> > > > to > >> >> > > > > >> be > >> >> > > > > >> > as > >> >> > > > > >> > > fancy or vanilla as they like. So you can get simple > >> >> stateless > >> >> > > > > >> support in > >> >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app > framework > >> >> > > (Slider, > >> >> > > > > >> > Marathon, > >> >> > > > > >> > > etc). These are well known by people and have nice UIs > >> and a > >> >> > lot > >> >> > > > of > >> >> > > > > >> > > flexibility. I don't think they have node affinity as > a > >> >> built > >> >> > in > >> >> > > > > >> option > >> >> > > > > >> > > (though I could be wrong). So if we want that we can > >> either > >> >> > wait > >> >> > > > for > >> >> > > > > >> them > >> >> > > > > >> > > to add it or do a custom framework to add that feature > >> (as > >> >> > now). > >> >> > > > > >> > Obviously > >> >> > > > > >> > > if you manage things with old-school ops tools > >> >> > (puppet/chef/etc) > >> >> > > > you > >> >> > > > > >> get > >> >> > > > > >> > > locality easily. The nice thing, though, is that all > the > >> >> samza > >> >> > > > > >> "business > >> >> > > > > >> > > logic" around partition management and fault tolerance > >> is in > >> >> > > Samza > >> >> > > > > >> core > >> >> > > > > >> > so > >> >> > > > > >> > > it is shared across frameworks and the framework > specific > >> >> bit > >> >> > is > >> >> > > > > just > >> >> > > > > >> > > whether it is smart enough to try to get the same host > >> when > >> >> a > >> >> > > job > >> >> > > > is > >> >> > > > > >> > > restarted. > >> >> > > > > >> > > > >> >> > > > > >> > > With respect to the Kafka-alignment, yeah I think the > >> goal > >> >> > would > >> >> > > > be > >> >> > > > > >> (a) > >> >> > > > > >> > > actually get better alignment in user experience, and > (b) > >> >> > > express > >> >> > > > > >> this in > >> >> > > > > >> > > the naming and project branding. Specifically: > >> >> > > > > >> > > 1. Website/docs, it would be nice for the > >> "transformation" > >> >> api > >> >> > > to > >> >> > > > be > >> >> > > > > >> > > discoverable in the main Kafka docs--i.e. be able to > >> explain > >> >> > > when > >> >> > > > to > >> >> > > > > >> use > >> >> > > > > >> > > the consumer and when to use the stream processing > >> >> > functionality > >> >> > > > and > >> >> > > > > >> lead > >> >> > > > > >> > > people into that experience. > >> >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or > >> whatever) > >> >> > that > >> >> > > > has > >> >> > > > > >> both > >> >> > > > > >> > > Kafka and the stream processing part and they actually > >> work > >> >> > > > > together. > >> >> > > > > >> > > 3. Unify the programming experience so the client and > >> Samza > >> >> > api > >> >> > > > > share > >> >> > > > > >> > > config/monitoring/naming/packaging/etc. > >> >> > > > > >> > > > >> >> > > > > >> > > I think sub-projects keep separate committers and can > >> have a > >> >> > > > > separate > >> >> > > > > >> > repo, > >> >> > > > > >> > > but I'm actually not really sure (I can't find a > >> definition > >> >> > of a > >> >> > > > > >> > subproject > >> >> > > > > >> > > in Apache). > >> >> > > > > >> > > > >> >> > > > > >> > > Basically at a high-level you want the experience to > >> "feel" > >> >> > > like a > >> >> > > > > >> single > >> >> > > > > >> > > system, not to relatively independent things that are > >> kind > >> >> of > >> >> > > > > >> awkwardly > >> >> > > > > >> > > glued together. > >> >> > > > > >> > > > >> >> > > > > >> > > I think if we did that they having naming or branding > >> like > >> >> > > "kafka > >> >> > > > > >> > > streaming" or "kafka streams" or something like that > >> would > >> >> > > > actually > >> >> > > > > >> do a > >> >> > > > > >> > > good job of conveying what it is. I do that this would > >> help > >> >> > > > adoption > >> >> > > > > >> > quite > >> >> > > > > >> > > a lot as it would correctly convey that using Kafka > >> >> Streaming > >> >> > > with > >> >> > > > > >> Kafka > >> >> > > > > >> > is > >> >> > > > > >> > > a fairly seamless experience and Kafka is pretty > heavily > >> >> > adopted > >> >> > > > at > >> >> > > > > >> this > >> >> > > > > >> > > point. > >> >> > > > > >> > > > >> >> > > > > >> > > Fwiw we actually considered this model originally when > >> open > >> >> > > > sourcing > >> >> > > > > >> > Samza, > >> >> > > > > >> > > however at that time Kafka was relatively unknown and > we > >> >> > decided > >> >> > > > not > >> >> > > > > >> to > >> >> > > > > >> > do > >> >> > > > > >> > > it since we felt it would be limiting. From my point > of > >> view > >> >> > the > >> >> > > > > three > >> >> > > > > >> > > things have changed (1) Kafka is now really heavily > used > >> for > >> >> > > > stream > >> >> > > > > >> > > processing, (2) we learned that abstracting out the > >> stream > >> >> > well > >> >> > > is > >> >> > > > > >> > > basically impossible, (3) we learned it is really > hard to > >> >> keep > >> >> > > the > >> >> > > > > two > >> >> > > > > >> > > things feeling like a single product. > >> >> > > > > >> > > > >> >> > > > > >> > > -Jay > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann < > >> >> > > > > >> mar...@kleppmann.com> > >> >> > > > > >> > > wrote: > >> >> > > > > >> > > > >> >> > > > > >> > >> Hi all, > >> >> > > > > >> > >> > >> >> > > > > >> > >> Lots of good thoughts here. > >> >> > > > > >> > >> > >> >> > > > > >> > >> I agree with the general philosophy of tying Samza > more > >> >> > firmly > >> >> > > to > >> >> > > > > >> Kafka. > >> >> > > > > >> > >> After I spent a while looking at integrating other > >> message > >> >> > > > brokers > >> >> > > > > >> (e.g. > >> >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the > conclusion > >> that > >> >> > > > > >> > SystemConsumer > >> >> > > > > >> > >> tacitly assumes a model so much like Kafka's that > pretty > >> >> much > >> >> > > > > nobody > >> >> > > > > >> but > >> >> > > > > >> > >> Kafka actually implements it. (Databus is perhaps an > >> >> > exception, > >> >> > > > but > >> >> > > > > >> it > >> >> > > > > >> > >> isn't widely used outside of LinkedIn.) Thus, making > >> Samza > >> >> > > fully > >> >> > > > > >> > dependent > >> >> > > > > >> > >> on Kafka acknowledges that the system-independence > was > >> >> never > >> >> > as > >> >> > > > > real > >> >> > > > > >> as > >> >> > > > > >> > we > >> >> > > > > >> > >> perhaps made it out to be. The gains of code reuse > are > >> >> real. > >> >> > > > > >> > >> > >> >> > > > > >> > >> The idea of decoupling Samza from YARN has also > always > >> been > >> >> > > > > >> appealing to > >> >> > > > > >> > >> me, for various reasons already mentioned in this > >> thread. > >> >> > > > Although > >> >> > > > > >> > making > >> >> > > > > >> > >> Samza jobs deployable on anything > (YARN/Mesos/AWS/etc) > >> >> seems > >> >> > > > > >> laudable, > >> >> > > > > >> > I am > >> >> > > > > >> > >> a little concerned that it will restrict us to a > lowest > >> >> > common > >> >> > > > > >> > denominator. > >> >> > > > > >> > >> For example, would host affinity (SAMZA-617) still be > >> >> > possible? > >> >> > > > For > >> >> > > > > >> jobs > >> >> > > > > >> > >> with large amounts of state, I think SAMZA-617 would > be > >> a > >> >> big > >> >> > > > boon, > >> >> > > > > >> > since > >> >> > > > > >> > >> restoring state off the changelog on every single > >> restart > >> >> is > >> >> > > > > painful, > >> >> > > > > >> > due > >> >> > > > > >> > >> to long recovery times. It would be a shame if the > >> >> decoupling > >> >> > > > from > >> >> > > > > >> YARN > >> >> > > > > >> > >> made host affinity impossible. > >> >> > > > > >> > >> > >> >> > > > > >> > >> Jay, a question about the proposed API for > >> instantiating a > >> >> > job > >> >> > > in > >> >> > > > > >> code > >> >> > > > > >> > >> (rather than a properties file): when submitting a > job > >> to a > >> >> > > > > cluster, > >> >> > > > > >> is > >> >> > > > > >> > the > >> >> > > > > >> > >> idea that the instantiation code runs on a client > >> >> somewhere, > >> >> > > > which > >> >> > > > > >> then > >> >> > > > > >> > >> pokes the necessary endpoints on YARN/Mesos/AWS/etc? > Or > >> >> does > >> >> > > that > >> >> > > > > >> code > >> >> > > > > >> > run > >> >> > > > > >> > >> on each container that is part of the job (in which > >> case, > >> >> how > >> >> > > > does > >> >> > > > > >> the > >> >> > > > > >> > job > >> >> > > > > >> > >> submission to the cluster work)? > >> >> > > > > >> > >> > >> >> > > > > >> > >> I agree with Garry that it doesn't feel right to > make a > >> 1.0 > >> >> > > > release > >> >> > > > > >> > with a > >> >> > > > > >> > >> plan for it to be immediately obsolete. So if this is > >> going > >> >> > to > >> >> > > > > >> happen, I > >> >> > > > > >> > >> think it would be more honest to stick with 0.* > version > >> >> > numbers > >> >> > > > > until > >> >> > > > > >> > the > >> >> > > > > >> > >> library-ified Samza has been implemented, is stable > and > >> >> > widely > >> >> > > > > used. > >> >> > > > > >> > >> > >> >> > > > > >> > >> Should the new Samza be a subproject of Kafka? There > is > >> >> > > precedent > >> >> > > > > for > >> >> > > > > >> > >> tight coupling between different Apache projects > (e.g. > >> >> > Curator > >> >> > > > and > >> >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think remaining > >> >> separate > >> >> > > > would > >> >> > > > > >> be > >> >> > > > > >> > ok. > >> >> > > > > >> > >> Even if Samza is fully dependent on Kafka, there is > >> enough > >> >> > > > > substance > >> >> > > > > >> in > >> >> > > > > >> > >> Samza that it warrants being a separate project. An > >> >> argument > >> >> > in > >> >> > > > > >> favour > >> >> > > > > >> > of > >> >> > > > > >> > >> merging would be if we think Kafka has a much > stronger > >> >> "brand > >> >> > > > > >> presence" > >> >> > > > > >> > >> than Samza; I'm ambivalent on that one. If the Kafka > >> >> project > >> >> > is > >> >> > > > > >> willing > >> >> > > > > >> > to > >> >> > > > > >> > >> endorse Samza as the "official" way of doing stateful > >> >> stream > >> >> > > > > >> > >> transformations, that would probably have much the > same > >> >> > effect > >> >> > > as > >> >> > > > > >> > >> re-branding Samza as "Kafka Stream Processors" or > >> suchlike. > >> >> > > Close > >> >> > > > > >> > >> collaboration between the two projects will be > needed in > >> >> any > >> >> > > > case. > >> >> > > > > >> > >> > >> >> > > > > >> > >> From a project management perspective, I guess the > "new > >> >> > Samza" > >> >> > > > > would > >> >> > > > > >> > have > >> >> > > > > >> > >> to be developed on a branch alongside ongoing > >> maintenance > >> >> of > >> >> > > the > >> >> > > > > >> current > >> >> > > > > >> > >> line of development? I think it would be important to > >> >> > continue > >> >> > > > > >> > supporting > >> >> > > > > >> > >> existing users, and provide a graceful migration > path to > >> >> the > >> >> > > new > >> >> > > > > >> > version. > >> >> > > > > >> > >> Leaving the current versions unsupported and forcing > >> people > >> >> > to > >> >> > > > > >> rewrite > >> >> > > > > >> > >> their jobs would send a bad signal. > >> >> > > > > >> > >> > >> >> > > > > >> > >> Best, > >> >> > > > > >> > >> Martin > >> >> > > > > >> > >> > >> >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io > > > >> >> wrote: > >> >> > > > > >> > >> > >> >> > > > > >> > >>> Hey Garry, > >> >> > > > > >> > >>> > >> >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy to chat > >> more > >> >> > about > >> >> > > > > this > >> >> > > > > >> if > >> >> > > > > >> > >>> you'd be interested. I think Chris and I started > with > >> the > >> >> > idea > >> >> > > > of > >> >> > > > > >> "what > >> >> > > > > >> > >>> would it take to make Samza a kick-ass ingestion > tool" > >> but > >> >> > > > > >> ultimately > >> >> > > > > >> > we > >> >> > > > > >> > >>> kind of came around to the idea that ingestion and > >> >> > > > transformation > >> >> > > > > >> had > >> >> > > > > >> > >>> pretty different needs and coupling the two made > things > >> >> > hard. > >> >> > > > > >> > >>> > >> >> > > > > >> > >>> For what it's worth I think copycat (KIP-26) > actually > >> will > >> >> > do > >> >> > > > what > >> >> > > > > >> you > >> >> > > > > >> > >> are > >> >> > > > > >> > >>> looking for. > >> >> > > > > >> > >>> > >> >> > > > > >> > >>> With regard to your point about slider, I don't > >> >> necessarily > >> >> > > > > >> disagree. > >> >> > > > > >> > >> But I > >> >> > > > > >> > >>> think getting good YARN support is quite doable and > I > >> >> think > >> >> > we > >> >> > > > can > >> >> > > > > >> make > >> >> > > > > >> > >>> that work well. I think the issue this proposal > solves > >> is > >> >> > that > >> >> > > > > >> > >> technically > >> >> > > > > >> > >>> it is pretty hard to support multiple cluster > >> management > >> >> > > systems > >> >> > > > > the > >> >> > > > > >> > way > >> >> > > > > >> > >>> things are now, you need to write an "app master" or > >> >> > > "framework" > >> >> > > > > for > >> >> > > > > >> > each > >> >> > > > > >> > >>> and they are all a little different so testing is > >> really > >> >> > hard. > >> >> > > > In > >> >> > > > > >> the > >> >> > > > > >> > >>> absence of this we have been stuck with just YARN > which > >> >> has > >> >> > > > > >> fantastic > >> >> > > > > >> > >>> penetration in the Hadoopy part of the org, but zero > >> >> > > penetration > >> >> > > > > >> > >> elsewhere. > >> >> > > > > >> > >>> Given the huge amount of work being put in to > slider, > >> >> > > marathon, > >> >> > > > > aws > >> >> > > > > >> > >>> tooling, not to mention the umpteen related > packaging > >> >> > > > technologies > >> >> > > > > >> > people > >> >> > > > > >> > >>> want to use (Docker, Kubernetes, various > cloud-specific > >> >> > deploy > >> >> > > > > >> tools, > >> >> > > > > >> > >> etc) > >> >> > > > > >> > >>> I really think it is important to get this right. > >> >> > > > > >> > >>> > >> >> > > > > >> > >>> -Jay > >> >> > > > > >> > >>> > >> >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington < > >> >> > > > > >> > >>> g.turking...@improvedigital.com> wrote: > >> >> > > > > >> > >>> > >> >> > > > > >> > >>>> Hi all, > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> I think the question below re does Samza become a > >> >> > sub-project > >> >> > > > of > >> >> > > > > >> Kafka > >> >> > > > > >> > >>>> highlights the broader point around migration. > Chris > >> >> > mentions > >> >> > > > > >> Samza's > >> >> > > > > >> > >>>> maturity is heading towards a v1 release but I'm > not > >> sure > >> >> > it > >> >> > > > > feels > >> >> > > > > >> > >> right to > >> >> > > > > >> > >>>> launch a v1 then immediately plan to deprecate > most of > >> >> it. > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> From a selfish perspective I have some guys who > have > >> >> > started > >> >> > > > > >> working > >> >> > > > > >> > >> with > >> >> > > > > >> > >>>> Samza and building some new consumers/producers was > >> next > >> >> > up. > >> >> > > > > Sounds > >> >> > > > > >> > like > >> >> > > > > >> > >>>> that is absolutely not the direction to go. I need > to > >> >> look > >> >> > > into > >> >> > > > > the > >> >> > > > > >> > KIP > >> >> > > > > >> > >> in > >> >> > > > > >> > >>>> more detail but for me the attractiveness of adding > >> new > >> >> > Samza > >> >> > > > > >> > >>>> consumer/producers -- even if yes all they were > doing > >> was > >> >> > > > really > >> >> > > > > >> > getting > >> >> > > > > >> > >>>> data into and out of Kafka -- was to avoid > having to > >> >> > worry > >> >> > > > > about > >> >> > > > > >> the > >> >> > > > > >> > >>>> lifecycle management of external clients. If there > is > >> a > >> >> > > generic > >> >> > > > > >> Kafka > >> >> > > > > >> > >>>> ingress/egress layer that I can plug a new > connector > >> into > >> >> > and > >> >> > > > > have > >> >> > > > > >> a > >> >> > > > > >> > >> lot of > >> >> > > > > >> > >>>> the heavy lifting re scale and reliability done > for me > >> >> then > >> >> > > it > >> >> > > > > >> gives > >> >> > > > > >> > me > >> >> > > > > >> > >> all > >> >> > > > > >> > >>>> the pushing new consumers/producers would. If not > >> then it > >> >> > > > > >> complicates > >> >> > > > > >> > my > >> >> > > > > >> > >>>> operational deployments. > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> Which is similar to my other question with the > >> proposal > >> >> -- > >> >> > if > >> >> > > > we > >> >> > > > > >> > build a > >> >> > > > > >> > >>>> fully available/stand-alone Samza plus the > requisite > >> >> shims > >> >> > to > >> >> > > > > >> > integrate > >> >> > > > > >> > >>>> with Slider etc I suspect the former may be a lot > more > >> >> work > >> >> > > > than > >> >> > > > > we > >> >> > > > > >> > >> think. > >> >> > > > > >> > >>>> We may make it much easier for a newcomer to get > >> >> something > >> >> > > > > running > >> >> > > > > >> but > >> >> > > > > >> > >>>> having them step up and get a reliable production > >> >> > deployment > >> >> > > > may > >> >> > > > > >> still > >> >> > > > > >> > >>>> dominate mailing list traffic, if for different > >> reasons > >> >> > than > >> >> > > > > >> today. > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with making > the > >> >> Samza > >> >> > > > > >> dependency > >> >> > > > > >> > >> on > >> >> > > > > >> > >>>> Kafka much more explicit and I absolutely see the > >> >> benefits > >> >> > > in > >> >> > > > > the > >> >> > > > > >> > >>>> reduction of duplication and clashing > >> >> > > > terminologies/abstractions > >> >> > > > > >> that > >> >> > > > > >> > >>>> Chris/Jay describe. Samza as a library would likely > >> be a > >> >> > very > >> >> > > > > nice > >> >> > > > > >> > tool > >> >> > > > > >> > >> to > >> >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the > concerns > >> >> above > >> >> > re > >> >> > > > the > >> >> > > > > >> > >>>> operational side. > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> Garry > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> -----Original Message----- > >> >> > > > > >> > >>>> From: Gianmarco De Francisci Morales [mailto: > >> >> > g...@apache.org > >> >> > > ] > >> >> > > > > >> > >>>> Sent: 02 July 2015 12:56 > >> >> > > > > >> > >>>> To: dev@samza.apache.org > >> >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on Samza > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> Very interesting thoughts. > >> >> > > > > >> > >>>> From outside, I have always perceived Samza as a > >> >> computing > >> >> > > > layer > >> >> > > > > >> over > >> >> > > > > >> > >>>> Kafka. > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> The question, maybe a bit provocative, is "should > >> Samza > >> >> be > >> >> > a > >> >> > > > > >> > sub-project > >> >> > > > > >> > >>>> of Kafka then?" > >> >> > > > > >> > >>>> Or does it make sense to keep it as a separate > project > >> >> > with a > >> >> > > > > >> separate > >> >> > > > > >> > >>>> governance? > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> Cheers, > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> -- > >> >> > > > > >> > >>>> Gianmarco > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang < > >> yanfang...@gmail.com> > >> >> > > > wrote: > >> >> > > > > >> > >>>> > >> >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more > tightly. > >> >> > Because > >> >> > > > > Samza > >> >> > > > > >> de > >> >> > > > > >> > >>>>> facto is based on Kafka, and it should leverage > what > >> >> Kafka > >> >> > > > has. > >> >> > > > > At > >> >> > > > > >> > the > >> >> > > > > >> > >>>>> same time, Kafka does not need to reinvent what > Samza > >> >> > > already > >> >> > > > > >> has. I > >> >> > > > > >> > >>>>> also like the idea of separating the ingestion and > >> >> > > > > transformation. > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> But it is a little difficult for me to image how > the > >> >> Samza > >> >> > > > will > >> >> > > > > >> look > >> >> > > > > >> > >>>> like. > >> >> > > > > >> > >>>>> And I feel Chris and Jay have a little difference > in > >> >> terms > >> >> > > of > >> >> > > > > how > >> >> > > > > >> > >>>>> Samza should look like. > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> *** Will it look like what Jay's code shows (A > >> client of > >> >> > > > Kakfa) > >> >> > > > > ? > >> >> > > > > >> And > >> >> > > > > >> > >>>>> user's application code calls this client? > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka (like > what > >> the > >> >> > > code > >> >> > > > > >> shows), > >> >> > > > > >> > >>>>> how do we implement auto-balance and > fault-tolerance? > >> >> Are > >> >> > > they > >> >> > > > > >> taken > >> >> > > > > >> > >>>>> care by the Kafka broker or other mechanism, such > as > >> >> > "Samza > >> >> > > > > >> worker" > >> >> > > > > >> > >>>>> (just make up the name) ? > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> 2. What about other features, such as > auto-scaling, > >> >> shared > >> >> > > > > state, > >> >> > > > > >> > >>>>> monitoring? > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> *** If we have Samza standalone, (is this what > Chris > >> >> > > > suggests?) > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> 1. we still need to ingest data from Kakfa and > >> produce > >> >> to > >> >> > > it. > >> >> > > > > >> Then it > >> >> > > > > >> > >>>>> becomes the same as what Samza looks like now, > >> except it > >> >> > > does > >> >> > > > > not > >> >> > > > > >> > rely > >> >> > > > > >> > >>>>> on Yarn anymore. > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> 2. if it is standalone, how can it leverage > Kafka's > >> >> > metrics, > >> >> > > > > logs, > >> >> > > > > >> > >>>>> etc? Use Kafka code as the dependency? > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> Thanks, > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> Fang, Yan > >> >> > > > > >> > >>>>> yanfang...@gmail.com > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang < > >> >> > > > > wangg...@gmail.com > >> >> > > > > >> > > >> >> > > > > >> > >>>> wrote: > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>>>> Read through the code example and it looks good > to > >> me. > >> >> A > >> >> > > few > >> >> > > > > >> > >>>>>> thoughts regarding deployment: > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> Today Samza deploys as executable runnable like: > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=... > >> >> > > > > >> > >>>> --config-path=file://... > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> And this proposal advocate for deploying Samza > more > >> as > >> >> > > > embedded > >> >> > > > > >> > >>>>>> libraries in user application code (ignoring the > >> >> > > terminology > >> >> > > > > >> since > >> >> > > > > >> > >>>>>> it is not the > >> >> > > > > >> > >>>>> same > >> >> > > > > >> > >>>>>> as the prototype code): > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> StreamTask task = new MyStreamTask(configs); > Thread > >> >> > thread > >> >> > > = > >> >> > > > > new > >> >> > > > > >> > >>>>>> Thread(task); thread.start(); > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> I think both of these deployment modes are > important > >> >> for > >> >> > > > > >> different > >> >> > > > > >> > >>>>>> types > >> >> > > > > >> > >>>>> of > >> >> > > > > >> > >>>>>> users. That said, I think making Samza purely > >> >> standalone > >> >> > is > >> >> > > > > still > >> >> > > > > >> > >>>>>> sufficient for either runnable or library modes. > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> Guozhang > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps < > >> >> > > > j...@confluent.io> > >> >> > > > > >> > wrote: > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>>> Looks like gmail mangled the code example, it > was > >> >> > supposed > >> >> > > > to > >> >> > > > > >> look > >> >> > > > > >> > >>>>>>> like > >> >> > > > > >> > >>>>>>> this: > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>>> Properties props = new Properties(); > >> >> > > > > >> > >>>>>>> props.put("bootstrap.servers", > "localhost:4242"); > >> >> > > > > >> StreamingConfig > >> >> > > > > >> > >>>>>>> config = new StreamingConfig(props); > >> >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", > "test-topic-2"); > >> >> > > > > >> > >>>>>>> config.processor(ExampleStreamProcessor.class); > >> >> > > > > >> > >>>>>>> config.serialization(new StringSerializer(), new > >> >> > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming container > = > >> new > >> >> > > > > >> > >>>>>>> KafkaStreaming(config); container.run(); > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>>> -Jay > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps < > >> >> > > > j...@confluent.io > >> >> > > > > > > >> >> > > > > >> > >>>> wrote: > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>>>> Hey guys, > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> This came out of some conversations Chris and I > >> were > >> >> > > having > >> >> > > > > >> > >>>>>>>> around > >> >> > > > > >> > >>>>>>> whether > >> >> > > > > >> > >>>>>>>> it would make sense to use Samza as a kind of > data > >> >> > > > ingestion > >> >> > > > > >> > >>>>> framework > >> >> > > > > >> > >>>>>>> for > >> >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26 > "copycat"). > >> >> This > >> >> > > > kind > >> >> > > > > of > >> >> > > > > >> > >>>>>> combined > >> >> > > > > >> > >>>>>>>> with complaints around config and YARN and the > >> >> > discussion > >> >> > > > > >> around > >> >> > > > > >> > >>>>>>>> how > >> >> > > > > >> > >>>>> to > >> >> > > > > >> > >>>>>>>> best do a standalone mode. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> So the thought experiment was, given that Samza > >> was > >> >> > > > basically > >> >> > > > > >> > >>>>>>>> already totally Kafka specific, what if you > just > >> >> > embraced > >> >> > > > > that > >> >> > > > > >> > >>>>>>>> and turned it > >> >> > > > > >> > >>>>>> into > >> >> > > > > >> > >>>>>>>> something less like a heavyweight framework and > >> more > >> >> > > like a > >> >> > > > > >> > >>>>>>>> third > >> >> > > > > >> > >>>>> Kafka > >> >> > > > > >> > >>>>>>>> client--a kind of "producing consumer" with > state > >> >> > > > management > >> >> > > > > >> > >>>>>> facilities. > >> >> > > > > >> > >>>>>>>> Basically a library. Instead of a complex > stream > >> >> > > processing > >> >> > > > > >> > >>>>>>>> framework > >> >> > > > > >> > >>>>>>> this > >> >> > > > > >> > >>>>>>>> would actually be a very simple thing, not much > >> more > >> >> > > > > >> complicated > >> >> > > > > >> > >>>>>>>> to > >> >> > > > > >> > >>>>> use > >> >> > > > > >> > >>>>>>> or > >> >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris said we > >> >> thought > >> >> > > > about > >> >> > > > > >> it > >> >> > > > > >> > >>>>>>>> a > >> >> > > > > >> > >>>>> lot > >> >> > > > > >> > >>>>>> of > >> >> > > > > >> > >>>>>>>> what Samza (and the other stream processing > >> systems > >> >> > were > >> >> > > > > doing) > >> >> > > > > >> > >>>>> seemed > >> >> > > > > >> > >>>>>>> like > >> >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> Of course you need to ingest/output data to and > >> from > >> >> > the > >> >> > > > > stream > >> >> > > > > >> > >>>>>>>> processing. But when we actually looked into > how > >> that > >> >> > > would > >> >> > > > > >> > >>>>>>>> work, > >> >> > > > > >> > >>>>> Samza > >> >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion framework > >> for a > >> >> > > bunch > >> >> > > > of > >> >> > > > > >> > >>>>> reasons. > >> >> > > > > >> > >>>>>> To > >> >> > > > > >> > >>>>>>>> really do that right you need a pretty > different > >> >> > internal > >> >> > > > > data > >> >> > > > > >> > >>>>>>>> model > >> >> > > > > >> > >>>>>> and > >> >> > > > > >> > >>>>>>>> set of apis. So what if you split them and had > an > >> api > >> >> > for > >> >> > > > > Kafka > >> >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) and a > separate > >> >> api > >> >> > > for > >> >> > > > > >> Kafka > >> >> > > > > >> > >>>>>>>> transformation (Samza). > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> This would also allow really embracing the same > >> >> > > terminology > >> >> > > > > and > >> >> > > > > >> > >>>>>>>> conventions. One complaint about the current > >> state is > >> >> > > that > >> >> > > > > the > >> >> > > > > >> > >>>>>>>> two > >> >> > > > > >> > >>>>>>> systems > >> >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology like > "stream" > >> vs > >> >> > > > "topic" > >> >> > > > > >> and > >> >> > > > > >> > >>>>>>> different > >> >> > > > > >> > >>>>>>>> config and monitoring systems means you kind of > >> have > >> >> to > >> >> > > > learn > >> >> > > > > >> > >>>>>>>> Kafka's > >> >> > > > > >> > >>>>>>> way, > >> >> > > > > >> > >>>>>>>> then learn Samza's slightly different way, then > >> kind > >> >> of > >> >> > > > > >> > >>>>>>>> understand > >> >> > > > > >> > >>>>> how > >> >> > > > > >> > >>>>>>> they > >> >> > > > > >> > >>>>>>>> map to each other, which having walked a few > >> people > >> >> > > through > >> >> > > > > >> this > >> >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to get. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> Since I have been spending a lot of time on > >> >> airplanes I > >> >> > > > > hacked > >> >> > > > > >> > >>>>>>>> up an ernest but still somewhat incomplete > >> prototype > >> >> of > >> >> > > > what > >> >> > > > > >> > >>>>>>>> this would > >> >> > > > > >> > >>>>> look > >> >> > > > > >> > >>>>>>>> like. This is just unceremoniously dumped into > >> Kafka > >> >> as > >> >> > > it > >> >> > > > > >> > >>>>>>>> required a > >> >> > > > > >> > >>>>>> few > >> >> > > > > >> > >>>>>>>> changes to the new consumer. Here is the code: > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>> > >> >> > > > > >> > > >> >> > > > > > >> >> > > >> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org > >> >> > > > > >> > >>>>> /apache/kafka/clients/streaming > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> For the purpose of the prototype I just > liberally > >> >> > renamed > >> >> > > > > >> > >>>>>>>> everything > >> >> > > > > >> > >>>>> to > >> >> > > > > >> > >>>>>>>> try to align it with Kafka with no regard for > >> >> > > > compatibility. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> To use this would be something like this: > >> >> > > > > >> > >>>>>>>> Properties props = new Properties(); > >> >> > > > > >> > >>>>>>>> props.put("bootstrap.servers", > "localhost:4242"); > >> >> > > > > >> > >>>>>>>> StreamingConfig config = new > >> >> > > > > >> > >>>>> StreamingConfig(props); > >> >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", > >> >> > > > > >> > >>>>>>>> "test-topic-2"); > >> >> > > > > >> config.processor(ExampleStreamProcessor.class); > >> >> > > > > >> > >>>>>>> config.serialization(new > >> >> > > > > >> > >>>>>>>> StringSerializer(), new StringDeserializer()); > >> >> > > > KafkaStreaming > >> >> > > > > >> > >>>>>> container = > >> >> > > > > >> > >>>>>>>> new KafkaStreaming(config); container.run(); > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> KafkaStreaming is basically the SamzaContainer; > >> >> > > > > StreamProcessor > >> >> > > > > >> > >>>>>>>> is basically StreamTask. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> So rather than putting all the class names in a > >> file > >> >> > and > >> >> > > > then > >> >> > > > > >> > >>>>>>>> having > >> >> > > > > >> > >>>>>> the > >> >> > > > > >> > >>>>>>>> job assembled by reflection, you just > instantiate > >> the > >> >> > > > > container > >> >> > > > > >> > >>>>>>>> programmatically. Work is balanced over however > >> many > >> >> > > > > instances > >> >> > > > > >> > >>>>>>>> of > >> >> > > > > >> > >>>>> this > >> >> > > > > >> > >>>>>>> are > >> >> > > > > >> > >>>>>>>> alive at any time (i.e. if an instance dies, > new > >> >> tasks > >> >> > > are > >> >> > > > > >> added > >> >> > > > > >> > >>>>>>>> to > >> >> > > > > >> > >>>>> the > >> >> > > > > >> > >>>>>>>> existing containers without shutting them > down). > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> We would provide some glue for running this > stuff > >> in > >> >> > YARN > >> >> > > > via > >> >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS using some > of > >> >> their > >> >> > > > tools > >> >> > > > > >> > >>>>>>>> but from the > >> >> > > > > >> > >>>>>> point > >> >> > > > > >> > >>>>>>> of > >> >> > > > > >> > >>>>>>>> view of these frameworks these stream > processing > >> jobs > >> >> > are > >> >> > > > > just > >> >> > > > > >> > >>>>>> stateless > >> >> > > > > >> > >>>>>>>> services that can come and go and expand and > >> contract > >> >> > at > >> >> > > > > will. > >> >> > > > > >> > >>>>>>>> There > >> >> > > > > >> > >>>>> is > >> >> > > > > >> > >>>>>>> no > >> >> > > > > >> > >>>>>>>> more custom scheduler. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> Here are some relevant details: > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> 1. It is only ~1300 lines of code, it would > get > >> >> larger > >> >> > > if > >> >> > > > we > >> >> > > > > >> > >>>>>>>> productionized but not vastly larger. We > really > >> do > >> >> > get a > >> >> > > > ton > >> >> > > > > >> > >>>>>>>> of > >> >> > > > > >> > >>>>>>> leverage > >> >> > > > > >> > >>>>>>>> out of Kafka. > >> >> > > > > >> > >>>>>>>> 2. Partition management is fully delegated to > the > >> >> new > >> >> > > > > >> consumer. > >> >> > > > > >> > >>>>> This > >> >> > > > > >> > >>>>>>>> is nice since now any partition management > >> strategy > >> >> > > > > available > >> >> > > > > >> > >>>>>>>> to > >> >> > > > > >> > >>>>>> Kafka > >> >> > > > > >> > >>>>>>>> consumer is also available to Samza (and vice > >> versa) > >> >> > and > >> >> > > > > with > >> >> > > > > >> > >>>>>>>> the > >> >> > > > > >> > >>>>>>> exact > >> >> > > > > >> > >>>>>>>> same configs. > >> >> > > > > >> > >>>>>>>> 3. It supports state as well as state reuse > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is thought > >> >> provoking. > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> -Jay > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris > Riccomini < > >> >> > > > > >> > >>>>>> criccom...@apache.org> > >> >> > > > > >> > >>>>>>>> wrote: > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>>> Hey all, > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> I have had some discussions with Samza > engineers > >> at > >> >> > > > LinkedIn > >> >> > > > > >> > >>>>>>>>> and > >> >> > > > > >> > >>>>>>> Confluent > >> >> > > > > >> > >>>>>>>>> and we came up with a few observations and > would > >> >> like > >> >> > to > >> >> > > > > >> > >>>>>>>>> propose > >> >> > > > > >> > >>>>> some > >> >> > > > > >> > >>>>>>>>> changes. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> We've observed some things that I want to call > >> out > >> >> > about > >> >> > > > > >> > >>>>>>>>> Samza's > >> >> > > > > >> > >>>>>> design, > >> >> > > > > >> > >>>>>>>>> and I'd like to propose some changes. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment > >> >> system. > >> >> > > > > >> > >>>>>>>>> * Samza is too pluggable. > >> >> > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and > >> Kafka's > >> >> > > > consumer > >> >> > > > > >> > >>>>>>>>> APIs > >> >> > > > > >> > >>>>> are > >> >> > > > > >> > >>>>>>>>> trying to solve a lot of the same problems. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> All three of these issues are related, but > I'll > >> >> > address > >> >> > > > them > >> >> > > > > >> in > >> >> > > > > >> > >>>>> order. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Deployment > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a dynamic > >> >> > > deployment > >> >> > > > > >> > >>>>>>>>> scheduler > >> >> > > > > >> > >>>>>> such > >> >> > > > > >> > >>>>>>>>> as > >> >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially built > Samza, > >> we > >> >> > bet > >> >> > > > that > >> >> > > > > >> > >>>>>>>>> there > >> >> > > > > >> > >>>>>> would > >> >> > > > > >> > >>>>>>>>> be > >> >> > > > > >> > >>>>>>>>> one or two winners in this area, and we could > >> >> support > >> >> > > > them, > >> >> > > > > >> and > >> >> > > > > >> > >>>>>>>>> the > >> >> > > > > >> > >>>>>> rest > >> >> > > > > >> > >>>>>>>>> would go away. In reality, there are many > >> >> variations. > >> >> > > > > >> > >>>>>>>>> Furthermore, > >> >> > > > > >> > >>>>>> many > >> >> > > > > >> > >>>>>>>>> people still prefer to just start their > >> processors > >> >> > like > >> >> > > > > normal > >> >> > > > > >> > >>>>>>>>> Java processes, and use traditional deployment > >> >> scripts > >> >> > > > such > >> >> > > > > as > >> >> > > > > >> > >>>>>>>>> Fabric, > >> >> > > > > >> > >>>>>> Chef, > >> >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment system on > >> users > >> >> > makes > >> >> > > > the > >> >> > > > > >> > >>>>>>>>> Samza start-up process really painful for > first > >> time > >> >> > > > users. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement was also a > >> bit > >> >> of > >> >> > a > >> >> > > > > >> > >>>>>>>>> mis-fire > >> >> > > > > >> > >>>>>> because > >> >> > > > > >> > >>>>>>>>> of > >> >> > > > > >> > >>>>>>>>> a fundamental misunderstanding between the > >> nature of > >> >> > > batch > >> >> > > > > >> jobs > >> >> > > > > >> > >>>>>>>>> and > >> >> > > > > >> > >>>>>>> stream > >> >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made conscious > >> effort > >> >> to > >> >> > > > favor > >> >> > > > > >> > >>>>>>>>> the > >> >> > > > > >> > >>>>>> Hadoop > >> >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things, since it > worked > >> >> and > >> >> > > was > >> >> > > > > well > >> >> > > > > >> > >>>>>>> understood. > >> >> > > > > >> > >>>>>>>>> One thing that we missed was that batch jobs > >> have a > >> >> > > > definite > >> >> > > > > >> > >>>>>> beginning, > >> >> > > > > >> > >>>>>>>>> and > >> >> > > > > >> > >>>>>>>>> end, and stream processing jobs don't > (usually). > >> >> This > >> >> > > > leads > >> >> > > > > to > >> >> > > > > >> > >>>>>>>>> a > >> >> > > > > >> > >>>>> much > >> >> > > > > >> > >>>>>>>>> simpler scheduling problem for stream > processors. > >> >> You > >> >> > > > > >> basically > >> >> > > > > >> > >>>>>>>>> just > >> >> > > > > >> > >>>>>>> need > >> >> > > > > >> > >>>>>>>>> to find a place to start the processor, and > start > >> >> it. > >> >> > > The > >> >> > > > > way > >> >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no concept > of > >> a > >> >> > > cluster > >> >> > > > > >> > >>>>>>>>> being "full". We always > >> >> > > > > >> > >>>>>> add > >> >> > > > > >> > >>>>>>>>> more machines. The problem with coupling Samza > >> with > >> >> a > >> >> > > > > >> scheduler > >> >> > > > > >> > >>>>>>>>> is > >> >> > > > > >> > >>>>>> that > >> >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to handle > >> deployment. > >> >> > > This > >> >> > > > > >> pulls > >> >> > > > > >> > >>>>>>>>> in a > >> >> > > > > >> > >>>>>>> bunch > >> >> > > > > >> > >>>>>>>>> of things such as configuration distribution > >> (config > >> >> > > > > stream), > >> >> > > > > >> > >>>>>>>>> shell > >> >> > > > > >> > >>>>>>> scrips > >> >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all > the > >> .tgz > >> >> > > > stuff), > >> >> > > > > >> etc. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic > deployment > >> was > >> >> to > >> >> > > > > support > >> >> > > > > >> > >>>>>>>>> data locality. If you want to have locality, > you > >> >> need > >> >> > to > >> >> > > > put > >> >> > > > > >> > >>>>>>>>> your > >> >> > > > > >> > >>>>>> processors > >> >> > > > > >> > >>>>>>>>> close to the data they're processing. Upon > >> further > >> >> > > > > >> > >>>>>>>>> investigation, > >> >> > > > > >> > >>>>>>> though, > >> >> > > > > >> > >>>>>>>>> this feature is not that beneficial. There is > >> some > >> >> > good > >> >> > > > > >> > >>>>>>>>> discussion > >> >> > > > > >> > >>>>>> about > >> >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335. Again, we > >> took > >> >> the > >> >> > > > > >> > >>>>>>>>> Map/Reduce > >> >> > > > > >> > >>>>>> path, > >> >> > > > > >> > >>>>>>>>> but > >> >> > > > > >> > >>>>>>>>> there are some fundamental differences between > >> HDFS > >> >> > and > >> >> > > > > Kafka. > >> >> > > > > >> > >>>>>>>>> HDFS > >> >> > > > > >> > >>>>>> has > >> >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions. This > leads to > >> >> less > >> >> > > > > >> > >>>>>>>>> optimization potential with stream processors > on > >> top > >> >> > of > >> >> > > > > Kafka. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> This feature is also used as a crutch. Samza > >> doesn't > >> >> > > have > >> >> > > > > any > >> >> > > > > >> > >>>>>>>>> built > >> >> > > > > >> > >>>>> in > >> >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it depends on > the > >> >> > > dynamic > >> >> > > > > >> > >>>>>>>>> deployment scheduling system to handle > restarts > >> >> when a > >> >> > > > > >> > >>>>>>>>> processor dies. This has > >> >> > > > > >> > >>>>>>> made > >> >> > > > > >> > >>>>>>>>> it very difficult to write a standalone Samza > >> >> > container > >> >> > > > > >> > >>>> (SAMZA-516). > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Pluggability > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> In some cases pluggability is good, but I > think > >> that > >> >> > > we've > >> >> > > > > >> gone > >> >> > > > > >> > >>>>>>>>> too > >> >> > > > > >> > >>>>>> far > >> >> > > > > >> > >>>>>>>>> with it. Currently, Samza has: > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> * Pluggable config. > >> >> > > > > >> > >>>>>>>>> * Pluggable metrics. > >> >> > > > > >> > >>>>>>>>> * Pluggable deployment systems. > >> >> > > > > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer, > >> >> > > > > SystemProducer, > >> >> > > > > >> > >>>> etc). > >> >> > > > > >> > >>>>>>>>> * Pluggable serdes. > >> >> > > > > >> > >>>>>>>>> * Pluggable storage engines. > >> >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every > >> >> component > >> >> > > > > >> > >>>>> (MessageChooser, > >> >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper, ConfigRewriter, > >> etc). > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> There's probably more that I've forgotten, as > >> well. > >> >> > Some > >> >> > > > of > >> >> > > > > >> > >>>>>>>>> these > >> >> > > > > >> > >>>>> are > >> >> > > > > >> > >>>>>>>>> useful, but some have proven not to be. This > all > >> >> comes > >> >> > > at > >> >> > > > a > >> >> > > > > >> cost: > >> >> > > > > >> > >>>>>>>>> complexity. This complexity is making it > harder > >> for > >> >> > our > >> >> > > > > users > >> >> > > > > >> > >>>>>>>>> to > >> >> > > > > >> > >>>>> pick > >> >> > > > > >> > >>>>>> up > >> >> > > > > >> > >>>>>>>>> and use Samza out of the box. It also makes it > >> >> > difficult > >> >> > > > for > >> >> > > > > >> > >>>>>>>>> Samza developers to reason about what the > >> >> > > characteristics > >> >> > > > of > >> >> > > > > >> > >>>>>>>>> the container (since the characteristics > change > >> >> > > depending > >> >> > > > on > >> >> > > > > >> > >>>>>>>>> which plugins are use). > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> The issues with pluggability are most visible > in > >> the > >> >> > > > System > >> >> > > > > >> APIs. > >> >> > > > > >> > >>>>> What > >> >> > > > > >> > >>>>>>>>> Samza really requires to be functional is > Kafka > >> as > >> >> its > >> >> > > > > >> > >>>>>>>>> transport > >> >> > > > > >> > >>>>>> layer. > >> >> > > > > >> > >>>>>>>>> But > >> >> > > > > >> > >>>>>>>>> we've conflated two unrelated use cases into > one > >> >> API: > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka. > >> >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> The current System API supports both of these > use > >> >> > cases. > >> >> > > > The > >> >> > > > > >> > >>>>>>>>> problem > >> >> > > > > >> > >>>>>> is, > >> >> > > > > >> > >>>>>>>>> we > >> >> > > > > >> > >>>>>>>>> actually want different features for each use > >> case. > >> >> By > >> >> > > > > >> papering > >> >> > > > > >> > >>>>>>>>> over > >> >> > > > > >> > >>>>>>> these > >> >> > > > > >> > >>>>>>>>> two use cases, and providing a single API, > we've > >> >> > > > introduced > >> >> > > > > a > >> >> > > > > >> > >>>>>>>>> ton of > >> >> > > > > >> > >>>>>>> leaky > >> >> > > > > >> > >>>>>>>>> abstractions. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> For example, what we'd really like in (2) is > to > >> have > >> >> > > > > >> > >>>>>>>>> monotonically increasing longs for offsets > (like > >> >> > Kafka). > >> >> > > > > This > >> >> > > > > >> > >>>>>>>>> would be at odds > >> >> > > > > >> > >>>>> with > >> >> > > > > >> > >>>>>>> (1), > >> >> > > > > >> > >>>>>>>>> though, since different systems have different > >> >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors. > >> >> > > > > >> > >>>>>>>>> There was discussion both on the mailing list > and > >> >> the > >> >> > > SQL > >> >> > > > > >> JIRAs > >> >> > > > > >> > >>>>> about > >> >> > > > > >> > >>>>>>> the > >> >> > > > > >> > >>>>>>>>> need for this. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> The same thing holds true for replayability. > >> Kafka > >> >> > > allows > >> >> > > > us > >> >> > > > > >> to > >> >> > > > > >> > >>>>> rewind > >> >> > > > > >> > >>>>>>>>> when > >> >> > > > > >> > >>>>>>>>> we have a failure. Many other systems don't. > In > >> some > >> >> > > > cases, > >> >> > > > > >> > >>>>>>>>> systems > >> >> > > > > >> > >>>>>>> return > >> >> > > > > >> > >>>>>>>>> null for their offsets (e.g. > >> >> WikipediaSystemConsumer) > >> >> > > > > because > >> >> > > > > >> > >>>>>>>>> they > >> >> > > > > >> > >>>>>> have > >> >> > > > > >> > >>>>>>> no > >> >> > > > > >> > >>>>>>>>> offsets. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka > supports > >> >> > > > > partitioning, > >> >> > > > > >> > >>>>>>>>> but > >> >> > > > > >> > >>>>> many > >> >> > > > > >> > >>>>>>>>> systems don't. We model this by having a > single > >> >> > > partition > >> >> > > > > for > >> >> > > > > >> > >>>>>>>>> those systems. Still, other systems model > >> >> partitioning > >> >> > > > > >> > >>>> differently (e.g. > >> >> > > > > >> > >>>>>>>>> Kinesis). > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess. > >> Creating > >> >> > > streams > >> >> > > > > in > >> >> > > > > >> a > >> >> > > > > >> > >>>>>>>>> system-agnostic way is almost impossible. As > is > >> >> > modeling > >> >> > > > > >> > >>>>>>>>> metadata > >> >> > > > > >> > >>>>> for > >> >> > > > > >> > >>>>>>> the > >> >> > > > > >> > >>>>>>>>> system (replication factor, partitions, > location, > >> >> > etc). > >> >> > > > The > >> >> > > > > >> > >>>>>>>>> list > >> >> > > > > >> > >>>>> goes > >> >> > > > > >> > >>>>>>> on. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Duplicate work > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> At the time that we began writing Samza, > Kafka's > >> >> > > consumer > >> >> > > > > and > >> >> > > > > >> > >>>>> producer > >> >> > > > > >> > >>>>>>>>> APIs > >> >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On the > >> >> > consumer-side, > >> >> > > > you > >> >> > > > > >> > >>>>>>>>> had two > >> >> > > > > >> > >>>>>>>>> options: use the high level consumer, or the > >> simple > >> >> > > > > consumer. > >> >> > > > > >> > >>>>>>>>> The > >> >> > > > > >> > >>>>>>> problem > >> >> > > > > >> > >>>>>>>>> with the high-level consumer was that it > >> controlled > >> >> > your > >> >> > > > > >> > >>>>>>>>> offsets, partition assignments, and the order > in > >> >> which > >> >> > > you > >> >> > > > > >> > >>>>>>>>> received messages. The > >> >> > > > > >> > >>>>> problem > >> >> > > > > >> > >>>>>>>>> with > >> >> > > > > >> > >>>>>>>>> the simple consumer is that it's not simple. > It's > >> >> > basic. > >> >> > > > You > >> >> > > > > >> > >>>>>>>>> end up > >> >> > > > > >> > >>>>>>> having > >> >> > > > > >> > >>>>>>>>> to handle a lot of really low-level stuff that > >> you > >> >> > > > > shouldn't. > >> >> > > > > >> > >>>>>>>>> We > >> >> > > > > >> > >>>>>> spent a > >> >> > > > > >> > >>>>>>>>> lot of time to make Samza's > KafkaSystemConsumer > >> very > >> >> > > > robust. > >> >> > > > > >> It > >> >> > > > > >> > >>>>>>>>> also allows us to support some cool features: > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> * Per-partition message ordering and > >> prioritization. > >> >> > > > > >> > >>>>>>>>> * Tight control over partition assignment to > >> support > >> >> > > > joins, > >> >> > > > > >> > >>>>>>>>> global > >> >> > > > > >> > >>>>>> state > >> >> > > > > >> > >>>>>>>>> (if we want to implement it :)), etc. > >> >> > > > > >> > >>>>>>>>> * Tight control over offset checkpointing. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> What we didn't realize at the time is that > these > >> >> > > features > >> >> > > > > >> > >>>>>>>>> should > >> >> > > > > >> > >>>>>>> actually > >> >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers (not > just > >> >> Samza > >> >> > > > stream > >> >> > > > > >> > >>>>>> processors) > >> >> > > > > >> > >>>>>>>>> end up wanting to do things like joins and > >> partition > >> >> > > > > >> > >>>>>>>>> assignment. The > >> >> > > > > >> > >>>>>>> Kafka > >> >> > > > > >> > >>>>>>>>> community has come to the same conclusion. > >> They're > >> >> > > adding > >> >> > > > a > >> >> > > > > >> ton > >> >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka consumer > >> >> > > implementation. > >> >> > > > > To a > >> >> > > > > >> > >>>>>>>>> large extent, > >> >> > > > > >> > >>>>> it's > >> >> > > > > >> > >>>>>>>>> duplicate work to what we've already done in > >> Samza. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a very > >> similar > >> >> > > > > approach > >> >> > > > > >> > >>>>>>>>> to > >> >> > > > > >> > >>>>>> Samza's > >> >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation for > >> handling > >> >> > > offset > >> >> > > > > >> > >>>>>> checkpointing. > >> >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset management > feature > >> >> > stores > >> >> > > > > >> offset > >> >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows you to > fetch > >> them > >> >> > > from > >> >> > > > > the > >> >> > > > > >> > >>>>>>>>> broker. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we > could > >> >> have > >> >> > > > shared > >> >> > > > > >> > >>>>>>>>> the > >> >> > > > > >> > >>>>> work > >> >> > > > > >> > >>>>>> if > >> >> > > > > >> > >>>>>>>>> it > >> >> > > > > >> > >>>>>>>>> had been done in Kafka from the get-go. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Vision > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> All of this leads me to a rather radical > >> proposal. > >> >> > Samza > >> >> > > > is > >> >> > > > > >> > >>>>> relatively > >> >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to say that > >> we're > >> >> > > near a > >> >> > > > > 1.0 > >> >> > > > > >> > >>>>>> release. > >> >> > > > > >> > >>>>>>>>> I'd > >> >> > > > > >> > >>>>>>>>> like to propose that we take what we've > learned, > >> and > >> >> > > begin > >> >> > > > > >> > >>>>>>>>> thinking > >> >> > > > > >> > >>>>>>> about > >> >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we change if we > were > >> >> > > starting > >> >> > > > > >> from > >> >> > > > > >> > >>>>>> scratch? > >> >> > > > > >> > >>>>>>>>> My > >> >> > > > > >> > >>>>>>>>> proposal is to: > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run > >> Samza > >> >> > > > > >> > >>>>>>>>> processors, and eliminate all direct > dependences > >> on > >> >> > > YARN, > >> >> > > > > >> Mesos, > >> >> > > > > >> > >>>> etc. > >> >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support only > Kafka > >> as > >> >> the > >> >> > > > > stream > >> >> > > > > >> > >>>>>> processing > >> >> > > > > >> > >>>>>>>>> layer. > >> >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, > >> >> serialization, > >> >> > > and > >> >> > > > > >> > >>>>>>>>> config > >> >> > > > > >> > >>>>>>> systems, > >> >> > > > > >> > >>>>>>>>> and simply use Kafka's instead. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> This would fix all of the issues that I > outlined > >> >> > above. > >> >> > > It > >> >> > > > > >> > >>>>>>>>> should > >> >> > > > > >> > >>>>> also > >> >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty > dramatically. > >> >> > > Supporting > >> >> > > > > >> only > >> >> > > > > >> > >>>>>>>>> a standalone container will allow Samza to be > >> >> executed > >> >> > > on > >> >> > > > > YARN > >> >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using > Marathon/Aurora), or > >> >> most > >> >> > > > other > >> >> > > > > >> > >>>>>>>>> in-house > >> >> > > > > >> > >>>>>>> deployment > >> >> > > > > >> > >>>>>>>>> systems. This should make life a lot easier > for > >> new > >> >> > > users. > >> >> > > > > >> > >>>>>>>>> Imagine > >> >> > > > > >> > >>>>>>> having > >> >> > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN. The > drop > >> in > >> >> > > mailing > >> >> > > > > >> list > >> >> > > > > >> > >>>>>> traffic > >> >> > > > > >> > >>>>>>>>> will be pretty dramatic. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long overdue to me. > The > >> >> > > reality > >> >> > > > > is, > >> >> > > > > >> > >>>>> everyone > >> >> > > > > >> > >>>>>>>>> that > >> >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We > >> basically > >> >> > > > require > >> >> > > > > >> it > >> >> > > > > >> > >>>>>> already > >> >> > > > > >> > >>>>>>> in > >> >> > > > > >> > >>>>>>>>> order for most features to work. Those that > are > >> >> using > >> >> > > > other > >> >> > > > > >> > >>>>>>>>> systems > >> >> > > > > >> > >>>>>> are > >> >> > > > > >> > >>>>>>>>> generally using it for ingest into Kafka (1), > and > >> >> then > >> >> > > > they > >> >> > > > > do > >> >> > > > > >> > >>>>>>>>> the processing on top. There is already > >> discussion ( > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>> > >> >> > > > > >> > > >> >> > > > > > >> >> > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851 > >> >> > > > > >> > >>>>> 767 > >> >> > > > > >> > >>>>>>>>> ) > >> >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka > extremely > >> >> easy. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Once we make the call to couple with Kafka, we > >> can > >> >> > > > leverage > >> >> > > > > a > >> >> > > > > >> > >>>>>>>>> ton of > >> >> > > > > >> > >>>>>>> their > >> >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to maintain our > own > >> >> > config, > >> >> > > > > >> > >>>>>>>>> metrics, > >> >> > > > > >> > >>>>> etc. > >> >> > > > > >> > >>>>>>> We > >> >> > > > > >> > >>>>>>>>> can all share the same libraries, and make > them > >> >> > better. > >> >> > > > This > >> >> > > > > >> > >>>>>>>>> will > >> >> > > > > >> > >>>>> also > >> >> > > > > >> > >>>>>>>>> allow us to share the consumer/producer APIs, > and > >> >> will > >> >> > > let > >> >> > > > > us > >> >> > > > > >> > >>>>> leverage > >> >> > > > > >> > >>>>>>>>> their offset management and partition > management, > >> >> > rather > >> >> > > > > than > >> >> > > > > >> > >>>>>>>>> having > >> >> > > > > >> > >>>>>> our > >> >> > > > > >> > >>>>>>>>> own. All of the coordinator stream code would > go > >> >> away, > >> >> > > as > >> >> > > > > >> would > >> >> > > > > >> > >>>>>>>>> most > >> >> > > > > >> > >>>>>> of > >> >> > > > > >> > >>>>>>>>> the > >> >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably have to > push > >> some > >> >> > > > > partition > >> >> > > > > >> > >>>>>>> management > >> >> > > > > >> > >>>>>>>>> features into the Kafka broker, but they're > >> already > >> >> > > moving > >> >> > > > > in > >> >> > > > > >> > >>>>>>>>> that direction with the new consumer API. The > >> >> features > >> >> > > we > >> >> > > > > have > >> >> > > > > >> > >>>>>>>>> for > >> >> > > > > >> > >>>>>> partition > >> >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza, and seem > like > >> >> they > >> >> > > > should > >> >> > > > > >> be > >> >> > > > > >> > >>>>>>>>> in > >> >> > > > > >> > >>>>>> Kafka > >> >> > > > > >> > >>>>>>>>> anyway. There will always be some niche usages > >> which > >> >> > > will > >> >> > > > > >> > >>>>>>>>> require > >> >> > > > > >> > >>>>>> extra > >> >> > > > > >> > >>>>>>>>> care and hence full control over partition > >> >> assignments > >> >> > > > much > >> >> > > > > >> > >>>>>>>>> like the > >> >> > > > > >> > >>>>>>> Kafka > >> >> > > > > >> > >>>>>>>>> low level consumer api. These would continue > to > >> be > >> >> > > > > supported. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> These items will be good for the Samza > community. > >> >> > > They'll > >> >> > > > > make > >> >> > > > > >> > >>>>>>>>> Samza easier to use, and make it easier for > >> >> developers > >> >> > > to > >> >> > > > > add > >> >> > > > > >> > >>>>>>>>> new features. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Obviously this is a fairly large (and somewhat > >> >> > backwards > >> >> > > > > >> > >>>>> incompatible > >> >> > > > > >> > >>>>>>>>> change). If we choose to go this route, it's > >> >> important > >> >> > > > that > >> >> > > > > we > >> >> > > > > >> > >>>>> openly > >> >> > > > > >> > >>>>>>>>> communicate how we're going to provide a > >> migration > >> >> > path > >> >> > > > from > >> >> > > > > >> > >>>>>>>>> the > >> >> > > > > >> > >>>>>>> existing > >> >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make incompatible > >> >> > changes). > >> >> > > I > >> >> > > > > >> think > >> >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need to provide a > >> >> wrapper > >> >> > to > >> >> > > > > allow > >> >> > > > > >> > >>>>>>>>> existing StreamTask implementations to > continue > >> >> > running > >> >> > > on > >> >> > > > > the > >> >> > > > > >> > >>>> new container. > >> >> > > > > >> > >>>>>>> It's > >> >> > > > > >> > >>>>>>>>> also important that we openly communicate > about > >> >> > timing, > >> >> > > > and > >> >> > > > > >> > >>>>>>>>> stages > >> >> > > > > >> > >>>>> of > >> >> > > > > >> > >>>>>>> the > >> >> > > > > >> > >>>>>>>>> migration. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure you have > >> opinions. > >> >> > :) > >> >> > > > > Please > >> >> > > > > >> > >>>>>>>>> send > >> >> > > > > >> > >>>>>> your > >> >> > > > > >> > >>>>>>>>> thoughts and feedback. > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>>> Cheers, > >> >> > > > > >> > >>>>>>>>> Chris > >> >> > > > > >> > >>>>>>>>> > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>>> > >> >> > > > > >> > >>>>>>> > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>>> -- > >> >> > > > > >> > >>>>>> -- Guozhang > >> >> > > > > >> > >>>>>> > >> >> > > > > >> > >>>>> > >> >> > > > > >> > >>>> > >> >> > > > > >> > >> > >> >> > > > > >> > >> > >> >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > > >> >> > > > > >> > >> >> > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> >> > > >> >> > >> >