{quote} maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc) {quote}
Overall, I agree with this idea. Now the question is more about how to do it.

One thing I want to point out: if we decide to go this way, how do we want to support the otherSystem-transformation-otherSystem use case? Basically, there are four user groups here:

1. Kafka-transformation-Kafka
2. Kafka-transformation-otherSystem
3. otherSystem-transformation-Kafka
4. otherSystem-transformation-otherSystem

Group 1 can easily use the new Samza library. Groups 2 and 3 can use copyCat -> transformation -> Kafka or Kafka -> transformation -> copyCat.

The problem is group 4. Do we want to abandon it or still support it? Of course, this use case can be achieved with copyCat -> transformation -> Kafka -> transformation -> copyCat; the question is how we persuade those users to adopt such a long chain. If we can, it will be a win for Kafka too. Or, if no one in this community is actually doing this so far, it may be OK not to support group 4 directly.

Thanks,

Fang, Yan
yanfang...@gmail.com

On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> wrote:

> Yeah I agree with this summary. I think there are kind of two questions here:
> 1. Technically does alignment/reliance on Kafka make sense
> 2. Branding wise (naming, website, concepts, etc) does alignment with Kafka make sense
>
> Personally I do think both of these things would be really valuable, and would dramatically alter the trajectory of the project.
>
> My preference would be to see if people can mostly agree on a direction rather than splintering things off. From my point of view the ideal outcome of all the options discussed would be to make Samza a closely aligned subproject, maintained in a separate repository and retaining the existing committership but sharing as much else as possible (website, etc). No idea about how these things work, Jakob, you probably know more.
>
> No discussion amongst the Kafka folks has happened on this, but likely we should figure out what the Samza community actually wants first.
>
> I admit that this is a fairly radical departure from how things are.
>
> If that doesn't fly, I think, yeah we could leave Samza as it is and do the more radical reboot inside Kafka. From my point of view that does leave things in a somewhat confusing state since now there are two stream processing systems more or less coupled to Kafka in large part made by the same people. But, arguably that might be a cleaner way to make the cut-over and perhaps less risky for Samza community since if it works people can switch and if it doesn't nothing will have changed. Dunno, how do people feel about this?
>
> -Jay
>
> On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com> wrote:
>
> > > This leads me to thinking that merging projects and communities might be a good idea: with the union of experience from both communities, we will probably build a better system that is better for users.
> > Is this what's being proposed though? Merging the projects seems like a consequence of at most one of the three directions under discussion:
> > 1) Samza 2.0: The Samza community relies more heavily on Kafka for configuration, etc. (to a greater or lesser extent to be determined) but the Samza community would not automatically merge with the Kafka community (the Phoenix/HBase example is a good one here).
> > 2) Samza Reboot: The Samza community continues to exist with a limited project scope, but similarly would not need to be part of the Kafka community (ie given committership) to progress. Here, maybe the Samza team would become a subproject of Kafka (the Board frowns on subprojects at the moment, so I'm not sure if that's even feasible), but that would not be required.
> > 3) Hey Samza! FYI, Kafka does streaming now: In this option the Kafka team builds its own streaming library, possibly off of Jay's prototype, which has no direct lineage to the Samza team. There's no reason for the Kafka team to bring in the Samza team.
> >
> > Is the Kafka community on board with this?
> >
> > To be clear, all three options under discussion are interesting, technically valid and likely healthy directions for the project. Also, they are not mutually exclusive. The Samza community could decide to pursue, say, 'Samza 2.0', while the Kafka community went forward with 'Hey Samza!' My points above are directed entirely at the community aspect of these choices.
> > -Jakob
> >
> > On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > > That's great. Thanks, Jay.
> > >
> > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io> wrote:
> > >
> > >> Yeah totally agree. I think you have this issue even today, right? I.e. if you need to make a simple config change and you're running in YARN today you end up bouncing the job which then rebuilds state. I think the fix is exactly what you described which is to have a long timeout on partition movement for stateful jobs so that if a job is just getting bounced, and the cluster manager (or admin) is smart enough to restart it on the same host when possible, it can optimistically reuse any existing state it finds on disk (if it is valid).
> > >>
> > >> So in this model the charter of the CM is to place processes as stickily as possible and to restart or re-place failed processes. The charter of the partition management system is to control the assignment of work to these processes. The nice thing about this is that the work assignment, timeouts, behavior, configs, and code will all be the same across all cluster managers.
> > >>
> > >> So I think that prototype would actually give you exactly what you want today for any cluster manager (or manual placement + restart script) that was sticky in terms of host placement since there is already a configurable partition movement timeout and task-by-task state reuse with a check on state validity.
> > >>
> > >> -Jay
> > >>
> > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > >>
> > >> > That would be great to let Kafka do as much heavy lifting as possible and make it easier for other languages to implement Samza apis.
> > >> >
> > >> > One thing to watch out for is the interplay between Kafka's group management and the external scheduler/process manager's fault tolerance. If a container dies, the Kafka group membership protocol will try to assign its tasks to other containers while at the same time the process manager is trying to relaunch the container. Without some consideration for this (like a configurable amount of time to wait before Kafka alters the group membership), there may be thrashing going on which is especially bad for containers with large amounts of local state.
> > >> >
> > >> > Someone else pointed this out already but I thought it might be worth calling out again.
> > >> >
> > >> > Cheers,
> > >> >
> > >> > Roger
> > >> >
> > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io> wrote:
> > >> >
> > >> > > Hey Roger,
> > >> > >
> > >> > > I couldn't agree more. We spent a bunch of time talking to people and that is exactly the stuff we heard time and again. What makes it hard, of course, is that there is some tension between compatibility with what's there now and making things better for new users.
> > >> > >
> > >> > > I also strongly agree with the importance of multi-language support. We are talking now about Java, but for application development use cases people want to work in whatever language they are using elsewhere. I think moving to a model where Kafka itself does the group membership, lifecycle control, and partition assignment has the advantage of putting all that complex stuff behind a clean api that the clients are already going to be implementing for their consumer, so the added functionality for stream processing beyond a consumer becomes very minor.
> > >> > >
> > >> > > -Jay
> > >> > >
> > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
> > >> > >
> > >> > > > Metamorphosis...nice. :)
> > >> > > >
> > >> > > > This has been a great discussion. As a user of Samza who's recently integrated it into a relatively large organization, I just want to add support to a few points already made.
> > >> > > >
> > >> > > > The biggest hurdles to adoption of Samza as it currently exists that I've experienced are:
> > >> > > > 1) YARN - YARN is overly complex in many environments where Puppet would do just fine but it was the only mechanism to get fault tolerance.
> > >> > > > 2) Configuration - I think I like the idea of configuring most of the job in code rather than config files. In general, I think the goal should be to make it harder to make mistakes, especially of the kind where the code expects something and the config doesn't match. The current config is quite intricate and error-prone.
> > >> > > > For example, the application logic may depend on bootstrapping a topic but rather than asserting that in the code, you have to rely on getting the config right. Likewise with serdes, the Java representations produced by various serdes (JSON, Avro, etc.) are not equivalent so you cannot just reconfigure a serde without changing the code. It would be nice for jobs to be able to assert what they expect from their input topics in terms of partitioning. This is getting a little off topic but I was even thinking about creating a "Samza config linter" that would sanity check a set of configs. Especially in organizations where config is managed by a different team than the application developer, it's very hard to avoid config mistakes.
> > >> > > > 3) Java/Scala centric - for many teams (especially DevOps-type folks), the pain of the Java toolchain (maven, slow builds, weak command line support, configuration over convention) really inhibits productivity. As more and more high-quality clients become available for Kafka, I hope they'll follow Samza's model. Not sure how much it affects the proposals in this thread but please consider other languages in the ecosystem as well. From what I've heard, Spark has more Python users than Java/Scala.
> > >> > > > (FYI, we added a Jython wrapper for the Samza API https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza and are working on a Yeoman generator https://github.com/Quantiply/generator-rico for Jython/Samza projects to alleviate some of the pain)
> > >> > > >
> > >> > > > I also want to underscore Jay's point about improving the user experience. That's a very important factor for adoption. I think the goal should be to make Samza as easy to get started with as something like Logstash. Logstash is vastly inferior in terms of capabilities to Samza but it's easy to get started and that makes a big difference.
> > >> > > >
> > >> > > > Cheers,
> > >> > > >
> > >> > > > Roger
> > >> > > >
> > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales <g...@apache.org> wrote:
> > >> > > >
> > >> > > > > Forgot to add. On the naming issues, Kafka Metamorphosis is a clear winner :)
> > >> > > > >
> > >> > > > > --
> > >> > > > > Gianmarco
> > >> > > > >
> > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales <g...@apache.org> wrote:
> > >> > > > >
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > @Martin, thanks for your comments.
> > >> > > > > > Maybe I'm missing some important point, but I think coupling the releases is actually a *good* thing.
> > >> > > > > > To make an example, would it be better if the MR and HDFS components of Hadoop had different release schedules?
> > >> > > > > >
> > >> > > > > > Actually, keeping the discussion in a single place would make agreeing on releases (and backwards compatibility) much easier, as everybody would be responsible for the whole codebase.
> > >> > > > > >
> > >> > > > > > That said, I like the idea of absorbing samza-core as a sub-project, and leaving the fancy stuff separate.
> > >> > > > > > It probably gives 90% of the benefits we have been discussing here.
> > >> > > > > >
> > >> > > > > > Cheers,
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Gianmarco
> > >> > > > > >
> > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > >> Hey Martin,
> > >> > > > > >>
> > >> > > > > >> I agree coupling release schedules is a downside.
> > >> > > > > >>
> > >> > > > > >> Definitely we can try to solve some of the integration problems in Confluent Platform or in other distributions. But I think this ends up being really shallow. I guess I feel to really get a good user experience the two systems have to kind of feel like part of the same thing and you can't really add that in later--you can put both in the same downloadable tar file but it doesn't really give a very cohesive feeling. I agree that ultimately any of the project stuff is as much social and naming as anything else--theoretically two totally independent projects could work to tightly align. In practice this seems to be quite difficult though.
> > >> > > > > >>
> > >> > > > > >> For the frameworks--totally agree it would be good to maintain the framework support with the project. In some cases there may not be too much there since the integration gets lighter but I think whatever stubs you need should be included. So no I definitely wasn't trying to imply dropping support for these frameworks, just making the integration lighter by separating process management from partition management.
> > >> > > > > >>
> > >> > > > > >> You raise two good points we would have to figure out if we went down the alignment path:
> > >> > > > > >> 1. With respect to the name, yeah I think the first question is whether some "re-branding" would be worth it. If so then I think we can have a big thread on the name. I'm definitely not set on Kafka Streaming or Kafka Streams; I was just using them to be kind of illustrative. I agree with your critique of these names, though I think people would get the idea.
> > >> > > > > >> 2. Yeah you also raise a good point about how to "factor" it. Here are the options I see (I could get enthusiastic about any of them):
> > >> > > > > >> a. One repo for both Kafka and Samza
> > >> > > > > >> b. Two repos, retaining the current separation
> > >> > > > > >> c. Two repos, the equivalent of samza-api and samza-core is absorbed almost like a third client
> > >> > > > > >>
> > >> > > > > >> Cheers,
> > >> > > > > >>
> > >> > > > > >> -Jay
> > >> > > > > >>
> > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <mar...@kleppmann.com> wrote:
> > >> > > > > >>
> > >> > > > > >> > Ok, thanks for the clarifications. Just a few follow-up comments.
> > >> > > > > >> >
> > >> > > > > >> > - I see the appeal of merging with Kafka or becoming a subproject: the reasons you mention are good. The risk I see is that release schedules become coupled to each other, which can slow everyone down, and large projects with many contributors are harder to manage. (Jakob, can you speak from experience, having seen a wider range of Hadoop ecosystem projects?)
> > >> > > > > >> >
> > >> > > > > >> > Some of the goals of a better unified developer experience could also be solved by integrating Samza nicely into a Kafka distribution (such as Confluent's). I'm not against merging projects if we decide that's the way to go, just pointing out the same goals can perhaps also be achieved in other ways.
> > >> > > > > >> >
> > >> > > > > >> > - With regard to dropping the YARN dependency: are you proposing that Samza doesn't give any help to people wanting to run on YARN/Mesos/AWS/etc? So the docs would basically have a link to Slider and nothing else?
> > >> > > > > >> > Or would we maintain integrations with a bunch of popular deployment methods (e.g. the necessary glue and shell scripts to make Samza work with Slider)?
> > >> > > > > >> >
> > >> > > > > >> > I absolutely think it's a good idea to have the "as a library" and "as a process" (using Yi's taxonomy) options for people who want them, but I think there should also be a low-friction path for common "as a service" deployment methods, for which we probably need to maintain integrations.
> > >> > > > > >> >
> > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to me, because Kafka is all about streams already. Perhaps "Kafka Transformers" or "Kafka Filters" would be more apt?
> > >> > > > > >> >
> > >> > > > > >> > One suggestion: perhaps the core of Samza (stream transformation with state management -- i.e. the "Samza as a library" bit) could become part of Kafka, while higher-level tools such as streaming SQL and integrations with deployment frameworks remain in a separate project? In other words, Kafka would absorb the proven, stable core of Samza, which would become the "third Kafka client" mentioned early in this thread. The Samza project would then target that third Kafka client as its base API, and the project would be freed up to explore more experimental new horizons.
> > >> > > > > >> >
> > >> > > > > >> > Martin
> > >> > > > > >> >
> > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <jay.kr...@gmail.com> wrote:
> > >> > > > > >> >
> > >> > > > > >> > > Hey Martin,
> > >> > > > > >> > >
> > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually don't think it ties our hands at all, all it does is refactor things. The division of responsibility is that Samza core is responsible for task lifecycle, state, and partition management (using the Kafka co-ordinator) but it is NOT responsible for packaging, configuration deployment or execution of processes. The problem of packaging and starting these processes is framework/environment-specific. This leaves individual frameworks to be as fancy or vanilla as they like. So you can get simple stateless support in YARN, Mesos, etc using their off-the-shelf app framework (Slider, Marathon, etc). These are well known by people and have nice UIs and a lot of flexibility. I don't think they have node affinity as a built in option (though I could be wrong). So if we want that we can either wait for them to add it or do a custom framework to add that feature (as now).
> > >> > > > > >> > > Obviously if you manage things with old-school ops tools (puppet/chef/etc) you get locality easily. The nice thing, though, is that all the samza "business logic" around partition management and fault tolerance is in Samza core so it is shared across frameworks and the framework specific bit is just whether it is smart enough to try to get the same host when a job is restarted.
> > >> > > > > >> > >
> > >> > > > > >> > > With respect to the Kafka-alignment, yeah I think the goal would be (a) actually get better alignment in user experience, and (b) express this in the naming and project branding. Specifically:
> > >> > > > > >> > > 1. Website/docs, it would be nice for the "transformation" api to be discoverable in the main Kafka docs--i.e. be able to explain when to use the consumer and when to use the stream processing functionality and lead people into that experience.
> > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or whatever) that has both Kafka and the stream processing part and they actually work together.
> > >> > > > > >> > > 3. Unify the programming experience so the client and Samza api share config/monitoring/naming/packaging/etc.
> > >> > > > > >> > >
> > >> > > > > >> > > I think sub-projects keep separate committers and can have a separate repo, but I'm actually not really sure (I can't find a definition of a subproject in Apache).
> > >> > > > > >> > >
> > >> > > > > >> > > Basically at a high-level you want the experience to "feel" like a single system, not two relatively independent things that are kind of awkwardly glued together.
> > >> > > > > >> > >
> > >> > > > > >> > > I think if we did that then having naming or branding like "kafka streaming" or "kafka streams" or something like that would actually do a good job of conveying what it is. I do think that this would help adoption quite a lot as it would correctly convey that using Kafka Streaming with Kafka is a fairly seamless experience and Kafka is pretty heavily adopted at this point.
> > >> > > > > >> > >
> > >> > > > > >> > > Fwiw we actually considered this model originally when open sourcing Samza, however at that time Kafka was relatively unknown and we decided not to do it since we felt it would be limiting.
> > >> > > > > >> > > From my point of view three things have changed: (1) Kafka is now really heavily used for stream processing, (2) we learned that abstracting out the stream well is basically impossible, (3) we learned it is really hard to keep the two things feeling like a single product.
> > >> > > > > >> > >
> > >> > > > > >> > > -Jay
> > >> > > > > >> > >
> > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann <mar...@kleppmann.com> wrote:
> > >> > > > > >> > >
> > >> > > > > >> > >> Hi all,
> > >> > > > > >> > >>
> > >> > > > > >> > >> Lots of good thoughts here.
> > >> > > > > >> > >>
> > >> > > > > >> > >> I agree with the general philosophy of tying Samza more firmly to Kafka. After I spent a while looking at integrating other message brokers (e.g. Kinesis) with SystemConsumer, I came to the conclusion that SystemConsumer tacitly assumes a model so much like Kafka's that pretty much nobody but Kafka actually implements it. (Databus is perhaps an exception, but it isn't widely used outside of LinkedIn.) Thus, making Samza fully dependent on Kafka acknowledges that the system-independence was never as real as we perhaps made it out to be. The gains of code reuse are real.
> > >> > > > > >> > >>
> > >> > > > > >> > >> The idea of decoupling Samza from YARN has also always been appealing to me, for various reasons already mentioned in this thread. Although making Samza jobs deployable on anything (YARN/Mesos/AWS/etc) seems laudable, I am a little concerned that it will restrict us to a lowest common denominator. For example, would host affinity (SAMZA-617) still be possible? For jobs with large amounts of state, I think SAMZA-617 would be a big boon, since restoring state off the changelog on every single restart is painful, due to long recovery times. It would be a shame if the decoupling from YARN made host affinity impossible.
> > >> > > > > >> > >>
> > >> > > > > >> > >> Jay, a question about the proposed API for instantiating a job in code (rather than a properties file): when submitting a job to a cluster, is the idea that the instantiation code runs on a client somewhere, which then pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or does that code run on each container that is part of the job (in which case, how does the job submission to the cluster work)?
> > >> > > > > >> > >>
> > >> > > > > >> > >> I agree with Garry that it doesn't feel right to make a 1.0 release with a plan for it to be immediately obsolete. So if this is going to happen, I think it would be more honest to stick with 0.* version numbers until the library-ified Samza has been implemented, is stable and widely used.
> > >> > > > > >> > >>
> > >> > > > > >> > >> Should the new Samza be a subproject of Kafka? There is precedent for tight coupling between different Apache projects (e.g. Curator and Zookeeper, or Slider and YARN), so I think remaining separate would be ok. Even if Samza is fully dependent on Kafka, there is enough substance in Samza that it warrants being a separate project. An argument in favour of merging would be if we think Kafka has a much stronger "brand presence" than Samza; I'm ambivalent on that one. If the Kafka project is willing to endorse Samza as the "official" way of doing stateful stream transformations, that would probably have much the same effect as re-branding Samza as "Kafka Stream Processors" or suchlike. Close collaboration between the two projects will be needed in any case.
>>
>> From a project management perspective, I guess the "new Samza" would have to be developed on a branch alongside ongoing maintenance of the current line of development? I think it would be important to continue supporting existing users, and provide a graceful migration path to the new version. Leaving the current versions unsupported and forcing people to rewrite their jobs would send a bad signal.
>>
>> Best,
>> Martin
>>
>> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> wrote:
>>
>>> Hey Garry,
>>>
>>> Yeah that's super frustrating. I'd be happy to chat more about this if you'd be interested. I think Chris and I started with the idea of "what would it take to make Samza a kick-ass ingestion tool" but ultimately we kind of came around to the idea that ingestion and transformation had pretty different needs and coupling the two made things hard.
>>>
>>> For what it's worth I think copycat (KIP-26) actually will do what you are looking for.
>>>
>>> With regard to your point about slider, I don't necessarily disagree. But I think getting good YARN support is quite doable and I think we can make that work well. I think the issue this proposal solves is that technically it is pretty hard to support multiple cluster management systems the way things are now, you need to write an "app master" or "framework" for each and they are all a little different so testing is really hard. In the absence of this we have been stuck with just YARN which has fantastic penetration in the Hadoopy part of the org, but zero penetration elsewhere. Given the huge amount of work being put in to slider, marathon, aws tooling, not to mention the umpteen related packaging technologies people want to use (Docker, Kubernetes, various cloud-specific deploy tools, etc) I really think it is important to get this right.
>>>
>>> -Jay
>>>
>>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <g.turking...@improvedigital.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I think the question below re does Samza become a sub-project of Kafka highlights the broader point around migration. Chris mentions Samza's maturity is heading towards a v1 release but I'm not sure it feels right to launch a v1 then immediately plan to deprecate most of it.
>>>>
>>>> From a selfish perspective I have some guys who have started working with Samza and building some new consumers/producers was next up. Sounds like that is absolutely not the direction to go. I need to look into the KIP in more detail but for me the attractiveness of adding new Samza consumer/producers -- even if yes all they were doing was really getting data into and out of Kafka -- was to avoid having to worry about the lifecycle management of external clients.
>>>> If there is a generic Kafka ingress/egress layer that I can plug a new connector into and have a lot of the heavy lifting re scale and reliability done for me then it gives me all the pushing new consumers/producers would. If not then it complicates my operational deployments.
>>>>
>>>> Which is similar to my other question with the proposal -- if we build a fully available/stand-alone Samza plus the requisite shims to integrate with Slider etc I suspect the former may be a lot more work than we think. We may make it much easier for a newcomer to get something running but having them step up and get a reliable production deployment may still dominate mailing list traffic, if for different reasons than today.
>>>>
>>>> Don't get me wrong -- I'm comfortable with making the Samza dependency on Kafka much more explicit and I absolutely see the benefits in the reduction of duplication and clashing terminologies/abstractions that Chris/Jay describe.
>>>> Samza as a library would likely be a very nice tool to add to the Kafka ecosystem. I just have the concerns above re the operational side.
>>>>
>>>> Garry
>>>>
>>>> -----Original Message-----
>>>> From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
>>>> Sent: 02 July 2015 12:56
>>>> To: dev@samza.apache.org
>>>> Subject: Re: Thoughts and obesrvations on Samza
>>>>
>>>> Very interesting thoughts.
>>>> From outside, I have always perceived Samza as a computing layer over Kafka.
>>>>
>>>> The question, maybe a bit provocative, is "should Samza be a sub-project of Kafka then?"
>>>> Or does it make sense to keep it as a separate project with a separate governance?
>>>>
>>>> Cheers,
>>>>
>>>> --
>>>> Gianmarco
>>>>
>>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com> wrote:
>>>>
>>>>> Overall, I agree to couple with Kafka more tightly. Because Samza de facto is based on Kafka, and it should leverage what Kafka has.
>>>>> At the same time, Kafka does not need to reinvent what Samza already has. I also like the idea of separating the ingestion and transformation.
>>>>>
>>>>> But it is a little difficult for me to imagine how the Samza will look like. And I feel Chris and Jay have a little difference in terms of how Samza should look like.
>>>>>
>>>>> *** Will it look like what Jay's code shows (A client of Kafka)? And user's application code calls this client?
>>>>>
>>>>> 1. If we make Samza be a library of Kafka (like what the code shows), how do we implement auto-balance and fault-tolerance? Are they taken care by the Kafka broker or other mechanism, such as "Samza worker" (just make up the name)?
>>>>>
>>>>> 2. What about other features, such as auto-scaling, shared state, monitoring?
>>>>>
>>>>>
>>>>> *** If we have Samza standalone, (is this what Chris suggests?)
>>>>>
>>>>> 1. we still need to ingest data from Kafka and produce to it.
>>>>> Then it becomes the same as what Samza looks like now, except it does not rely on Yarn anymore.
>>>>>
>>>>> 2. if it is standalone, how can it leverage Kafka's metrics, logs, etc? Use Kafka code as the dependency?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Fang, Yan
>>>>> yanfang...@gmail.com
>>>>>
>>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>
>>>>>> Read through the code example and it looks good to me. A few thoughts regarding deployment:
>>>>>>
>>>>>> Today Samza deploys as executable runnable like:
>>>>>>
>>>>>> deploy/samza/bin/run-job.sh --config-factory=... --config-path=file://...
>>>>>>
>>>>>> And this proposal advocates for deploying Samza more as embedded libraries in user application code (ignoring the terminology since it is not the same as the prototype code):
>>>>>>
>>>>>> StreamTask task = new MyStreamTask(configs);
>>>>>> Thread thread = new Thread(task);
>>>>>> thread.start();
>>>>>>
>>>>>> I think both of these deployment modes are important for different types of users. That said, I think making Samza purely standalone is still sufficient for either runnable or library modes.
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <j...@confluent.io> wrote:
>>>>>>
>>>>>>> Looks like gmail mangled the code example, it was supposed to look like this:
>>>>>>>
>>>>>>> Properties props = new Properties();
>>>>>>> props.put("bootstrap.servers", "localhost:4242");
>>>>>>> StreamingConfig config = new StreamingConfig(props);
>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
>>>>>>> config.processor(ExampleStreamProcessor.class);
>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
>>>>>>> container.run();
>>>>>>>
>>>>>>> -Jay
>>>>>>>
>>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <j...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> This came out of some conversations Chris and I were having around whether it would make sense to use Samza as a kind of data ingestion framework for Kafka (which ultimately led to KIP-26 "copycat").
>>>>>>>> This kind of combined with complaints around config and YARN and the discussion around how to best do a standalone mode.
>>>>>>>>
>>>>>>>> So the thought experiment was, given that Samza was basically already totally Kafka specific, what if you just embraced that and turned it into something less like a heavyweight framework and more like a third Kafka client--a kind of "producing consumer" with state management facilities. Basically a library. Instead of a complex stream processing framework this would actually be a very simple thing, not much more complicated to use or operate than a Kafka consumer. As Chris said, when we thought about it, a lot of what Samza (and the other stream processing systems) were doing seemed like kind of a hangover from MapReduce.
>>>>>>>>
>>>>>>>> Of course you need to ingest/output data to and from the stream processing. But when we actually looked into how that would work, Samza isn't really an ideal data ingestion framework for a bunch of reasons. To really do that right you need a pretty different internal data model and set of apis. So what if you split them and had an api for Kafka ingress/egress (copycat AKA KIP-26) and a separate api for Kafka transformation (Samza).
>>>>>>>>
>>>>>>>> This would also allow really embracing the same terminology and conventions. One complaint about the current state is that the two systems kind of feel bolted on.
>>>>>>>> Terminology like "stream" vs "topic" and different config and monitoring systems means you kind of have to learn Kafka's way, then learn Samza's slightly different way, then kind of understand how they map to each other, which having walked a few people through this is surprisingly tricky for folks to get.
>>>>>>>>
>>>>>>>> Since I have been spending a lot of time on airplanes I hacked up an earnest but still somewhat incomplete prototype of what this would look like. This is just unceremoniously dumped into Kafka as it required a few changes to the new consumer. Here is the code:
>>>>>>>>
>>>>>>>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming
>>>>>>>>
>>>>>>>> For the purpose of the prototype I just liberally renamed everything to try to align it with Kafka with no regard for compatibility.
>>>>>>>>
>>>>>>>> To use this would be something like this:
>>>>>>>> Properties props = new Properties(); props.put("bootstrap.servers", "localhost:4242"); StreamingConfig config = new StreamingConfig(props); config.subscribe("test-topic-1", "test-topic-2"); config.processor(ExampleStreamProcessor.class); config.serialization(new StringSerializer(), new StringDeserializer()); KafkaStreaming container = new KafkaStreaming(config); container.run();
>>>>>>>>
>>>>>>>> KafkaStreaming is basically the SamzaContainer; StreamProcessor is basically StreamTask.
>>>>>>>>
>>>>>>>> So rather than putting all the class names in a file and then having the job assembled by reflection, you just instantiate the container programmatically. Work is balanced over however many instances of this are alive at any time (i.e. if an instance dies, new tasks are added to the existing containers without shutting them down).
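
The balancing behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `RebalanceSketch` and `assign` are hypothetical names, not part of the prototype or of Kafka, and the naive round-robin here recomputes the whole assignment, whereas a real assignor would try to leave surviving instances' existing partitions in place.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not taken from the prototype or from Kafka. */
public class RebalanceSketch {

    /** Spread partitions 0..partitionCount-1 round-robin over the live instances. */
    public static Map<String, List<Integer>> assign(List<String> liveInstances, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : liveInstances) {
            assignment.put(instance, new ArrayList<>());
        }
        if (liveInstances.isEmpty()) {
            return assignment; // nothing alive, so nothing can be assigned
        }
        for (int p = 0; p < partitionCount; p++) {
            String owner = liveInstances.get(p % liveInstances.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(List.of("instance-a", "instance-b", "instance-c"));
        System.out.println(assign(live, 6)); // all three instances share the work

        live.remove("instance-b");           // one instance dies
        System.out.println(assign(live, 6)); // survivors absorb its partitions
    }
}
```

The point of the sketch is only that no scheduler is involved: whichever instances are alive split the partitions among themselves.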
>>>>>>>>
>>>>>>>> We would provide some glue for running this stuff in YARN via Slider, Mesos via Marathon, and AWS using some of their tools but from the point of view of these frameworks these stream processing jobs are just stateless services that can come and go and expand and contract at will. There is no more custom scheduler.
>>>>>>>>
>>>>>>>> Here are some relevant details:
>>>>>>>>
>>>>>>>> 1. It is only ~1300 lines of code, it would get larger if we productionized but not vastly larger. We really do get a ton of leverage out of Kafka.
>>>>>>>> 2. Partition management is fully delegated to the new consumer. This is nice since now any partition management strategy available to Kafka consumer is also available to Samza (and vice versa) and with the exact same configs.
>>>>>>>> 3. It supports state as well as state reuse
>>>>>>>>
>>>>>>>> Anyhow take a look, hopefully it is thought provoking.
>>>>>>>>
>>>>>>>> -Jay
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini <criccom...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey all,
>>>>>>>>>
>>>>>>>>> I have had some discussions with Samza engineers at LinkedIn and Confluent and we came up with a few observations and would like to propose some changes.
>>>>>>>>>
>>>>>>>>> We've observed some things that I want to call out about Samza's design, and I'd like to propose some changes.
>>>>>>>>>
>>>>>>>>> * Samza is dependent upon a dynamic deployment system.
>>>>>>>>> * Samza is too pluggable.
>>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's consumer APIs are trying to solve a lot of the same problems.
>>>>>>>>>
>>>>>>>>> All three of these issues are related, but I'll address them in order.
>>>>>>>>>
>>>>>>>>> Deployment
>>>>>>>>>
>>>>>>>>> Samza strongly depends on the use of a dynamic deployment scheduler such as YARN, Mesos, etc. When we initially built Samza, we bet that there would be one or two winners in this area, and we could support them, and the rest would go away. In reality, there are many variations. Furthermore, many people still prefer to just start their processors like normal Java processes, and use traditional deployment scripts such as Fabric, Chef, Ansible, etc. Forcing a deployment system on users makes the Samza start-up process really painful for first time users.
>>>>>>>>>
>>>>>>>>> Dynamic deployment as a requirement was also a bit of a mis-fire because of a fundamental misunderstanding between the nature of batch jobs and stream processing jobs.
>>>>>>>>> Early on, we made a conscious effort to favor the Hadoop (Map/Reduce) way of doing things, since it worked and was well understood. One thing that we missed was that batch jobs have a definite beginning, and end, and stream processing jobs don't (usually). This leads to a much simpler scheduling problem for stream processors. You basically just need to find a place to start the processor, and start it. The way we run grids, at LinkedIn, there's no concept of a cluster being "full". We always add more machines. The problem with coupling Samza with a scheduler is that Samza (as a framework) now has to handle deployment. This pulls in a bunch of things such as configuration distribution (config stream), shell scripts (bin/run-job.sh, JobRunner), packaging (all the .tgz stuff), etc.
>>>>>>>>>
>>>>>>>>> Another reason for requiring dynamic deployment was to support data locality. If you want to have locality, you need to put your processors close to the data they're processing. Upon further investigation, though, this feature is not that beneficial. There is some good discussion about some problems with it on SAMZA-335. Again, we took the Map/Reduce path, but there are some fundamental differences between HDFS and Kafka. HDFS has blocks, while Kafka has partitions. This leads to less optimization potential with stream processors on top of Kafka.
>>>>>>>>>
>>>>>>>>> This feature is also used as a crutch. Samza doesn't have any built in fault-tolerance logic. Instead, it depends on the dynamic deployment scheduling system to handle restarts when a processor dies. This has made it very difficult to write a standalone Samza container (SAMZA-516).
>>>>>>>>>
>>>>>>>>> Pluggability
>>>>>>>>>
>>>>>>>>> In some cases pluggability is good, but I think that we've gone too far with it. Currently, Samza has:
>>>>>>>>>
>>>>>>>>> * Pluggable config.
>>>>>>>>> * Pluggable metrics.
>>>>>>>>> * Pluggable deployment systems.
>>>>>>>>> * Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
>>>>>>>>> * Pluggable serdes.
>>>>>>>>> * Pluggable storage engines.
>>>>>>>>> * Pluggable strategies for just about every component (MessageChooser, SystemStreamPartitionGrouper, ConfigRewriter, etc).
>>>>>>>>>
>>>>>>>>> There's probably more that I've forgotten, as well. Some of these are useful, but some have proven not to be. This all comes at a cost: complexity. This complexity is making it harder for our users to pick up and use Samza out of the box. It also makes it difficult for Samza developers to reason about the characteristics of the container (since the characteristics change depending on which plugins are used).
>>
>> The issues with pluggability are most visible in the System APIs. What
>> Samza really requires to be functional is Kafka as its transport
>> layer. But we've conflated two unrelated use cases into one API:
>>
>> 1. Get data into/out of Kafka.
>> 2. Process the data in Kafka.
>>
>> The current System API supports both of these use cases. The problem
>> is, we actually want different features for each use case. By papering
>> over these two use cases, and providing a single API, we've introduced
>> a ton of leaky abstractions.
>>
>> For example, what we'd really like in (2) is to have monotonically
>> increasing longs for offsets (like Kafka). This would be at odds with
>> (1), though, since different systems have different
>> SCNs/Offsets/UUIDs/vectors.
>> There was discussion both on the mailing list and the SQL JIRAs about
>> the need for this.
>>
>> The same thing holds true for replayability. Kafka allows us to rewind
>> when we have a failure. Many other systems don't. In some cases,
>> systems return null for their offsets (e.g. WikipediaSystemConsumer)
>> because they have no offsets.
>>
>> Partitioning is another example. Kafka supports partitioning, but many
>> systems don't. We model this by having a single partition for those
>> systems. Still, other systems model partitioning differently (e.g.
>> Kinesis).
>>
>> The SystemAdmin interface is also a mess. Creating streams in a
>> system-agnostic way is almost impossible. As is modeling metadata for
>> the system (replication factor, partitions, location, etc). The list
>> goes on.
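
To make the offset mismatch above concrete, here is a minimal Python sketch. It is not Samza code: the function name `is_newer` and the sample offset values are hypothetical, and it only assumes that a system-agnostic API has to pass offsets around as opaque, possibly missing, strings. Under that assumption, "is this message past my checkpoint?" can be answered for Kafka-style numeric offsets but not for UUID-like or null offsets.

```python
from typing import Optional


def is_newer(candidate: Optional[str], checkpoint: Optional[str]) -> Optional[bool]:
    """Compare two offsets handed over as opaque strings (hypothetical helper).

    Returns True or False when both offsets parse as Kafka-style integers,
    and None ("cannot tell") for opaque or missing offsets.
    """
    try:
        return int(candidate) > int(checkpoint)
    except (TypeError, ValueError):
        # TypeError: the system returned no offset at all (None).
        # ValueError: the offset is an opaque token such as a UUID.
        return None


kafka_like = is_newer("42", "41")             # numeric offsets are comparable
uuid_like = is_newer("a3f9-uuid", "77c1-uuid")  # opaque tokens are not
no_offsets = is_newer(None, None)             # neither are missing offsets
```

The third return state is the leak: any generic code built on such an API has to handle "unknown ordering", even though the processing use case (2) only ever needs the numeric case.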
>>
>> Duplicate work
>>
>> At the time that we began writing Samza, Kafka's consumer and producer
>> APIs had a relatively weak feature set. On the consumer side, you had
>> two options: use the high-level consumer, or the simple consumer. The
>> problem with the high-level consumer was that it controlled your
>> offsets, partition assignments, and the order in which you received
>> messages. The problem with the simple consumer is that it's not
>> simple. It's basic. You end up having to handle a lot of really
>> low-level stuff that you shouldn't. We spent a lot of time making
>> Samza's KafkaSystemConsumer very robust. It also allows us to support
>> some cool features:
>>
>> * Per-partition message ordering and prioritization.
>> * Tight control over partition assignment to support joins, global
>>   state (if we want to implement it :)), etc.
>> * Tight control over offset checkpointing.
>>
>> What we didn't realize at the time is that these features should
>> actually be in Kafka. A lot of Kafka consumers (not just Samza stream
>> processors) end up wanting to do things like joins and partition
>> assignment. The Kafka community has come to the same conclusion.
>> They're adding a ton of upgrades into their new Kafka consumer
>> implementation. To a large extent, it's duplicate work to what we've
>> already done in Samza.
>>
>> On top of this, Kafka ended up taking a very similar approach to
>> Samza's KafkaCheckpointManager implementation for handling offset
>> checkpointing. Like Samza, Kafka's new offset management feature
>> stores offset checkpoints in a topic, and allows you to fetch them
>> from the broker.
>>
>> A lot of this seems like a waste, since we could have shared the work
>> if it had been done in Kafka from the get-go.
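
The "offset checkpoints in a topic" idea mentioned above can be sketched in a few lines of Python. This is an illustration only, not the KafkaCheckpointManager or Kafka's offset management code: the class `CheckpointTopic`, its methods, and the key format are hypothetical, and an in-memory list stands in for the topic. The point is that a checkpoint store can be an append-only log where the last record per key wins on replay.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (checkpoint key, offset); the key format is made up


class CheckpointTopic:
    """In-memory stand-in for a topic that stores offset checkpoints."""

    def __init__(self) -> None:
        self._log: List[Record] = []

    def commit(self, key: str, offset: int) -> None:
        # A checkpoint is just another appended record; nothing is updated in place.
        self._log.append((key, offset))

    def restore(self) -> Dict[str, int]:
        # Replay the log from the beginning; later records overwrite earlier ones.
        latest: Dict[str, int] = {}
        for key, offset in self._log:
            latest[key] = offset
        return latest


topic = CheckpointTopic()
topic.commit("pageviews-0", 10)
topic.commit("pageviews-1", 7)
topic.commit("pageviews-0", 25)
restored = topic.restore()
```

After a restart, a processor in this sketch would call `restore()` and resume each partition from the recovered offset. Because both projects need exactly this replay-and-keep-latest behavior, building it twice is the duplication described above.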
>>
>> Vision
>>
>> All of this leads me to a rather radical proposal. Samza is relatively
>> stable at this point. I'd venture to say that we're near a 1.0
>> release. I'd like to propose that we take what we've learned, and
>> begin thinking about Samza beyond 1.0. What would we change if we were
>> starting from scratch? My proposal is to:
>>
>> 1. Make Samza standalone the *only* way to run Samza processors, and
>>    eliminate all direct dependencies on YARN, Mesos, etc.
>> 2. Make a definitive call to support only Kafka as the stream
>>    processing layer.
>> 3. Eliminate Samza's metrics, logging, serialization, and config
>>    systems, and simply use Kafka's instead.
>>
>> This would fix all of the issues that I outlined above. It should also
>> shrink the Samza code base pretty dramatically.
>> Supporting only a standalone container will allow Samza to be executed
>> on YARN (using Slider), Mesos (using Marathon/Aurora), or most other
>> in-house deployment systems. This should make life a lot easier for
>> new users. Imagine having the hello-samza tutorial without YARN. The
>> drop in mailing list traffic will be pretty dramatic.
>>
>> Coupling with Kafka seems long overdue to me. The reality is, everyone
>> that I'm aware of is using Samza with Kafka. We basically require it
>> already in order for most features to work. Those that are using other
>> systems are generally using it for ingest into Kafka (1), and then
>> they do the processing on top. There is already discussion
>> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767)
>> in Kafka to make ingesting into Kafka extremely easy.
>>
>> Once we make the call to couple with Kafka, we can leverage a ton of
>> their ecosystem. We no longer have to maintain our own config,
>> metrics, etc. We can all share the same libraries, and make them
>> better. This will also allow us to share the consumer/producer APIs,
>> and will let us leverage their offset management and partition
>> management, rather than having our own. All of the coordinator stream
>> code would go away, as would most of the YARN AppMaster code. We'd
>> probably have to push some partition management features into the
>> Kafka broker, but they're already moving in that direction with the
>> new consumer API. The features we have for partition assignment aren't
>> unique to Samza, and seem like they should be in Kafka anyway.
>> There will always be some niche usages which will require extra care
>> and hence full control over partition assignments, much like the Kafka
>> low-level consumer API. These would continue to be supported.
>>
>> These items will be good for the Samza community. They'll make Samza
>> easier to use, and make it easier for developers to add new features.
>>
>> Obviously this is a fairly large (and somewhat backwards-incompatible)
>> change. If we choose to go this route, it's important that we openly
>> communicate how we're going to provide a migration path from the
>> existing APIs to the new ones (if we make incompatible changes). I
>> think at a minimum, we'd probably need to provide a wrapper to allow
>> existing StreamTask implementations to continue running on the new
>> container. It's also important that we openly communicate about
>> timing, and stages of the migration.
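
One way to picture the StreamTask wrapper mentioned above is a plain adapter, sketched here in Python. Everything in it is an assumption made for illustration: `LegacyTask` merely imitates the shape of an existing task with a `process(envelope, collector, coordinator)` method, and the per-record callback signature `(stream, message, offset)` is an invented stand-in for whatever a new container would actually call, since the thread does not define that API.

```python
from typing import Any, Callable, List


class LegacyTask:
    """Stand-in for an existing StreamTask-style implementation (hypothetical)."""

    def process(self, envelope: dict, collector: List[Any], coordinator: Any) -> None:
        # Existing user logic stays untouched; here it upper-cases the message.
        collector.append(envelope["message"].upper())


def adapt(task: LegacyTask, collector: List[Any]) -> Callable[[str, Any, int], None]:
    """Wrap a legacy task so that a new-style container can drive it."""

    def on_record(stream: str, message: Any, offset: int) -> None:
        # Rebuild the envelope shape the old task expects from the new inputs.
        envelope = {"stream": stream, "message": message, "offset": offset}
        task.process(envelope, collector, None)  # no coordinator in this sketch

    return on_record


output: List[Any] = []
handler = adapt(LegacyTask(), output)
handler("pageviews", "hello", 0)
```

In this sketch the old task code does not change at all; only the thin `adapt` layer knows about both call shapes, which is what would let a migration proceed in stages.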
>>
>> If you made it this far, I'm sure you have opinions. :) Please send
>> your thoughts and feedback.
>>
>> Cheers,
>> Chris
>
> --
> -- Guozhang