That's great. Thanks, Jay. On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io> wrote:
> Yeah totally agree. I think you have this issue even today, right? I.e. if > you need to make a simple config change and you're running in YARN today > you end up bouncing the job which then rebuilds state. I think the fix is > exactly what you described which is to have a long timeout on partition > movement for stateful jobs so that if a job is just getting bounced, and > the cluster manager (or admin) is smart enough to restart it on the same > host when possible, it can optimistically reuse any existing state it finds > on disk (if it is valid). > > So in this model the charter of the CM is to place processes as stickily as > possible and to restart or re-place failed processes. The charter of the > partition management system is to control the assignment of work to these > processes. The nice thing about this is that the work assignment, timeouts, > behavior, configs, and code will all be the same across all cluster > managers. > > So I think that prototype would actually give you exactly what you want > today for any cluster manager (or manual placement + restart script) that > was sticky in terms of host placement since there is already a configurable > partition movement timeout and task-by-task state reuse with a check on > state validity. > > -Jay > > On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <roger.hoo...@gmail.com> > wrote: > > > That would be great to let Kafka do as much heavy lifting as possible and > > make it easier for other languages to implement Samza apis. > > > > One thing to watch out for is the interplay between Kafka's group > > management and the external scheduler/process manager's fault tolerance. > > If a container dies, the Kafka group membership protocol will try to > assign > > it's tasks to other containers while at the same time the process manager > > is trying to relaunch the container. Without some consideration for this > > (like a configurable amount of time to wait before Kafka alters the group > > membership), there may be thrashing going on which is especially bad for > > containers with large amounts of local state. > > > > Someone else pointed this out already but I thought it might be worth > > calling out again. > > > > Cheers, > > > > Roger > > > > > > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io> wrote: > > > > > Hey Roger, > > > > > > I couldn't agree more. We spent a bunch of time talking to people and > > that > > > is exactly the stuff we heard time and again. What makes it hard, of > > > course, is that there is some tension between compatibility with what's > > > there now and making things better for new users. > > > > > > I also strongly agree with the importance of multi-language support. We > > are > > > talking now about Java, but for application development use cases > people > > > want to work in whatever language they are using elsewhere. I think > > moving > > > to a model where Kafka itself does the group membership, lifecycle > > control, > > > and partition assignment has the advantage of putting all that complex > > > stuff behind a clean api that the clients are already going to be > > > implementing for their consumer, so the added functionality for stream > > > processing beyond a consumer becomes very minor. > > > > > > -Jay > > > > > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <roger.hoo...@gmail.com> > > > wrote: > > > > > > > Metamorphosis...nice. :) > > > > > > > > This has been a great discussion. As a user of Samza who's recently > > > > integrated it into a relatively large organization, I just want to > add > > > > support to a few points already made. > > > > > > > > The biggest hurdles to adoption of Samza as it currently exists that > > I've > > > > experienced are: > > > > 1) YARN - YARN is overly complex in many environments where Puppet > > would > > > do > > > > just fine but it was the only mechanism to get fault tolerance. > > > > 2) Configuration - I think I like the idea of configuring most of the > > job > > > > in code rather than config files. In general, I think the goal > should > > be > > > > to make it harder to make mistakes, especially of the kind where the > > code > > > > expects something and the config doesn't match. The current config > is > > > > quite intricate and error-prone. For example, the application logic > > may > > > > depend on bootstrapping a topic but rather than asserting that in the > > > code, > > > > you have to rely on getting the config right. Likewise with serdes, > > the > > > > Java representations produced by various serdes (JSON, Avro, etc.) > are > > > not > > > > equivalent so you cannot just reconfigure a serde without changing > the > > > > code. It would be nice for jobs to be able to assert what they > expect > > > > from their input topics in terms of partitioning. This is getting a > > > little > > > > off topic but I was even thinking about creating a "Samza config > > linter" > > > > that would sanity check a set of configs. Especially in > organizations > > > > where config is managed by a different team than the application > > > developer, > > > > it's very hard to get avoid config mistakes. > > > > 3) Java/Scala centric - for many teams (especially DevOps-type > folks), > > > the > > > > pain of the Java toolchain (maven, slow builds, weak command line > > > support, > > > > configuration over convention) really inhibits productivity. As more > > and > > > > more high-quality clients become available for Kafka, I hope they'll > > > follow > > > > Samza's model. Not sure how much it affects the proposals in this > > thread > > > > but please consider other languages in the ecosystem as well. From > > what > > > > I've heard, Spark has more Python users than Java/Scala. > > > > (FYI, we added a Jython wrapper for the Samza API > > > > > > > > > > > > > > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza > > > > and are working on a Yeoman generator > > > > https://github.com/Quantiply/generator-rico for Jython/Samza > projects > > to > > > > alleviate some of the pain) > > > > > > > > I also want to underscore Jay's point about improving the user > > > experience. > > > > That's a very important factor for adoption. I think the goal should > > be > > > to > > > > make Samza as easy to get started with as something like Logstash. > > > > Logstash is vastly inferior in terms of capabilities to Samza but > it's > > > easy > > > > to get started and that makes a big difference. > > > > > > > > Cheers, > > > > > > > > Roger > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci Morales < > > > > g...@apache.org> wrote: > > > > > > > > > Forgot to add. On the naming issues, Kafka Metamorphosis is a clear > > > > winner > > > > > :) > > > > > > > > > > -- > > > > > Gianmarco > > > > > > > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci Morales < > > > g...@apache.org > > > > > > > > > > wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > @Martin, thanks for you comments. > > > > > > Maybe I'm missing some important point, but I think coupling the > > > > releases > > > > > > is actually a *good* thing. > > > > > > To make an example, would it be better if the MR and HDFS > > components > > > of > > > > > > Hadoop had different release schedules? > > > > > > > > > > > > Actually, keeping the discussion in a single place would make > > > agreeing > > > > on > > > > > > releases (and backwards compatibility) much easier, as everybody > > > would > > > > be > > > > > > responsible for the whole codebase. > > > > > > > > > > > > That said, I like the idea of absorbing samza-core as a > > sub-project, > > > > and > > > > > > leave the fancy stuff separate. > > > > > > It probably gives 90% of the benefits we have been discussing > here. > > > > > > > > > > > > Cheers, > > > > > > > > > > > > -- > > > > > > Gianmarco > > > > > > > > > > > > On 7 July 2015 at 02:30, Jay Kreps <jay.kr...@gmail.com> wrote: > > > > > > > > > > > >> Hey Martin, > > > > > >> > > > > > >> I agree coupling release schedules is a downside. > > > > > >> > > > > > >> Definitely we can try to solve some of the integration problems > in > > > > > >> Confluent Platform or in other distributions. But I think this > > ends > > > up > > > > > >> being really shallow. I guess I feel to really get a good user > > > > > experience > > > > > >> the two systems have to kind of feel like part of the same thing > > and > > > > you > > > > > >> can't really add that in later--you can put both in the same > > > > > downloadable > > > > > >> tar file but it doesn't really give a very cohesive feeling. I > > agree > > > > > that > > > > > >> ultimately any of the project stuff is as much social and naming > > as > > > > > >> anything else--theoretically two totally independent projects > > could > > > > work > > > > > >> to > > > > > >> tightly align. In practice this seems to be quite difficult > > though. > > > > > >> > > > > > >> For the frameworks--totally agree it would be good to maintain > the > > > > > >> framework support with the project. In some cases there may not > be > > > too > > > > > >> much > > > > > >> there since the integration gets lighter but I think whatever > > stubs > > > > you > > > > > >> need should be included. So no I definitely wasn't trying to > imply > > > > > >> dropping > > > > > >> support for these frameworks, just making the integration > lighter > > by > > > > > >> separating process management from partition management. > > > > > >> > > > > > >> You raise two good points we would have to figure out if we went > > > down > > > > > the > > > > > >> alignment path: > > > > > >> 1. With respect to the name, yeah I think the first question is > > > > whether > > > > > >> some "re-branding" would be worth it. If so then I think we can > > > have a > > > > > big > > > > > >> thread on the name. I'm definitely not set on Kafka Streaming or > > > Kafka > > > > > >> Streams I was just using them to be kind of illustrative. I > agree > > > with > > > > > >> your > > > > > >> critique of these names, though I think people would get the > idea. > > > > > >> 2. Yeah you also raise a good point about how to "factor" it. > Here > > > are > > > > > the > > > > > >> options I see (I could get enthusiastic about any of them): > > > > > >> a. One repo for both Kafka and Samza > > > > > >> b. Two repos, retaining the current seperation > > > > > >> c. Two repos, the equivalent of samza-api and samza-core is > > > > absorbed > > > > > >> almost like a third client > > > > > >> > > > > > >> Cheers, > > > > > >> > > > > > >> -Jay > > > > > >> > > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann < > > > > mar...@kleppmann.com> > > > > > >> wrote: > > > > > >> > > > > > >> > Ok, thanks for the clarifications. Just a few follow-up > > comments. > > > > > >> > > > > > > >> > - I see the appeal of merging with Kafka or becoming a > > subproject: > > > > the > > > > > >> > reasons you mention are good. The risk I see is that release > > > > schedules > > > > > >> > become coupled to each other, which can slow everyone down, > and > > > > large > > > > > >> > projects with many contributors are harder to manage. (Jakob, > > can > > > > you > > > > > >> speak > > > > > >> > from experience, having seen a wider range of Hadoop ecosystem > > > > > >> projects?) > > > > > >> > > > > > > >> > Some of the goals of a better unified developer experience > could > > > > also > > > > > be > > > > > >> > solved by integrating Samza nicely into a Kafka distribution > > (such > > > > as > > > > > >> > Confluent's). I'm not against merging projects if we decide > > that's > > > > the > > > > > >> way > > > > > >> > to go, just pointing out the same goals can perhaps also be > > > achieved > > > > > in > > > > > >> > other ways. > > > > > >> > > > > > > >> > - With regard to dropping the YARN dependency: are you > proposing > > > > that > > > > > >> > Samza doesn't give any help to people wanting to run on > > > > > >> YARN/Mesos/AWS/etc? > > > > > >> > So the docs would basically have a link to Slider and nothing > > > else? > > > > Or > > > > > >> > would we maintain integrations with a bunch of popular > > deployment > > > > > >> methods > > > > > >> > (e.g. the necessary glue and shell scripts to make Samza work > > with > > > > > >> Slider)? > > > > > >> > > > > > > >> > I absolutely think it's a good idea to have the "as a library" > > and > > > > > "as a > > > > > >> > process" (using Yi's taxonomy) options for people who want > them, > > > > but I > > > > > >> > think there should also be a low-friction path for common "as > a > > > > > service" > > > > > >> > deployment methods, for which we probably need to maintain > > > > > integrations. > > > > > >> > > > > > > >> > - Project naming: "Kafka Streams" seems odd to me, because > Kafka > > > is > > > > > all > > > > > >> > about streams already. Perhaps "Kafka Transformers" or "Kafka > > > > Filters" > > > > > >> > would be more apt? > > > > > >> > > > > > > >> > One suggestion: perhaps the core of Samza (stream > transformation > > > > with > > > > > >> > state management -- i.e. the "Samza as a library" bit) could > > > become > > > > > >> part of > > > > > >> > Kafka, while higher-level tools such as streaming SQL and > > > > integrations > > > > > >> with > > > > > >> > deployment frameworks remain in a separate project? In other > > > words, > > > > > >> Kafka > > > > > >> > would absorb the proven, stable core of Samza, which would > > become > > > > the > > > > > >> > "third Kafka client" mentioned early in this thread. The Samza > > > > project > > > > > >> > would then target that third Kafka client as its base API, and > > the > > > > > >> project > > > > > >> > would be freed up to explore more experimental new horizons. > > > > > >> > > > > > > >> > Martin > > > > > >> > > > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <jay.kr...@gmail.com> > wrote: > > > > > >> > > > > > > >> > > Hey Martin, > > > > > >> > > > > > > > >> > > For the YARN/Mesos/etc decoupling I actually don't think it > > ties > > > > our > > > > > >> > hands > > > > > >> > > at all, all it does is refactor things. The division of > > > > > >> responsibility is > > > > > >> > > that Samza core is responsible for task lifecycle, state, > and > > > > > >> partition > > > > > >> > > management (using the Kafka co-ordinator) but it is NOT > > > > responsible > > > > > >> for > > > > > >> > > packaging, configuration deployment or execution of > processes. > > > The > > > > > >> > problem > > > > > >> > > of packaging and starting these processes is > > > > > >> > > framework/environment-specific. This leaves individual > > > frameworks > > > > to > > > > > >> be > > > > > >> > as > > > > > >> > > fancy or vanilla as they like. So you can get simple > stateless > > > > > >> support in > > > > > >> > > YARN, Mesos, etc using their off-the-shelf app framework > > > (Slider, > > > > > >> > Marathon, > > > > > >> > > etc). These are well known by people and have nice UIs and a > > lot > > > > of > > > > > >> > > flexibility. I don't think they have node affinity as a > built > > in > > > > > >> option > > > > > >> > > (though I could be wrong). So if we want that we can either > > wait > > > > for > > > > > >> them > > > > > >> > > to add it or do a custom framework to add that feature (as > > now). > > > > > >> > Obviously > > > > > >> > > if you manage things with old-school ops tools > > (puppet/chef/etc) > > > > you > > > > > >> get > > > > > >> > > locality easily. The nice thing, though, is that all the > samza > > > > > >> "business > > > > > >> > > logic" around partition management and fault tolerance is in > > > Samza > > > > > >> core > > > > > >> > so > > > > > >> > > it is shared across frameworks and the framework specific > bit > > is > > > > > just > > > > > >> > > whether it is smart enough to try to get the same host when > a > > > job > > > > is > > > > > >> > > restarted. > > > > > >> > > > > > > > >> > > With respect to the Kafka-alignment, yeah I think the goal > > would > > > > be > > > > > >> (a) > > > > > >> > > actually get better alignment in user experience, and (b) > > > express > > > > > >> this in > > > > > >> > > the naming and project branding. Specifically: > > > > > >> > > 1. Website/docs, it would be nice for the "transformation" > api > > > to > > > > be > > > > > >> > > discoverable in the main Kafka docs--i.e. be able to explain > > > when > > > > to > > > > > >> use > > > > > >> > > the consumer and when to use the stream processing > > functionality > > > > and > > > > > >> lead > > > > > >> > > people into that experience. > > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 (or whatever) > > that > > > > has > > > > > >> both > > > > > >> > > Kafka and the stream processing part and they actually work > > > > > together. > > > > > >> > > 3. Unify the programming experience so the client and Samza > > api > > > > > share > > > > > >> > > config/monitoring/naming/packaging/etc. > > > > > >> > > > > > > > >> > > I think sub-projects keep separate committers and can have a > > > > > separate > > > > > >> > repo, > > > > > >> > > but I'm actually not really sure (I can't find a definition > > of a > > > > > >> > subproject > > > > > >> > > in Apache). > > > > > >> > > > > > > > >> > > Basically at a high-level you want the experience to "feel" > > > like a > > > > > >> single > > > > > >> > > system, not to relatively independent things that are kind > of > > > > > >> awkwardly > > > > > >> > > glued together. > > > > > >> > > > > > > > >> > > I think if we did that they having naming or branding like > > > "kafka > > > > > >> > > streaming" or "kafka streams" or something like that would > > > > actually > > > > > >> do a > > > > > >> > > good job of conveying what it is. I do that this would help > > > > adoption > > > > > >> > quite > > > > > >> > > a lot as it would correctly convey that using Kafka > Streaming > > > with > > > > > >> Kafka > > > > > >> > is > > > > > >> > > a fairly seamless experience and Kafka is pretty heavily > > adopted > > > > at > > > > > >> this > > > > > >> > > point. > > > > > >> > > > > > > > >> > > Fwiw we actually considered this model originally when open > > > > sourcing > > > > > >> > Samza, > > > > > >> > > however at that time Kafka was relatively unknown and we > > decided > > > > not > > > > > >> to > > > > > >> > do > > > > > >> > > it since we felt it would be limiting. From my point of view > > the > > > > > three > > > > > >> > > things have changed (1) Kafka is now really heavily used for > > > > stream > > > > > >> > > processing, (2) we learned that abstracting out the stream > > well > > > is > > > > > >> > > basically impossible, (3) we learned it is really hard to > keep > > > the > > > > > two > > > > > >> > > things feeling like a single product. > > > > > >> > > > > > > > >> > > -Jay > > > > > >> > > > > > > > >> > > > > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin Kleppmann < > > > > > >> mar...@kleppmann.com> > > > > > >> > > wrote: > > > > > >> > > > > > > > >> > >> Hi all, > > > > > >> > >> > > > > > >> > >> Lots of good thoughts here. > > > > > >> > >> > > > > > >> > >> I agree with the general philosophy of tying Samza more > > firmly > > > to > > > > > >> Kafka. > > > > > >> > >> After I spent a while looking at integrating other message > > > > brokers > > > > > >> (e.g. > > > > > >> > >> Kinesis) with SystemConsumer, I came to the conclusion that > > > > > >> > SystemConsumer > > > > > >> > >> tacitly assumes a model so much like Kafka's that pretty > much > > > > > nobody > > > > > >> but > > > > > >> > >> Kafka actually implements it. (Databus is perhaps an > > exception, > > > > but > > > > > >> it > > > > > >> > >> isn't widely used outside of LinkedIn.) Thus, making Samza > > > fully > > > > > >> > dependent > > > > > >> > >> on Kafka acknowledges that the system-independence was > never > > as > > > > > real > > > > > >> as > > > > > >> > we > > > > > >> > >> perhaps made it out to be. The gains of code reuse are > real. > > > > > >> > >> > > > > > >> > >> The idea of decoupling Samza from YARN has also always been > > > > > >> appealing to > > > > > >> > >> me, for various reasons already mentioned in this thread. > > > > Although > > > > > >> > making > > > > > >> > >> Samza jobs deployable on anything (YARN/Mesos/AWS/etc) > seems > > > > > >> laudable, > > > > > >> > I am > > > > > >> > >> a little concerned that it will restrict us to a lowest > > common > > > > > >> > denominator. > > > > > >> > >> For example, would host affinity (SAMZA-617) still be > > possible? > > > > For > > > > > >> jobs > > > > > >> > >> with large amounts of state, I think SAMZA-617 would be a > big > > > > boon, > > > > > >> > since > > > > > >> > >> restoring state off the changelog on every single restart > is > > > > > painful, > > > > > >> > due > > > > > >> > >> to long recovery times. It would be a shame if the > decoupling > > > > from > > > > > >> YARN > > > > > >> > >> made host affinity impossible. > > > > > >> > >> > > > > > >> > >> Jay, a question about the proposed API for instantiating a > > job > > > in > > > > > >> code > > > > > >> > >> (rather than a properties file): when submitting a job to a > > > > > cluster, > > > > > >> is > > > > > >> > the > > > > > >> > >> idea that the instantiation code runs on a client > somewhere, > > > > which > > > > > >> then > > > > > >> > >> pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or > does > > > that > > > > > >> code > > > > > >> > run > > > > > >> > >> on each container that is part of the job (in which case, > how > > > > does > > > > > >> the > > > > > >> > job > > > > > >> > >> submission to the cluster work)? > > > > > >> > >> > > > > > >> > >> I agree with Garry that it doesn't feel right to make a 1.0 > > > > release > > > > > >> > with a > > > > > >> > >> plan for it to be immediately obsolete. So if this is going > > to > > > > > >> happen, I > > > > > >> > >> think it would be more honest to stick with 0.* version > > numbers > > > > > until > > > > > >> > the > > > > > >> > >> library-ified Samza has been implemented, is stable and > > widely > > > > > used. > > > > > >> > >> > > > > > >> > >> Should the new Samza be a subproject of Kafka? There is > > > precedent > > > > > for > > > > > >> > >> tight coupling between different Apache projects (e.g. > > Curator > > > > and > > > > > >> > >> Zookeeper, or Slider and YARN), so I think remaining > separate > > > > would > > > > > >> be > > > > > >> > ok. > > > > > >> > >> Even if Samza is fully dependent on Kafka, there is enough > > > > > substance > > > > > >> in > > > > > >> > >> Samza that it warrants being a separate project. An > argument > > in > > > > > >> favour > > > > > >> > of > > > > > >> > >> merging would be if we think Kafka has a much stronger > "brand > > > > > >> presence" > > > > > >> > >> than Samza; I'm ambivalent on that one. If the Kafka > project > > is > > > > > >> willing > > > > > >> > to > > > > > >> > >> endorse Samza as the "official" way of doing stateful > stream > > > > > >> > >> transformations, that would probably have much the same > > effect > > > as > > > > > >> > >> re-branding Samza as "Kafka Stream Processors" or suchlike. > > > Close > > > > > >> > >> collaboration between the two projects will be needed in > any > > > > case. > > > > > >> > >> > > > > > >> > >> From a project management perspective, I guess the "new > > Samza" > > > > > would > > > > > >> > have > > > > > >> > >> to be developed on a branch alongside ongoing maintenance > of > > > the > > > > > >> current > > > > > >> > >> line of development? I think it would be important to > > continue > > > > > >> > supporting > > > > > >> > >> existing users, and provide a graceful migration path to > the > > > new > > > > > >> > version. > > > > > >> > >> Leaving the current versions unsupported and forcing people > > to > > > > > >> rewrite > > > > > >> > >> their jobs would send a bad signal. > > > > > >> > >> > > > > > >> > >> Best, > > > > > >> > >> Martin > > > > > >> > >> > > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> > wrote: > > > > > >> > >> > > > > > >> > >>> Hey Garry, > > > > > >> > >>> > > > > > >> > >>> Yeah that's super frustrating. I'd be happy to chat more > > about > > > > > this > > > > > >> if > > > > > >> > >>> you'd be interested. I think Chris and I started with the > > idea > > > > of > > > > > >> "what > > > > > >> > >>> would it take to make Samza a kick-ass ingestion tool" but > > > > > >> ultimately > > > > > >> > we > > > > > >> > >>> kind of came around to the idea that ingestion and > > > > transformation > > > > > >> had > > > > > >> > >>> pretty different needs and coupling the two made things > > hard. > > > > > >> > >>> > > > > > >> > >>> For what it's worth I think copycat (KIP-26) actually will > > do > > > > what > > > > > >> you > > > > > >> > >> are > > > > > >> > >>> looking for. > > > > > >> > >>> > > > > > >> > >>> With regard to your point about slider, I don't > necessarily > > > > > >> disagree. > > > > > >> > >> But I > > > > > >> > >>> think getting good YARN support is quite doable and I > think > > we > > > > can > > > > > >> make > > > > > >> > >>> that work well. I think the issue this proposal solves is > > that > > > > > >> > >> technically > > > > > >> > >>> it is pretty hard to support multiple cluster management > > > systems > > > > > the > > > > > >> > way > > > > > >> > >>> things are now, you need to write an "app master" or > > > "framework" > > > > > for > > > > > >> > each > > > > > >> > >>> and they are all a little different so testing is really > > hard. > > > > In > > > > > >> the > > > > > >> > >>> absence of this we have been stuck with just YARN which > has > > > > > >> fantastic > > > > > >> > >>> penetration in the Hadoopy part of the org, but zero > > > penetration > > > > > >> > >> elsewhere. > > > > > >> > >>> Given the huge amount of work being put in to slider, > > > marathon, > > > > > aws > > > > > >> > >>> tooling, not to mention the umpteen related packaging > > > > technologies > > > > > >> > people > > > > > >> > >>> want to use (Docker, Kubernetes, various cloud-specific > > deploy > > > > > >> tools, > > > > > >> > >> etc) > > > > > >> > >>> I really think it is important to get this right. > > > > > >> > >>> > > > > > >> > >>> -Jay > > > > > >> > >>> > > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington < > > > > > >> > >>> g.turking...@improvedigital.com> wrote: > > > > > >> > >>> > > > > > >> > >>>> Hi all, > > > > > >> > >>>> > > > > > >> > >>>> I think the question below re does Samza become a > > sub-project > > > > of > > > > > >> Kafka > > > > > >> > >>>> highlights the broader point around migration. Chris > > mentions > > > > > >> Samza's > > > > > >> > >>>> maturity is heading towards a v1 release but I'm not sure > > it > > > > > feels > > > > > >> > >> right to > > > > > >> > >>>> launch a v1 then immediately plan to deprecate most of > it. > > > > > >> > >>>> > > > > > >> > >>>> From a selfish perspective I have some guys who have > > started > > > > > >> working > > > > > >> > >> with > > > > > >> > >>>> Samza and building some new consumers/producers was next > > up. > > > > > Sounds > > > > > >> > like > > > > > >> > >>>> that is absolutely not the direction to go. I need to > look > > > into > > > > > the > > > > > >> > KIP > > > > > >> > >> in > > > > > >> > >>>> more detail but for me the attractiveness of adding new > > Samza > > > > > >> > >>>> consumer/producers -- even if yes all they were doing was > > > > really > > > > > >> > getting > > > > > >> > >>>> data into and out of Kafka -- was to avoid having to > > worry > > > > > about > > > > > >> the > > > > > >> > >>>> lifecycle management of external clients. If there is a > > > generic > > > > > >> Kafka > > > > > >> > >>>> ingress/egress layer that I can plug a new connector into > > and > > > > > have > > > > > >> a > > > > > >> > >> lot of > > > > > >> > >>>> the heavy lifting re scale and reliability done for me > then > > > it > > > > > >> gives > > > > > >> > me > > > > > >> > >> all > > > > > >> > >>>> the pushing new consumers/producers would. If not then it > > > > > >> complicates > > > > > >> > my > > > > > >> > >>>> operational deployments. > > > > > >> > >>>> > > > > > >> > >>>> Which is similar to my other question with the proposal > -- > > if > > > > we > > > > > >> > build a > > > > > >> > >>>> fully available/stand-alone Samza plus the requisite > shims > > to > > > > > >> > integrate > > > > > >> > >>>> with Slider etc I suspect the former may be a lot more > work > > > > than > > > > > we > > > > > >> > >> think. > > > > > >> > >>>> We may make it much easier for a newcomer to get > something > > > > > running > > > > > >> but > > > > > >> > >>>> having them step up and get a reliable production > > deployment > > > > may > > > > > >> still > > > > > >> > >>>> dominate mailing list traffic, if for different reasons > > than > > > > > >> today. > > > > > >> > >>>> > > > > > >> > >>>> Don't get me wrong -- I'm comfortable with making the > Samza > > > > > >> dependency > > > > > >> > >> on > > > > > >> > >>>> Kafka much more explicit and I absolutely see the > benefits > > > in > > > > > the > > > > > >> > >>>> reduction of duplication and clashing > > > > terminologies/abstractions > > > > > >> that > > > > > >> > >>>> Chris/Jay describe. Samza as a library would likely be a > > very > > > > > nice > > > > > >> > tool > > > > > >> > >> to > > > > > >> > >>>> add to the Kafka ecosystem. I just have the concerns > above > > re > > > > the > > > > > >> > >>>> operational side. > > > > > >> > >>>> > > > > > >> > >>>> Garry > > > > > >> > >>>> > > > > > >> > >>>> -----Original Message----- > > > > > >> > >>>> From: Gianmarco De Francisci Morales [mailto: > > g...@apache.org > > > ] > > > > > >> > >>>> Sent: 02 July 2015 12:56 > > > > > >> > >>>> To: dev@samza.apache.org > > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on Samza > > > > > >> > >>>> > > > > > >> > >>>> Very interesting thoughts. > > > > > >> > >>>> From outside, I have always perceived Samza as a > computing > > > > layer > > > > > >> over > > > > > >> > >>>> Kafka. > > > > > >> > >>>> > > > > > >> > >>>> The question, maybe a bit provocative, is "should Samza > be > > a > > > > > >> > sub-project > > > > > >> > >>>> of Kafka then?" > > > > > >> > >>>> Or does it make sense to keep it as a separate project > > with a > > > > > >> separate > > > > > >> > >>>> governance? > > > > > >> > >>>> > > > > > >> > >>>> Cheers, > > > > > >> > >>>> > > > > > >> > >>>> -- > > > > > >> > >>>> Gianmarco > > > > > >> > >>>> > > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com> > > > > wrote: > > > > > >> > >>>> > > > > > >> > >>>>> Overall, I agree to couple with Kafka more tightly. > > Because > > > > > Samza > > > > > >> de > > > > > >> > >>>>> facto is based on Kafka, and it should leverage what > Kafka > > > > has. > > > > > At > > > > > >> > the > > > > > >> > >>>>> same time, Kafka does not need to reinvent what Samza > > > already > > > > > >> has. I > > > > > >> > >>>>> also like the idea of separating the ingestion and > > > > > transformation. > > > > > >> > >>>>> > > > > > >> > >>>>> But it is a little difficult for me to image how the > Samza > > > > will > > > > > >> look > > > > > >> > >>>> like. > > > > > >> > >>>>> And I feel Chris and Jay have a little difference in > terms > > > of > > > > > how > > > > > >> > >>>>> Samza should look like. > > > > > >> > >>>>> > > > > > >> > >>>>> *** Will it look like what Jay's code shows (A client of > > > > Kakfa) > > > > > ? > > > > > >> And > > > > > >> > >>>>> user's application code calls this client? > > > > > >> > >>>>> > > > > > >> > >>>>> 1. If we make Samza be a library of Kafka (like what the > > > code > > > > > >> shows), > > > > > >> > >>>>> how do we implement auto-balance and fault-tolerance? > Are > > > they > > > > > >> taken > > > > > >> > >>>>> care by the Kafka broker or other mechanism, such as > > "Samza > > > > > >> worker" > > > > > >> > >>>>> (just make up the name) ? > > > > > >> > >>>>> > > > > > >> > >>>>> 2. What about other features, such as auto-scaling, > shared > > > > > state, > > > > > >> > >>>>> monitoring? > > > > > >> > >>>>> > > > > > >> > >>>>> > > > > > >> > >>>>> *** If we have Samza standalone, (is this what Chris > > > > suggests?) > > > > > >> > >>>>> > > > > > >> > >>>>> 1. we still need to ingest data from Kakfa and produce > to > > > it. > > > > > >> Then it > > > > > >> > >>>>> becomes the same as what Samza looks like now, except it > > > does > > > > > not > > > > > >> > rely > > > > > >> > >>>>> on Yarn anymore. > > > > > >> > >>>>> > > > > > >> > >>>>> 2. if it is standalone, how can it leverage Kafka's > > metrics, > > > > > logs, > > > > > >> > >>>>> etc? Use Kafka code as the dependency? > > > > > >> > >>>>> > > > > > >> > >>>>> > > > > > >> > >>>>> Thanks, > > > > > >> > >>>>> > > > > > >> > >>>>> Fang, Yan > > > > > >> > >>>>> yanfang...@gmail.com > > > > > >> > >>>>> > > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang < > > > > > wangg...@gmail.com > > > > > >> > > > > > > >> > >>>> wrote: > > > > > >> > >>>>> > > > > > >> > >>>>>> Read through the code example and it looks good to me. > A > > > few > > > > > >> > >>>>>> thoughts regarding deployment: > > > > > >> > >>>>>> > > > > > >> > >>>>>> Today Samza deploys as executable runnable like: > > > > > >> > >>>>>> > > > > > >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=... > > > > > >> > >>>> --config-path=file://... > > > > > >> > >>>>>> > > > > > >> > >>>>>> And this proposal advocate for deploying Samza more as > > > > embedded > > > > > >> > >>>>>> libraries in user application code (ignoring the > > > terminology > > > > > >> since > > > > > >> > >>>>>> it is not the > > > > > >> > >>>>> same > > > > > >> > >>>>>> as the prototype code): > > > > > >> > >>>>>> > > > > > >> > >>>>>> StreamTask task = new MyStreamTask(configs); Thread > > thread > > > = > > > > > new > > > > > >> > >>>>>> Thread(task); thread.start(); > > > > > >> > >>>>>> > > > > > >> > >>>>>> I think both of these deployment modes are important > for > > > > > >> different > > > > > >> > >>>>>> types > > > > > >> > >>>>> of > > > > > >> > >>>>>> users. That said, I think making Samza purely > standalone > > is > > > > > still > > > > > >> > >>>>>> sufficient for either runnable or library modes. > > > > > >> > >>>>>> > > > > > >> > >>>>>> Guozhang > > > > > >> > >>>>>> > > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps < > > > > j...@confluent.io> > > > > > >> > wrote: > > > > > >> > >>>>>> > > > > > >> > >>>>>>> Looks like gmail mangled the code example, it was > > supposed > > > > to > > > > > >> look > > > > > >> > >>>>>>> like > > > > > >> > >>>>>>> this: > > > > > >> > >>>>>>> > > > > > >> > >>>>>>> Properties props = new Properties(); > > > > > >> > >>>>>>> props.put("bootstrap.servers", "localhost:4242"); > > > > > >> StreamingConfig > > > > > >> > >>>>>>> config = new StreamingConfig(props); > > > > > >> > >>>>>>> config.subscribe("test-topic-1", "test-topic-2"); > > > > > >> > >>>>>>> config.processor(ExampleStreamProcessor.class); > > > > > >> > >>>>>>> config.serialization(new StringSerializer(), new > > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming container = new > > > > > >> > >>>>>>> KafkaStreaming(config); container.run(); > > > > > >> > >>>>>>> > > > > > >> > >>>>>>> -Jay > > > > > >> > >>>>>>> > > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps < > > > > j...@confluent.io > > > > > > > > > > > >> > >>>> wrote: > > > > > >> > >>>>>>> > > > > > >> > >>>>>>>> Hey guys, > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> This came out of some conversations Chris and I were > > > having > > > > > >> > >>>>>>>> around > > > > > >> > >>>>>>> whether > > > > > >> > >>>>>>>> it would make sense to use Samza as a kind of data > > > > ingestion > > > > > >> > >>>>> framework > > > > > >> > >>>>>>> for > > > > > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26 "copycat"). > This > > > > kind > > > > > of > > > > > >> > >>>>>> combined > > > > > >> > >>>>>>>> with complaints around config and YARN and the > > discussion > > > > > >> around > > > > > >> > >>>>>>>> how > > > > > >> > >>>>> to > > > > > >> > >>>>>>>> best do a standalone mode. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> So the thought experiment was, given that Samza was > > > > basically > > > > > >> > >>>>>>>> already totally Kafka specific, what if you just > > embraced > > > > > that > > > > > >> > >>>>>>>> and turned it > > > > > >> > >>>>>> into > > > > > >> > >>>>>>>> something less like a heavyweight framework and more > > > like a > > > > > >> > >>>>>>>> third > > > > > >> > >>>>> Kafka > > > > > >> > >>>>>>>> client--a kind of "producing consumer" with state > > > > management > > > > > >> > >>>>>> facilities. > > > > > >> > >>>>>>>> Basically a library. Instead of a complex stream > > > processing > > > > > >> > >>>>>>>> framework > > > > > >> > >>>>>>> this > > > > > >> > >>>>>>>> would actually be a very simple thing, not much more > > > > > >> complicated > > > > > >> > >>>>>>>> to > > > > > >> > >>>>> use > > > > > >> > >>>>>>> or > > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris said we > thought > > > > about > > > > > >> it > > > > > >> > >>>>>>>> a > > > > > >> > >>>>> lot > > > > > >> > >>>>>> of > > > > > >> > >>>>>>>> what Samza (and the other stream processing systems > > were > > > > > doing) > > > > > >> > >>>>> seemed > > > > > >> > >>>>>>> like > > > > > >> > >>>>>>>> kind of a hangover from MapReduce. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> Of course you need to ingest/output data to and from > > the > > > > > stream > > > > > >> > >>>>>>>> processing. But when we actually looked into how that > > > would > > > > > >> > >>>>>>>> work, > > > > > >> > >>>>> Samza > > > > > >> > >>>>>>>> isn't really an ideal data ingestion framework for a > > > bunch > > > > of > > > > > >> > >>>>> reasons. > > > > > >> > >>>>>> To > > > > > >> > >>>>>>>> really do that right you need a pretty different > > internal > > > > > data > > > > > >> > >>>>>>>> model > > > > > >> > >>>>>> and > > > > > >> > >>>>>>>> set of apis. So what if you split them and had an api > > for > > > > > Kafka > > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) and a separate > api > > > for > > > > > >> Kafka > > > > > >> > >>>>>>>> transformation (Samza). > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> This would also allow really embracing the same > > > terminology > > > > > and > > > > > >> > >>>>>>>> conventions. One complaint about the current state is > > > that > > > > > the > > > > > >> > >>>>>>>> two > > > > > >> > >>>>>>> systems > > > > > >> > >>>>>>>> kind of feel bolted on. Terminology like "stream" vs > > > > "topic" > > > > > >> and > > > > > >> > >>>>>>> different > > > > > >> > >>>>>>>> config and monitoring systems means you kind of have > to > > > > learn > > > > > >> > >>>>>>>> Kafka's > > > > > >> > >>>>>>> way, > > > > > >> > >>>>>>>> then learn Samza's slightly different way, then kind > of > > > > > >> > >>>>>>>> understand > > > > > >> > >>>>> how > > > > > >> > >>>>>>> they > > > > > >> > >>>>>>>> map to each other, which having walked a few people > > > through > > > > > >> this > > > > > >> > >>>>>>>> is surprisingly tricky for folks to get. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> Since I have been spending a lot of time on > airplanes I > > > > > hacked > > > > > >> > >>>>>>>> up an ernest but still somewhat incomplete prototype > of > > > > what > > > > > >> > >>>>>>>> this would > > > > > >> > >>>>> look > > > > > >> > >>>>>>>> like. This is just unceremoniously dumped into Kafka > as > > > it > > > > > >> > >>>>>>>> required a > > > > > >> > >>>>>> few > > > > > >> > >>>>>>>> changes to the new consumer. Here is the code: > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>> > > > > > >> > >>>>>> > > > > > >> > >>>>> > > > > > >> > > > > > > > > https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org > > > > > >> > >>>>> /apache/kafka/clients/streaming > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> For the purpose of the prototype I just liberally > > renamed > > > > > >> > >>>>>>>> everything > > > > > >> > >>>>> to > > > > > >> > >>>>>>>> try to align it with Kafka with no regard for > > > > compatibility. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> To use this would be something like this: > > > > > >> > >>>>>>>> Properties props = new Properties(); > > > > > >> > >>>>>>>> props.put("bootstrap.servers", "localhost:4242"); > > > > > >> > >>>>>>>> StreamingConfig config = new > > > > > >> > >>>>> StreamingConfig(props); > > > > > >> > >>>>>>> config.subscribe("test-topic-1", > > > > > >> > >>>>>>>> "test-topic-2"); > > > > > >> config.processor(ExampleStreamProcessor.class); > > > > > >> > >>>>>>> config.serialization(new > > > > > >> > >>>>>>>> StringSerializer(), new StringDeserializer()); > > > > KafkaStreaming > > > > > >> > >>>>>> container = > > > > > >> > >>>>>>>> new KafkaStreaming(config); container.run(); > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> KafkaStreaming is basically the SamzaContainer; > > > > > StreamProcessor > > > > > >> > >>>>>>>> is basically StreamTask. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> So rather than putting all the class names in a file > > and > > > > then > > > > > >> > >>>>>>>> having > > > > > >> > >>>>>> the > > > > > >> > >>>>>>>> job assembled by reflection, you just instantiate the > > > > > container > > > > > >> > >>>>>>>> programmatically. Work is balanced over however many > > > > > instances > > > > > >> > >>>>>>>> of > > > > > >> > >>>>> this > > > > > >> > >>>>>>> are > > > > > >> > >>>>>>>> alive at any time (i.e. if an instance dies, new > tasks > > > are > > > > > >> added > > > > > >> > >>>>>>>> to > > > > > >> > >>>>> the > > > > > >> > >>>>>>>> existing containers without shutting them down). > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> We would provide some glue for running this stuff in > > YARN > > > > via > > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS using some of > their > > > > tools > > > > > >> > >>>>>>>> but from the > > > > > >> > >>>>>> point > > > > > >> > >>>>>>> of > > > > > >> > >>>>>>>> view of these frameworks these stream processing jobs > > are > > > > > just > > > > > >> > >>>>>> stateless > > > > > >> > >>>>>>>> services that can come and go and expand and contract > > at > > > > > will. > > > > > >> > >>>>>>>> There > > > > > >> > >>>>> is > > > > > >> > >>>>>>> no > > > > > >> > >>>>>>>> more custom scheduler. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> Here are some relevant details: > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> 1. It is only ~1300 lines of code, it would get > larger > > > if > > > > we > > > > > >> > >>>>>>>> productionized but not vastly larger. We really do > > get a > > > > ton > > > > > >> > >>>>>>>> of > > > > > >> > >>>>>>> leverage > > > > > >> > >>>>>>>> out of Kafka. > > > > > >> > >>>>>>>> 2. Partition management is fully delegated to the > new > > > > > >> consumer. > > > > > >> > >>>>> This > > > > > >> > >>>>>>>> is nice since now any partition management strategy > > > > > available > > > > > >> > >>>>>>>> to > > > > > >> > >>>>>> Kafka > > > > > >> > >>>>>>>> consumer is also available to Samza (and vice versa) > > and > > > > > with > > > > > >> > >>>>>>>> the > > > > > >> > >>>>>>> exact > > > > > >> > >>>>>>>> same configs. > > > > > >> > >>>>>>>> 3. It supports state as well as state reuse > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is thought > provoking. > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> -Jay > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris Riccomini < > > > > > >> > >>>>>> criccom...@apache.org> > > > > > >> > >>>>>>>> wrote: > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>>> Hey all, > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> I have had some discussions with Samza engineers at > > > > LinkedIn > > > > > >> > >>>>>>>>> and > > > > > >> > >>>>>>> Confluent > > > > > >> > >>>>>>>>> and we came up with a few observations and would > like > > to > > > > > >> > >>>>>>>>> propose > > > > > >> > >>>>> some > > > > > >> > >>>>>>>>> changes. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> We've observed some things that I want to call out > > about > > > > > >> > >>>>>>>>> Samza's > > > > > >> > >>>>>> design, > > > > > >> > >>>>>>>>> and I'd like to propose some changes. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment > system. > > > > > >> > >>>>>>>>> * Samza is too pluggable. > > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's > > > > consumer > > > > > >> > >>>>>>>>> APIs > > > > > >> > >>>>> are > > > > > >> > >>>>>>>>> trying to solve a lot of the same problems. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> All three of these issues are related, but I'll > > address > > > > them > > > > > >> in > > > > > >> > >>>>> order. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Deployment > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Samza strongly depends on the use of a dynamic > > > deployment > > > > > >> > >>>>>>>>> scheduler > > > > > >> > >>>>>> such > > > > > >> > >>>>>>>>> as > > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially built Samza, we > > bet > > > > that > > > > > >> > >>>>>>>>> there > > > > > >> > >>>>>> would > > > > > >> > >>>>>>>>> be > > > > > >> > >>>>>>>>> one or two winners in this area, and we could > support > > > > them, > > > > > >> and > > > > > >> > >>>>>>>>> the > > > > > >> > >>>>>> rest > > > > > >> > >>>>>>>>> would go away. In reality, there are many > variations. > > > > > >> > >>>>>>>>> Furthermore, > > > > > >> > >>>>>> many > > > > > >> > >>>>>>>>> people still prefer to just start their processors > > like > > > > > normal > > > > > >> > >>>>>>>>> Java processes, and use traditional deployment > scripts > > > > such > > > > > as > > > > > >> > >>>>>>>>> Fabric, > > > > > >> > >>>>>> Chef, > > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment system on users > > makes > > > > the > > > > > >> > >>>>>>>>> Samza start-up process really painful for first time > > > > users. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Dynamic deployment as a requirement was also a bit > of > > a > > > > > >> > >>>>>>>>> mis-fire > > > > > >> > >>>>>> because > > > > > >> > >>>>>>>>> of > > > > > >> > >>>>>>>>> a fundamental misunderstanding between the nature of > > > batch > > > > > >> jobs > > > > > >> > >>>>>>>>> and > > > > > >> > >>>>>>> stream > > > > > >> > >>>>>>>>> processing jobs. Early on, we made conscious effort > to > > > > favor > > > > > >> > >>>>>>>>> the > > > > > >> > >>>>>> Hadoop > > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things, since it worked > and > > > was > > > > > well > > > > > >> > >>>>>>> understood. > > > > > >> > >>>>>>>>> One thing that we missed was that batch jobs have a > > > > definite > > > > > >> > >>>>>> beginning, > > > > > >> > >>>>>>>>> and > > > > > >> > >>>>>>>>> end, and stream processing jobs don't (usually). > This > > > > leads > > > > > to > > > > > >> > >>>>>>>>> a > > > > > >> > >>>>> much > > > > > >> > >>>>>>>>> simpler scheduling problem for stream processors. > You > > > > > >> basically > > > > > >> > >>>>>>>>> just > > > > > >> > >>>>>>> need > > > > > >> > >>>>>>>>> to find a place to start the processor, and start > it. > > > The > > > > > way > > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no concept of a > > > cluster > > > > > >> > >>>>>>>>> being "full". We always > > > > > >> > >>>>>> add > > > > > >> > >>>>>>>>> more machines. The problem with coupling Samza with > a > > > > > >> scheduler > > > > > >> > >>>>>>>>> is > > > > > >> > >>>>>> that > > > > > >> > >>>>>>>>> Samza (as a framework) now has to handle deployment. > > > This > > > > > >> pulls > > > > > >> > >>>>>>>>> in a > > > > > >> > >>>>>>> bunch > > > > > >> > >>>>>>>>> of things such as configuration distribution (config > > > > > stream), > > > > > >> > >>>>>>>>> shell > > > > > >> > >>>>>>> scrips > > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging (all the .tgz > > > > stuff), > > > > > >> etc. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Another reason for requiring dynamic deployment was > to > > > > > support > > > > > >> > >>>>>>>>> data locality. If you want to have locality, you > need > > to > > > > put > > > > > >> > >>>>>>>>> your > > > > > >> > >>>>>> processors > > > > > >> > >>>>>>>>> close to the data they're processing. Upon further > > > > > >> > >>>>>>>>> investigation, > > > > > >> > >>>>>>> though, > > > > > >> > >>>>>>>>> this feature is not that beneficial. There is some > > good > > > > > >> > >>>>>>>>> discussion > > > > > >> > >>>>>> about > > > > > >> > >>>>>>>>> some problems with it on SAMZA-335. Again, we took > the > > > > > >> > >>>>>>>>> Map/Reduce > > > > > >> > >>>>>> path, > > > > > >> > >>>>>>>>> but > > > > > >> > >>>>>>>>> there are some fundamental differences between HDFS > > and > > > > > Kafka. > > > > > >> > >>>>>>>>> HDFS > > > > > >> > >>>>>> has > > > > > >> > >>>>>>>>> blocks, while Kafka has partitions. This leads to > less > > > > > >> > >>>>>>>>> optimization potential with stream processors on top > > of > > > > > Kafka. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> This feature is also used as a crutch. Samza doesn't > > > have > > > > > any > > > > > >> > >>>>>>>>> built > > > > > >> > >>>>> in > > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it depends on the > > > dynamic > > > > > >> > >>>>>>>>> deployment scheduling system to handle restarts > when a > > > > > >> > >>>>>>>>> processor dies. This has > > > > > >> > >>>>>>> made > > > > > >> > >>>>>>>>> it very difficult to write a standalone Samza > > container > > > > > >> > >>>> (SAMZA-516). > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Pluggability > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> In some cases pluggability is good, but I think that > > > we've > > > > > >> gone > > > > > >> > >>>>>>>>> too > > > > > >> > >>>>>> far > > > > > >> > >>>>>>>>> with it. Currently, Samza has: > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> * Pluggable config. > > > > > >> > >>>>>>>>> * Pluggable metrics. > > > > > >> > >>>>>>>>> * Pluggable deployment systems. > > > > > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer, > > > > > SystemProducer, > > > > > >> > >>>> etc). > > > > > >> > >>>>>>>>> * Pluggable serdes. > > > > > >> > >>>>>>>>> * Pluggable storage engines. > > > > > >> > >>>>>>>>> * Pluggable strategies for just about every > component > > > > > >> > >>>>> (MessageChooser, > > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper, ConfigRewriter, etc). > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> There's probably more that I've forgotten, as well. > > Some > > > > of > > > > > >> > >>>>>>>>> these > > > > > >> > >>>>> are > > > > > >> > >>>>>>>>> useful, but some have proven not to be. This all > comes > > > at > > > > a > > > > > >> cost: > > > > > >> > >>>>>>>>> complexity. This complexity is making it harder for > > our > > > > > users > > > > > >> > >>>>>>>>> to > > > > > >> > >>>>> pick > > > > > >> > >>>>>> up > > > > > >> > >>>>>>>>> and use Samza out of the box. It also makes it > > difficult > > > > for > > > > > >> > >>>>>>>>> Samza developers to reason about what the > > > characteristics > > > > of > > > > > >> > >>>>>>>>> the container (since the characteristics change > > > depending > > > > on > > > > > >> > >>>>>>>>> which plugins are use). > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> The issues with pluggability are most visible in the > > > > System > > > > > >> APIs. > > > > > >> > >>>>> What > > > > > >> > >>>>>>>>> Samza really requires to be functional is Kafka as > its > > > > > >> > >>>>>>>>> transport > > > > > >> > >>>>>> layer. > > > > > >> > >>>>>>>>> But > > > > > >> > >>>>>>>>> we've conflated two unrelated use cases into one > API: > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka. > > > > > >> > >>>>>>>>> 2. Process the data in Kafka. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> The current System API supports both of these use > > cases. > > > > The > > > > > >> > >>>>>>>>> problem > > > > > >> > >>>>>> is, > > > > > >> > >>>>>>>>> we > > > > > >> > >>>>>>>>> actually want different features for each use case. > By > > > > > >> papering > > > > > >> > >>>>>>>>> over > > > > > >> > >>>>>>> these > > > > > >> > >>>>>>>>> two use cases, and providing a single API, we've > > > > introduced > > > > > a > > > > > >> > >>>>>>>>> ton of > > > > > >> > >>>>>>> leaky > > > > > >> > >>>>>>>>> abstractions. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> For example, what we'd really like in (2) is to have > > > > > >> > >>>>>>>>> monotonically increasing longs for offsets (like > > Kafka). > > > > > This > > > > > >> > >>>>>>>>> would be at odds > > > > > >> > >>>>> with > > > > > >> > >>>>>>> (1), > > > > > >> > >>>>>>>>> though, since different systems have different > > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors. > > > > > >> > >>>>>>>>> There was discussion both on the mailing list and > the > > > SQL > > > > > >> JIRAs > > > > > >> > >>>>> about > > > > > >> > >>>>>>> the > > > > > >> > >>>>>>>>> need for this. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> The same thing holds true for replayability. Kafka > > > allows > > > > us > > > > > >> to > > > > > >> > >>>>> rewind > > > > > >> > >>>>>>>>> when > > > > > >> > >>>>>>>>> we have a failure. Many other systems don't. In some > > > > cases, > > > > > >> > >>>>>>>>> systems > > > > > >> > >>>>>>> return > > > > > >> > >>>>>>>>> null for their offsets (e.g. > WikipediaSystemConsumer) > > > > > because > > > > > >> > >>>>>>>>> they > > > > > >> > >>>>>> have > > > > > >> > >>>>>>> no > > > > > >> > >>>>>>>>> offsets. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Partitioning is another example. Kafka supports > > > > > partitioning, > > > > > >> > >>>>>>>>> but > > > > > >> > >>>>> many > > > > > >> > >>>>>>>>> systems don't. We model this by having a single > > > partition > > > > > for > > > > > >> > >>>>>>>>> those systems. Still, other systems model > partitioning > > > > > >> > >>>> differently (e.g. > > > > > >> > >>>>>>>>> Kinesis). > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> The SystemAdmin interface is also a mess. Creating > > > streams > > > > > in > > > > > >> a > > > > > >> > >>>>>>>>> system-agnostic way is almost impossible. As is > > modeling > > > > > >> > >>>>>>>>> metadata > > > > > >> > >>>>> for > > > > > >> > >>>>>>> the > > > > > >> > >>>>>>>>> system (replication factor, partitions, location, > > etc). > > > > The > > > > > >> > >>>>>>>>> list > > > > > >> > >>>>> goes > > > > > >> > >>>>>>> on. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Duplicate work > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> At the time that we began writing Samza, Kafka's > > > consumer > > > > > and > > > > > >> > >>>>> producer > > > > > >> > >>>>>>>>> APIs > > > > > >> > >>>>>>>>> had a relatively weak feature set. On the > > consumer-side, > > > > you > > > > > >> > >>>>>>>>> had two > > > > > >> > >>>>>>>>> options: use the high level consumer, or the simple > > > > > consumer. > > > > > >> > >>>>>>>>> The > > > > > >> > >>>>>>> problem > > > > > >> > >>>>>>>>> with the high-level consumer was that it controlled > > your > > > > > >> > >>>>>>>>> offsets, partition assignments, and the order in > which > > > you > > > > > >> > >>>>>>>>> received messages. The > > > > > >> > >>>>> problem > > > > > >> > >>>>>>>>> with > > > > > >> > >>>>>>>>> the simple consumer is that it's not simple. It's > > basic. > > > > You > > > > > >> > >>>>>>>>> end up > > > > > >> > >>>>>>> having > > > > > >> > >>>>>>>>> to handle a lot of really low-level stuff that you > > > > > shouldn't. > > > > > >> > >>>>>>>>> We > > > > > >> > >>>>>> spent a > > > > > >> > >>>>>>>>> lot of time to make Samza's KafkaSystemConsumer very > > > > robust. > > > > > >> It > > > > > >> > >>>>>>>>> also allows us to support some cool features: > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> * Per-partition message ordering and prioritization. > > > > > >> > >>>>>>>>> * Tight control over partition assignment to support > > > > joins, > > > > > >> > >>>>>>>>> global > > > > > >> > >>>>>> state > > > > > >> > >>>>>>>>> (if we want to implement it :)), etc. > > > > > >> > >>>>>>>>> * Tight control over offset checkpointing. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> What we didn't realize at the time is that these > > > features > > > > > >> > >>>>>>>>> should > > > > > >> > >>>>>>> actually > > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers (not just > Samza > > > > stream > > > > > >> > >>>>>> processors) > > > > > >> > >>>>>>>>> end up wanting to do things like joins and partition > > > > > >> > >>>>>>>>> assignment. The > > > > > >> > >>>>>>> Kafka > > > > > >> > >>>>>>>>> community has come to the same conclusion. They're > > > adding > > > > a > > > > > >> ton > > > > > >> > >>>>>>>>> of upgrades into their new Kafka consumer > > > implementation. > > > > > To a > > > > > >> > >>>>>>>>> large extent, > > > > > >> > >>>>> it's > > > > > >> > >>>>>>>>> duplicate work to what we've already done in Samza. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking a very similar > > > > > approach > > > > > >> > >>>>>>>>> to > > > > > >> > >>>>>> Samza's > > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation for handling > > > offset > > > > > >> > >>>>>> checkpointing. > > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset management feature > > stores > > > > > >> offset > > > > > >> > >>>>>>>>> checkpoints in a topic, and allows you to fetch them > > > from > > > > > the > > > > > >> > >>>>>>>>> broker. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> A lot of this seems like a waste, since we could > have > > > > shared > > > > > >> > >>>>>>>>> the > > > > > >> > >>>>> work > > > > > >> > >>>>>> if > > > > > >> > >>>>>>>>> it > > > > > >> > >>>>>>>>> had been done in Kafka from the get-go. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Vision > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> All of this leads me to a rather radical proposal. > > Samza > > > > is > > > > > >> > >>>>> relatively > > > > > >> > >>>>>>>>> stable at this point. I'd venture to say that we're > > > near a > > > > > 1.0 > > > > > >> > >>>>>> release. > > > > > >> > >>>>>>>>> I'd > > > > > >> > >>>>>>>>> like to propose that we take what we've learned, and > > > begin > > > > > >> > >>>>>>>>> thinking > > > > > >> > >>>>>>> about > > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we change if we were > > > starting > > > > > >> from > > > > > >> > >>>>>> scratch? > > > > > >> > >>>>>>>>> My > > > > > >> > >>>>>>>>> proposal is to: > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* way to run Samza > > > > > >> > >>>>>>>>> processors, and eliminate all direct dependences on > > > YARN, > > > > > >> Mesos, > > > > > >> > >>>> etc. > > > > > >> > >>>>>>>>> 2. Make a definitive call to support only Kafka as > the > > > > > stream > > > > > >> > >>>>>> processing > > > > > >> > >>>>>>>>> layer. > > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, > serialization, > > > and > > > > > >> > >>>>>>>>> config > > > > > >> > >>>>>>> systems, > > > > > >> > >>>>>>>>> and simply use Kafka's instead. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> This would fix all of the issues that I outlined > > above. > > > It > > > > > >> > >>>>>>>>> should > > > > > >> > >>>>> also > > > > > >> > >>>>>>>>> shrink the Samza code base pretty dramatically. > > > Supporting > > > > > >> only > > > > > >> > >>>>>>>>> a standalone container will allow Samza to be > executed > > > on > > > > > YARN > > > > > >> > >>>>>>>>> (using Slider), Mesos (using Marathon/Aurora), or > most > > > > other > > > > > >> > >>>>>>>>> in-house > > > > > >> > >>>>>>> deployment > > > > > >> > >>>>>>>>> systems. This should make life a lot easier for new > > > users. > > > > > >> > >>>>>>>>> Imagine > > > > > >> > >>>>>>> having > > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN. The drop in > > > mailing > > > > > >> list > > > > > >> > >>>>>> traffic > > > > > >> > >>>>>>>>> will be pretty dramatic. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Coupling with Kafka seems long overdue to me. The > > > reality > > > > > is, > > > > > >> > >>>>> everyone > > > > > >> > >>>>>>>>> that > > > > > >> > >>>>>>>>> I'm aware of is using Samza with Kafka. We basically > > > > require > > > > > >> it > > > > > >> > >>>>>> already > > > > > >> > >>>>>>> in > > > > > >> > >>>>>>>>> order for most features to work. Those that are > using > > > > other > > > > > >> > >>>>>>>>> systems > > > > > >> > >>>>>> are > > > > > >> > >>>>>>>>> generally using it for ingest into Kafka (1), and > then > > > > they > > > > > do > > > > > >> > >>>>>>>>> the processing on top. There is already discussion ( > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>> > > > > > >> > >>>>>> > > > > > >> > >>>>> > > > > > >> > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851 > > > > > >> > >>>>> 767 > > > > > >> > >>>>>>>>> ) > > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka extremely > easy. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Once we make the call to couple with Kafka, we can > > > > leverage > > > > > a > > > > > >> > >>>>>>>>> ton of > > > > > >> > >>>>>>> their > > > > > >> > >>>>>>>>> ecosystem. We no longer have to maintain our own > > config, > > > > > >> > >>>>>>>>> metrics, > > > > > >> > >>>>> etc. > > > > > >> > >>>>>>> We > > > > > >> > >>>>>>>>> can all share the same libraries, and make them > > better. > > > > This > > > > > >> > >>>>>>>>> will > > > > > >> > >>>>> also > > > > > >> > >>>>>>>>> allow us to share the consumer/producer APIs, and > will > > > let > > > > > us > > > > > >> > >>>>> leverage > > > > > >> > >>>>>>>>> their offset management and partition management, > > rather > > > > > than > > > > > >> > >>>>>>>>> having > > > > > >> > >>>>>> our > > > > > >> > >>>>>>>>> own. All of the coordinator stream code would go > away, > > > as > > > > > >> would > > > > > >> > >>>>>>>>> most > > > > > >> > >>>>>> of > > > > > >> > >>>>>>>>> the > > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably have to push some > > > > > partition > > > > > >> > >>>>>>> management > > > > > >> > >>>>>>>>> features into the Kafka broker, but they're already > > > moving > > > > > in > > > > > >> > >>>>>>>>> that direction with the new consumer API. The > features > > > we > > > > > have > > > > > >> > >>>>>>>>> for > > > > > >> > >>>>>> partition > > > > > >> > >>>>>>>>> assignment aren't unique to Samza, and seem like > they > > > > should > > > > > >> be > > > > > >> > >>>>>>>>> in > > > > > >> > >>>>>> Kafka > > > > > >> > >>>>>>>>> anyway. There will always be some niche usages which > > > will > > > > > >> > >>>>>>>>> require > > > > > >> > >>>>>> extra > > > > > >> > >>>>>>>>> care and hence full control over partition > assignments > > > > much > > > > > >> > >>>>>>>>> like the > > > > > >> > >>>>>>> Kafka > > > > > >> > >>>>>>>>> low level consumer api. These would continue to be > > > > > supported. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> These items will be good for the Samza community. > > > They'll > > > > > make > > > > > >> > >>>>>>>>> Samza easier to use, and make it easier for > developers > > > to > > > > > add > > > > > >> > >>>>>>>>> new features. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Obviously this is a fairly large (and somewhat > > backwards > > > > > >> > >>>>> incompatible > > > > > >> > >>>>>>>>> change). If we choose to go this route, it's > important > > > > that > > > > > we > > > > > >> > >>>>> openly > > > > > >> > >>>>>>>>> communicate how we're going to provide a migration > > path > > > > from > > > > > >> > >>>>>>>>> the > > > > > >> > >>>>>>> existing > > > > > >> > >>>>>>>>> APIs to the new ones (if we make incompatible > > changes). > > > I > > > > > >> think > > > > > >> > >>>>>>>>> at a minimum, we'd probably need to provide a > wrapper > > to > > > > > allow > > > > > >> > >>>>>>>>> existing StreamTask implementations to continue > > running > > > on > > > > > the > > > > > >> > >>>> new container. > > > > > >> > >>>>>>> It's > > > > > >> > >>>>>>>>> also important that we openly communicate about > > timing, > > > > and > > > > > >> > >>>>>>>>> stages > > > > > >> > >>>>> of > > > > > >> > >>>>>>> the > > > > > >> > >>>>>>>>> migration. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> If you made it this far, I'm sure you have opinions. > > :) > > > > > Please > > > > > >> > >>>>>>>>> send > > > > > >> > >>>>>> your > > > > > >> > >>>>>>>>> thoughts and feedback. > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>>> Cheers, > > > > > >> > >>>>>>>>> Chris > > > > > >> > >>>>>>>>> > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>>> > > > > > >> > >>>>>>> > > > > > >> > >>>>>> > > > > > >> > >>>>>> > > > > > >> > >>>>>> > > > > > >> > >>>>>> -- > > > > > >> > >>>>>> -- Guozhang > > > > > >> > >>>>>> > > > > > >> > >>>>> > > > > > >> > >>>> > > > > > >> > >> > > > > > >> > >> > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > >