Same here. Not that it matters as far as admission to the incubator (that vote is over now anyway), but I think a lot of people (including potential users of Samza) would like to see more about how it compares and contrasts with other stream-oriented systems.
Phil

This message optimized for indexing by NSA PRISM

On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu <akaras...@apache.org> wrote:

> +1
>
> I would love to see the "documents comparing and contrasting Samza with
> MUPD8 and Storm."
>
>
> On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar <e...@apache.org> wrote:
>
>> +1 on incubation.
>>
>> Enis
>>
>>
>> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
>> <criccomini....@gmail.com> wrote:
>>
>> > Hey Henry and Debo,
>> >
>> > Thanks for calling this out. Samza's feature set includes:
>> >
>> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
>> >   provides a very simple callback-based "process message" API that
>> >   should be familiar to anyone who has used Map/Reduce.
>> > - *Managed state:* Samza manages snapshotting and restoration of a
>> >   stream processor's state. Samza will restore a stream processor's
>> >   state to a snapshot consistent with the processor's last read
>> >   messages when the processor is restarted.
>> > - *Fault tolerance:* Samza will work with YARN to restart your stream
>> >   processor if there is a machine or processor failure.
>> > - *Durability:* Samza uses Kafka to guarantee that no messages will ever
>> >   be lost.
>> > - *Scalability:* Samza is partitioned and distributed at every level.
>> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
>> >   streams. YARN provides a distributed environment for Samza containers
>> >   to run in.
>> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
>> >   Samza provides a pluggable API that lets you run Samza with other
>> >   messaging systems and execution environments.
>> > - *Processor isolation:* Samza works with Apache YARN, which supports
>> >   processor security through Hadoop's security model, and resource
>> >   isolation through Linux CGroups.
>> >
>> > Some of these features are available in S4, and some are not. The same
>> > holds true for Storm.
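To make the callback-style "process message" API and the managed-state bullet more concrete, here is a minimal Python sketch. It is not Samza's actual API: `CountingTask`, `Snapshot`, and `run` are hypothetical names invented for illustration, and a plain Python list stands in for an ordered, replayable stream. The only idea it tries to show is that a snapshot pairs the task's state with the offset of the next unread message, so a restarted task resumes from a consistent point.

```python
from collections import Counter
from dataclasses import dataclass, field


class CountingTask:
    """Hypothetical stream task that counts messages per key."""

    def __init__(self):
        self.counts = Counter()

    def process(self, message):
        # Callback invoked once for every incoming message.
        self.counts[message["key"]] += 1


@dataclass
class Snapshot:
    next_offset: int = 0                       # first offset NOT yet reflected in `state`
    state: dict = field(default_factory=dict)  # copy of the task's counts at that point


def run(task, stream, snapshot, stop_before=None, snapshot_every=2):
    """Restore `task` from `snapshot`, feed it `stream`, and return the latest snapshot."""
    task.counts = Counter(snapshot.state)
    end = len(stream) if stop_before is None else stop_before
    for offset in range(snapshot.next_offset, end):
        task.process(stream[offset])
        if (offset + 1) % snapshot_every == 0:
            # State and read position are captured together, so they stay consistent.
            snapshot = Snapshot(offset + 1, dict(task.counts))
    return snapshot


stream = [{"key": k} for k in ["a", "b", "a", "c", "a"]]

# The first run "fails" after processing offsets 0..2; the last snapshot covers 0..1.
snap = run(CountingTask(), stream, Snapshot(), stop_before=3)

# A fresh task restores the snapshot and resumes at offset 2, so no message is
# counted twice and none is skipped.
restarted = CountingTask()
run(restarted, stream, snap)
final_counts = dict(restarted.counts)
```

In this sketch the message at offset 2 is processed twice overall (once before the simulated failure, once after the restart), but the restored state does not include the first attempt, so the final counts match a run with no failure at all.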
>> >
>> > The open source stream processing systems that are available are actually
>> > quite young, and no single system offers a complete solution. Problems
>> > like how a stream processor's state (e.g. counts) should be managed,
>> > whether a stream should be buffered remotely on disk or not, what to do
>> > when duplicate messages are received or messages are lost, and how to
>> > model underlying messaging systems are all pretty new.
>> >
>> > Samza's main differentiators are:
>> >
>> > - State is modeled as a stream. When a processor fails and is restarted,
>> >   the state stream is entirely replayed to restore it.
>> > - Streams are ordered, partitioned, replayable, and fault tolerant.
>> > - YARN is used for processor isolation, security, and fault tolerance.
>> > - All streams are materialized to Kafka.
>> >
>> > If you guys are interested, I have much more in-depth documents comparing
>> > and contrasting Samza with MUPD8 and Storm.
>> >
>> > Cheers,
>> > Chris
>> >
>> >
>> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com>
>> > wrote:
>> >
>> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/) which
>> > > allow stream and real time data processing via DAG?
>> > >
>> > >
>> > > - Henry
>> > >
>> > >
>> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini....@gmail.com>
>> > > wrote:
>> > >
>> > > > Hey All,
>> > > >
>> > > > Sending along an incubator proposal for Samza.
>> > > >
>> > > > Thanks!
>> > > > Chris
>> > > >
>> > > > https://wiki.apache.org/incubator/SamzaProposal
>> > > >
>> > > > --------------------------------------------
>> > > >
>> > > > == Abstract ==
>> > > >
>> > > > Samza is a stream processing system for running continuous computation
>> > > > on infinite streams of data.
>> > > >
>> > > > == Proposal ==
>> > > >
>> > > > Samza provides a system for processing stream data from
>> > > > publish-subscribe systems such as Apache Kafka.
>> > > > The developer writes a stream processing task, and executes it as a
>> > > > Samza job. Samza then routes messages between stream processing tasks
>> > > > and the publish-subscribe systems that the messages are addressed to.
>> > > >
>> > > > == Background ==
>> > > >
>> > > > Samza was developed at LinkedIn to enable easier processing of
>> > > > streaming data on top of Apache Kafka. Current use cases include
>> > > > content processing pipelines, aggregating operational log data, data
>> > > > ingestion into distributed database infrastructure, and measuring user
>> > > > activity across different aggregation types.
>> > > >
>> > > > Samza is focused on providing an easy-to-use framework to process
>> > > > streams. It uses Apache YARN to provide a mechanism for deploying
>> > > > stream processing tasks in a distributed cluster. Samza also takes
>> > > > advantage of YARN to make decisions about stream processor locality
>> > > > and co-partitioning of streams, and to provide security. Apache Kafka
>> > > > is also leveraged to provide a mechanism to pass messages from one
>> > > > stream processor to the next. Apache Kafka is also used to help manage
>> > > > a stream processor's state, so that it can be recovered in the event
>> > > > of a failure.
>> > > >
>> > > > Samza is written in Scala. It was developed internally at LinkedIn to
>> > > > meet our particular use cases, but will be useful to many
>> > > > organizations facing a similar need to reliably process large amounts
>> > > > of streaming data. Therefore, we would like to share it with the ASF
>> > > > and begin developing a community of developers and users within
>> > > > Apache.
>> > > >
>> > > > == Rationale ==
>> > > >
>> > > > Many organizations can benefit from a reliable stream processing
>> > > > system such as Samza.
>> > > > While our use case of processing events from a large website like
>> > > > LinkedIn has driven the design of Samza, its uses are varied and we
>> > > > expect many new use cases to emerge. Samza provides a generic API to
>> > > > process messages from streaming infrastructure and will appeal to many
>> > > > users.
>> > > >
>> > > > == Current Status ==
>> > > >
>> > > > === Meritocracy ===
>> > > >
>> > > > Our intent with this incubator proposal is to start building a diverse
>> > > > developer community around Samza following the Apache meritocracy
>> > > > model. Since Samza was initially developed in late 2011, we have had
>> > > > fast adoption and contributions by multiple teams at LinkedIn. We plan
>> > > > to continue support for new contributors and work with those who
>> > > > contribute significantly to the project to make them committers.
>> > > >
>> > > > === Community ===
>> > > >
>> > > > Samza is currently being used internally at LinkedIn. We hope to
>> > > > extend our contributor base significantly and invite all those who are
>> > > > interested in building large-scale distributed systems to participate.
>> > > >
>> > > > === Core Developers ===
>> > > >
>> > > > Samza is currently being developed by four engineers at LinkedIn: Jay
>> > > > Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is
>> > > > an ASF Member, Incubator PMC member and PMC member on Apache Hadoop,
>> > > > Kafka and Giraph. Jay is a member of the Apache Kafka PMC and
>> > > > contributor to various Apache projects. Chris has been an active
>> > > > contributor for several projects including Apache Kafka and Apache
>> > > > YARN. Sriram has contributed to Samza, as well as Apache Kafka.
>> > > >
>> > > > === Alignment ===
>> > > >
>> > > > The ASF is the natural choice to host the Samza project as its goal of
>> > > > encouraging community-driven open-source projects fits with our vision
>> > > > for Samza. Additionally, many other projects with which we are
>> > > > familiar and expect Samza to integrate, such as Apache ZooKeeper,
>> > > > YARN, HDFS and log4j, are hosted by the ASF, and we will benefit and
>> > > > provide benefit by close proximity to them.
>> > > >
>> > > > == Known Risks ==
>> > > >
>> > > > === Orphaned Products ===
>> > > >
>> > > > The core developers plan to work full time on the project. There is
>> > > > very little risk of Samza being abandoned as it is part of LinkedIn's
>> > > > internal infrastructure.
>> > > >
>> > > > === Inexperience with Open Source ===
>> > > >
>> > > > All of the core developers have experience with open source
>> > > > development. Jay and Chris have been involved with several open source
>> > > > projects released by LinkedIn, and Jay is a committer on Apache Kafka.
>> > > > Jakob has been actively involved with the ASF as a full-time Hadoop
>> > > > committer and PMC member. Sriram is a contributor to Apache Kafka.
>> > > >
>> > > > === Homogeneous Developers ===
>> > > >
>> > > > The current core developers are all from LinkedIn. However, we hope to
>> > > > establish a developer community that includes contributors from
>> > > > several corporations, and we are actively encouraging new contributors
>> > > > via the mailing lists and public presentations of Samza.
>> > > >
>> > > > === Reliance on Salaried Developers ===
>> > > >
>> > > > Currently, the developers are paid to do work on Samza. However, once
>> > > > the project has a community built around it, we expect to get
>> > > > committers, developers and community from outside the current core
>> > > > developers.
>> > > > However, because LinkedIn relies on Samza internally, the reliance on
>> > > > salaried developers is unlikely to change.
>> > > >
>> > > > === Relationships with Other Apache Products ===
>> > > >
>> > > > Samza is deeply integrated with Apache products. Samza uses Apache
>> > > > Kafka as its underlying message passing system. Samza also uses Apache
>> > > > YARN for task scheduling. Both YARN and Kafka, in turn, rely on Apache
>> > > > ZooKeeper for coordination. In addition, we hope to integrate with
>> > > > Apache HDFS in the near future.
>> > > >
>> > > > === An Excessive Fascination with the Apache Brand ===
>> > > >
>> > > > While we respect the reputation of the Apache brand and have no doubts
>> > > > that it will attract contributors and users, our interest is primarily
>> > > > to give Samza a solid home as an open source project following an
>> > > > established development model. We have also given reasons in the
>> > > > Rationale and Alignment sections.
>> > > >
>> > > > == Documentation ==
>> > > >
>> > > > http://wiki.apache.org/incubator/SamzaProposal
>> > > >
>> > > > == Initial Source ==
>> > > >
>> > > > Available upon request.
>> > > >
>> > > > == External Dependencies ==
>> > > >
>> > > > The dependencies all have Apache compatible licenses.
>> > > >
>> > > > * metrics (Apache 2.0)
>> > > > * zkclient (Apache 2.0)
>> > > > * zookeeper (Apache 2.0)
>> > > > * jetty (Apache 2.0)
>> > > > * jackson (Apache 2.0)
>> > > > * commons-httpclient (Apache 2.0)
>> > > > * slf4j (MIT)
>> > > > * avro (Apache 2.0)
>> > > > * hadoop (Apache 2.0)
>> > > > * junit (Common Public License)
>> > > > * grizzled-slf4j (BSD)
>> > > > * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
>> > > > * scala (http://www.scala-lang.org/node/146)
>> > > > * joptsimple (MIT)
>> > > > * kafka (Apache 2.0)
>> > > > * scalate (Apache 2.0)
>> > > > * leveldb jni (BSD)
>> > > >
>> > > > == Cryptography ==
>> > > >
>> > > > Samza will depend on secure Hadoop, which can optionally use Kerberos.
>> > > >
>> > > > == Required Resources ==
>> > > >
>> > > > === Mailing Lists ===
>> > > >
>> > > > samza-private for private PMC discussions (with moderated subscriptions)
>> > > > samza-dev
>> > > > samza-commits
>> > > > samza-user
>> > > >
>> > > > === Subversion Directory ===
>> > > >
>> > > > Git is the preferred source control system: git://git.apache.org/samza
>> > > >
>> > > > === Issue Tracking ===
>> > > >
>> > > > JIRA Samza (SAMZA)
>> > > >
>> > > > === Other Resources ===
>> > > >
>> > > > The existing code already has unit tests, so we would like a Hudson
>> > > > instance to run them whenever a new patch is submitted. This can be
>> > > > added after project creation.
>> > > >
>> > > > == Initial Committers ==
>> > > >
>> > > > * Jay Kreps
>> > > > * Jakob Homan
>> > > > * Chris Riccomini
>> > > > * Sriram Subramanian
>> > > >
>> > > > == Affiliations ==
>> > > >
>> > > > * Jay Kreps (LinkedIn)
>> > > > * Jakob Homan (LinkedIn)
>> > > > * Chris Riccomini (LinkedIn)
>> > > > * Sriram Subramanian (LinkedIn)
>> > > >
>> > > > == Sponsors ==
>> > > >
>> > > > === Champion ===
>> > > >
>> > > > Jakob Homan (Apache Member)
>> > > >
>> > > > === Nominated Mentors ===
>> > > >
>> > > > * Arun C Murthy <acmurthy at apache dot org>
>> > > > * Chris Douglas <cdouglas at apache dot org>
>> > > > * Roman Shaposhnik <rvs at apache dot org>
>> > > >
>> > > > === Sponsoring Entity ===
>> > > >
>> > > > We are requesting the Incubator to sponsor this project.
>> > > >
>> > >
>> >
>>
>
> --
> Best Regards,
> -- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org