+1. I would love to see the "documents comparing and contrasting Samza with MUPD8 and Storm."

On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar <e...@apache.org> wrote:
> +1 on incubation.
>
> Enis
>
>
> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini <criccomini....@gmail.com> wrote:
>
> > Hey Henry and Debo,
> >
> > Thanks for calling this out. Samza's feature set includes:
> >
> >    - *Simple API:* Unlike most low-level messaging system APIs, Samza
> >    provides a very simple call-back based "process message" API that
> >    should be familiar to anyone that's used Map/Reduce.
> >    - *Managed state:* Samza manages snapshotting and restoration of a
> >    stream processor's state. Samza will restore a stream processor's
> >    state to a snapshot consistent with the processor's last read
> >    messages when the processor is restarted.
> >    - *Fault tolerance:* Samza will work with YARN to restart your stream
> >    processor if there is a machine or processor failure.
> >    - *Durability:* Samza uses Kafka to guarantee that no messages will
> >    ever be lost.
> >    - *Scalability:* Samza is partitioned and distributed at every level.
> >    Kafka provides ordered, partitioned, replayable, fault-tolerant
> >    streams. YARN provides a distributed environment for Samza containers
> >    to run in.
> >    - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> >    Samza provides a pluggable API that lets you run Samza with other
> >    messaging systems and execution environments.
> >    - *Processor isolation:* Samza works with Apache YARN, which supports
> >    processor security through Hadoop's security model, and resource
> >    isolation through Linux CGroups.
> >
> > Some of these features are available in S4, and some are not. The same
> > holds true for Storm.
> >
> > The open source stream processing systems that are available are
> > actually quite young, and no single system offers a complete solution.
> > Problems like how a stream processor's state (e.g. counts) should be
> > managed, whether a stream should be buffered remotely on disk or not,
> > what to do when duplicate messages are received or messages are lost,
> > and how to model underlying messaging systems are all pretty new.
> >
> > Samza's main differentiators are:
> >
> >    - State is modeled as a stream. When a processor fails and is
> >    restarted, the state stream is entirely replayed to restore it.
> >    - Streams are ordered, partitioned, replayable, and fault tolerant.
> >    - YARN is used for processor isolation, security, and fault tolerance.
> >    - All streams are materialized to Kafka.
> >
> > If you guys are interested, I have much more in-depth documents
> > comparing and contrasting Samza with MUPD8 and Storm.
> >
> > Cheers,
> > Chris
> >
> >
> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> >
> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/),
> > > which allows stream and real time data processing via DAG?
> > >
> > >
> > > - Henry
> > >
> > >
> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini....@gmail.com> wrote:
> > >
> > > > Hey All,
> > > >
> > > > Sending along an incubator proposal for Samza.
> > > >
> > > > Thanks!
> > > > Chris
> > > >
> > > > https://wiki.apache.org/incubator/SamzaProposal
> > > >
> > > > --------------------------------------------
> > > >
> > > > == Abstract ==
> > > >
> > > > Samza is a stream processing system for running continuous
> > > > computation on infinite streams of data.
> > > >
> > > > == Proposal ==
> > > >
> > > > Samza provides a system for processing stream data from
> > > > publish-subscribe systems such as Apache Kafka. The developer writes
> > > > a stream processing task, and executes it as a Samza job. Samza then
> > > > routes messages between stream processing tasks and the
> > > > publish-subscribe systems that the messages are addressed to.
> > > >
> > > > == Background ==
> > > >
> > > > Samza was developed at LinkedIn to enable easier processing of
> > > > streaming data on top of Apache Kafka. Current use cases include
> > > > content processing pipelines, aggregating operational log data, data
> > > > ingestion into distributed database infrastructure, and measuring
> > > > user activity across different aggregation types.
> > > >
> > > > Samza is focused on providing an easy to use framework to process
> > > > streams. It uses Apache YARN to provide a mechanism for deploying
> > > > stream processing tasks in a distributed cluster. Samza also takes
> > > > advantage of YARN to make decisions about stream processor locality
> > > > and co-partitioning of streams, and to provide security. Apache Kafka
> > > > is also leveraged to provide a mechanism to pass messages from one
> > > > stream processor to the next. Apache Kafka is also used to help
> > > > manage a stream processor's state, so that it can be recovered in the
> > > > event of a failure.
> > > >
> > > > Samza is written in Scala. It was developed internally at LinkedIn to
> > > > meet our particular use cases, but will be useful to many
> > > > organizations facing a similar need to reliably process large amounts
> > > > of streaming data. Therefore, we would like to share it with the ASF
> > > > and begin developing a community of developers and users within
> > > > Apache.
> > > >
> > > > == Rationale ==
> > > >
> > > > Many organizations can benefit from a reliable stream processing
> > > > system such as Samza. While our use case of processing events from a
> > > > large website like LinkedIn has driven the design of Samza, its uses
> > > > are varied and we expect many new use cases to emerge. Samza provides
> > > > a generic API to process messages from streaming infrastructure and
> > > > will appeal to many users.
> > > >
> > > > == Current Status ==
> > > >
> > > > === Meritocracy ===
> > > >
> > > > Our intent with this incubator proposal is to start building a
> > > > diverse developer community around Samza following the Apache
> > > > meritocracy model. Since Samza was initially developed in late 2011,
> > > > we have had fast adoption and contributions by multiple teams at
> > > > LinkedIn. We plan to continue support for new contributors and work
> > > > with those who contribute significantly to the project to make them
> > > > committers.
> > > >
> > > > === Community ===
> > > >
> > > > Samza is currently being used internally at LinkedIn. We hope to
> > > > extend our contributor base significantly and invite all those who
> > > > are interested in building large-scale distributed systems to
> > > > participate.
> > > >
> > > > === Core Developers ===
> > > >
> > > > Samza is currently being developed by four engineers at LinkedIn: Jay
> > > > Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is
> > > > an ASF Member, Incubator PMC member and PMC member on Apache Hadoop,
> > > > Kafka and Giraph. Jay is a member of the Apache Kafka PMC and
> > > > contributor to various Apache projects. Chris has been an active
> > > > contributor for several projects including Apache Kafka and Apache
> > > > YARN. Sriram has contributed to Samza, as well as Apache Kafka.
> > > >
> > > > === Alignment ===
> > > >
> > > > The ASF is the natural choice to host the Samza project as its goal
> > > > of encouraging community-driven open-source projects fits with our
> > > > vision for Samza. Additionally, many other projects with which we are
> > > > familiar and expect Samza to integrate with, such as Apache
> > > > ZooKeeper, YARN, HDFS and log4j, are hosted by the ASF and we will
> > > > benefit and provide benefit by close proximity to them.
> > > >
> > > > == Known Risks ==
> > > >
> > > > === Orphaned Products ===
> > > >
> > > > The core developers plan to work full time on the project. There is
> > > > very little risk of Samza being abandoned as it is part of LinkedIn's
> > > > internal infrastructure.
> > > >
> > > > === Inexperience with Open Source ===
> > > >
> > > > All of the core developers have experience with open source
> > > > development. Jay and Chris have been involved with several open
> > > > source projects released by LinkedIn, and Jay is a committer on
> > > > Apache Kafka. Jakob has been actively involved with the ASF as a
> > > > full-time Hadoop committer and PMC member. Sriram is a contributor to
> > > > Apache Kafka.
> > > >
> > > > === Homogeneous Developers ===
> > > >
> > > > The current core developers are all from LinkedIn. However, we hope
> > > > to establish a developer community that includes contributors from
> > > > several corporations and we are actively encouraging new contributors
> > > > via the mailing lists and public presentations of Samza.
> > > >
> > > > === Reliance on Salaried Developers ===
> > > >
> > > > Currently, the developers are paid to do work on Samza. However, once
> > > > the project has a community built around it, we expect to get
> > > > committers, developers and community from outside the current core
> > > > developers. However, because LinkedIn relies on Samza internally, the
> > > > reliance on salaried developers is unlikely to change.
> > > >
> > > > === Relationships with Other Apache Products ===
> > > >
> > > > Samza is deeply integrated with Apache products. Samza uses Apache
> > > > Kafka as its underlying message passing system. Samza also uses
> > > > Apache YARN for task scheduling. Both YARN and Kafka, in turn, rely
> > > > on Apache ZooKeeper for coordination. In addition, we hope to
> > > > integrate with Apache HDFS in the near future.
> > > >
> > > > === An Excessive Fascination with the Apache Brand ===
> > > >
> > > > While we respect the reputation of the Apache brand and have no
> > > > doubts that it will attract contributors and users, our interest is
> > > > primarily to give Samza a solid home as an open source project
> > > > following an established development model. We have also given
> > > > reasons in the Rationale and Alignment sections.
> > > >
> > > > == Documentation ==
> > > >
> > > > http://wiki.apache.org/incubator/SamzaProposal
> > > >
> > > > == Initial Source ==
> > > >
> > > > Available upon request.
> > > >
> > > > == External Dependencies ==
> > > >
> > > > The dependencies all have Apache compatible licenses.
> > > >
> > > >  * metrics (Apache 2.0)
> > > >  * zkclient (Apache 2.0)
> > > >  * zookeeper (Apache 2.0)
> > > >  * jetty (Apache 2.0)
> > > >  * jackson (Apache 2.0)
> > > >  * commons-httpclient (Apache 2.0)
> > > >  * slf4j (MIT)
> > > >  * avro (Apache 2.0)
> > > >  * hadoop (Apache 2.0)
> > > >  * junit (Common Public License)
> > > >  * grizzled-slf4j (BSD)
> > > >  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
> > > >  * scala (http://www.scala-lang.org/node/146)
> > > >  * joptsimple (MIT)
> > > >  * kafka (Apache 2.0)
> > > >  * scalate (Apache 2.0)
> > > >  * leveldb jni (BSD)
> > > >
> > > > == Cryptography ==
> > > >
> > > > Samza will depend on secure Hadoop, which can optionally use
> > > > Kerberos.
> > > >
> > > > == Required Resources ==
> > > >
> > > > === Mailing Lists ===
> > > >
> > > >  samza-private for private PMC discussions (with moderated subscriptions)
> > > >  samza-dev
> > > >  samza-commits
> > > >  samza-user
> > > >
> > > > === Subversion Directory ===
> > > >
> > > > Git is the preferred source control system: git://git.apache.org/samza
> > > >
> > > > === Issue Tracking ===
> > > >
> > > > JIRA Samza (SAMZA)
> > > >
> > > > === Other Resources ===
> > > >
> > > > The existing code already has unit tests, so we would like a Hudson
> > > > instance to run them whenever a new patch is submitted. This can be
> > > > added after project creation.
> > > >
> > > > == Initial Committers ==
> > > >
> > > >  * Jay Kreps
> > > >  * Jakob Homan
> > > >  * Chris Riccomini
> > > >  * Sriram Subramanian
> > > >
> > > > == Affiliations ==
> > > >
> > > >  * Jay Kreps (LinkedIn)
> > > >  * Jakob Homan (LinkedIn)
> > > >  * Chris Riccomini (LinkedIn)
> > > >  * Sriram Subramanian (LinkedIn)
> > > >
> > > > == Sponsors ==
> > > >
> > > > === Champion ===
> > > >
> > > > Jakob Homan (Apache Member)
> > > >
> > > > === Nominated Mentors ===
> > > >
> > > >  * Arun C Murthy <acmurthy at apache dot org>
> > > >  * Chris Douglas <cdouglas at apache dot org>
> > > >  * Roman Shaposhnik <rvs at apache dot org>
> > > >
> > > > === Sponsoring Entity ===
> > > >
> > > > We are requesting the Incubator to sponsor this project.
> > > >

--
Best Regards,
-- Alex