Re: [PROPOSAL] Samza Proposal

Enis Söztutar Fri, 26 Jul 2013 16:54:51 -0700

+1 on incubation.

Enis



On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
<criccomini....@gmail.com> wrote:

> Hey Henry and Debo,
>
> Thanks for calling this out. Samza's feature set includes:
>
>    - *Simple API:* Unlike most low-level messaging system APIs, Samza
>    provides a very simple callback-based "process message" API that
>    should be familiar to anyone who has used Map/Reduce.
>    - *Managed state:* Samza manages snapshotting and restoration of a
>    stream processor's state. Samza will restore a stream processor's state
>    to a snapshot consistent with the processor's last read messages when
>    the processor is restarted.
>    - *Fault tolerance:* Samza will work with YARN to restart your stream
>    processor if there is a machine or processor failure.
>    - *Durability:* Samza uses Kafka to guarantee that no messages will ever
>    be lost.
>    - *Scalability:* Samza is partitioned and distributed at every level.
>    Kafka provides ordered, partitioned, replayable, fault-tolerant streams.
>    YARN provides a distributed environment for Samza containers to run in.
>    - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
>    Samza provides a pluggable API that lets you run Samza with other
>    messaging systems and execution environments.
>    - *Processor isolation:* Samza works with Apache YARN, which supports
>    processor security through Hadoop's security model, and resource
>    isolation through Linux CGroups.
>
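To make the callback-based "process message" idea above concrete, here is a rough sketch. It is a toy in Python, not Samza's actual API (Samza itself is written in Scala); the names `WordCountTask`, `process`, `emit`, and `run` are all hypothetical, and an in-memory list stands in for a Kafka stream.

```python
from collections import defaultdict


class WordCountTask:
    """Hypothetical task: the framework calls process() once per message."""

    def __init__(self):
        # Local state owned by the task (here, a running count per word).
        self.counts = defaultdict(int)

    def process(self, message, emit):
        # Update state and emit one output message per word seen.
        for word in message.split():
            self.counts[word] += 1
            emit((word, self.counts[word]))


def run(task, stream):
    """Toy single-threaded runner standing in for the framework's event loop."""
    output = []
    for message in stream:  # messages are delivered in stream order
        task.process(message, output.append)
    return output


result = run(WordCountTask(), ["a b", "a"])
```

The developer only writes the `process` callback; reading from the input stream and routing emitted messages is left to the runner, which is the part a real framework would provide.
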
> Some of these features are available in S4, and some are not. The same holds
> true for Storm.
>
> The open source stream processing systems that are available are actually
> quite young, and no single system offers a complete solution. Problems like
> how a stream processor's state (e.g. counts) should be managed, whether a
> stream should be buffered remotely on disk or not, what to do when
> duplicate messages are received or messages are lost, and how to model
> underlying messaging systems are all pretty new.
>
> Samza's main differentiators are:
>
>    - State is modeled as a stream. When a processor fails and is restarted,
>    the state stream is entirely replayed to restore it.
>    - Streams are ordered, partitioned, replayable, and fault tolerant.
>    - YARN is used for processor isolation, security, and fault tolerance.
>    - All streams are materialized to Kafka.
>
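The "state is modeled as a stream" point above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how Samza implements it: a Python list stands in for an ordered, replayable changelog stream, and `CountingProcessor`, `process`, and `restore` are hypothetical names.

```python
class CountingProcessor:
    """Hypothetical processor whose state changes are also written to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a replayable, ordered stream
        self.counts = {}

    def process(self, key):
        new_value = self.counts.get(key, 0) + 1
        self.counts[key] = new_value
        # Every state change is also published as a message on the changelog.
        self.changelog.append((key, new_value))

    @classmethod
    def restore(cls, changelog):
        """Rebuild state after a restart by replaying the whole changelog in order."""
        processor = cls(changelog)
        for key, value in changelog:
            processor.counts[key] = value  # later entries overwrite earlier ones
        return processor


changelog = []
original = CountingProcessor(changelog)
for key in ["page_a", "page_b", "page_a"]:
    original.process(key)

# Simulate a failure: discard `original` and rebuild purely from the changelog.
restarted = CountingProcessor.restore(changelog)
```

Because the changelog is ordered, replaying it from the beginning reproduces the same counts the failed processor held, and the restarted processor can keep appending to the same stream.
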
> If you guys are interested, I have much more in-depth documents comparing
> and contrasting Samza with MUPD8 and Storm.
>
> Cheers,
> Chris
>
>
> On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com>
> wrote:
>
> > Looks like this is similar to S4 (http://incubator.apache.org/s4/), which
> > allows stream and real-time data processing via a DAG?
> >
> >
> > - Henry
> >
> >
> > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini....@gmail.com>
> > wrote:
> >
> > > Hey All,
> > >
> > > Sending along an incubator proposal for Samza.
> > >
> > > Thanks!
> > > Chris
> > >
> > > https://wiki.apache.org/incubator/SamzaProposal
> > >
> > > --------------------------------------------
> > >
> > > == Abstract ==
> > >
> > > Samza is a stream processing system for running continuous computation
> > > on infinite streams of data.
> > >
> > > == Proposal ==
> > >
> > > Samza provides a system for processing stream data from
> > > publish-subscribe systems such as Apache Kafka. The developer writes a
> > > stream processing task, and executes it as a Samza job. Samza then routes
> > > messages between stream processing tasks and the publish-subscribe
> > > systems that the messages are addressed to.
> > >
> > > == Background ==
> > >
> > > Samza was developed at LinkedIn to enable easier processing of streaming
> > > data on top of Apache Kafka. Current use cases include content
> > > processing pipelines, aggregating operational log data, data ingestion
> > > into distributed database infrastructure, and measuring user activity
> > > across different aggregation types.
> > >
> > > Samza is focused on providing an easy-to-use framework to process
> > > streams. It uses Apache YARN to provide a mechanism for deploying stream
> > > processing tasks in a distributed cluster. Samza also takes advantage of
> > > YARN to make decisions about stream processor locality and
> > > co-partitioning of streams, and to provide security. Apache Kafka is
> > > also leveraged to provide a mechanism to pass messages from one stream
> > > processor to the next. Apache Kafka is also used to help manage a stream
> > > processor's state, so that it can be recovered in the event of a failure.
> > >
> > > Samza is written in Scala. It was developed internally at LinkedIn to
> > > meet our particular use cases, but will be useful to many organizations
> > > facing a similar need to reliably process large amounts of streaming
> > > data. Therefore, we would like to share it with the ASF and begin
> > > developing a community of developers and users within Apache.
> > >
> > > == Rationale ==
> > >
> > > Many organizations can benefit from a reliable stream processing system
> > > such as Samza. While our use case of processing events from a large
> > > website like LinkedIn has driven the design of Samza, its uses are varied
> > > and we expect many new use cases to emerge. Samza provides a generic API
> > > to process messages from streaming infrastructure and will appeal to many
> > > users.
> > >
> > > == Current Status ==
> > >
> > > === Meritocracy ===
> > >
> > > Our intent with this incubator proposal is to start building a diverse
> > > developer community around Samza following the Apache meritocracy model.
> > > Since Samza was initially developed in late 2011, we have had fast
> > > adoption and contributions by multiple teams at LinkedIn. We plan to
> > > continue support for new contributors and work with those who contribute
> > > significantly to the project to make them committers.
> > >
> > > === Community ===
> > >
> > > Samza is currently being used internally at LinkedIn. We hope to extend
> > > our contributor base significantly and invite all those who are
> > > interested in building large-scale distributed systems to participate.
> > >
> > > === Core Developers ===
> > >
> > > Samza is currently being developed by four engineers at LinkedIn: Jay
> > > Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
> > > ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka
> > > and Giraph. Jay is a member of the Apache Kafka PMC and contributor to
> > > various Apache projects. Chris has been an active contributor for several
> > > projects including Apache Kafka and Apache YARN. Sriram has contributed
> > > to Samza, as well as Apache Kafka.
> > >
> > > === Alignment ===
> > >
> > > The ASF is the natural choice to host the Samza project as its goal of
> > > encouraging community-driven open-source projects fits with our vision
> > > for Samza. Additionally, many other projects with which we are familiar
> > > and with which we expect Samza to integrate, such as Apache ZooKeeper,
> > > YARN, HDFS and log4j, are hosted by the ASF, and we will benefit and
> > > provide benefit by close proximity to them.
> > >
> > > == Known Risks ==
> > >
> > > === Orphaned Products ===
> > >
> > > The core developers plan to work full time on the project. There is very
> > > little risk of Samza being abandoned as it is part of LinkedIn's internal
> > > infrastructure.
> > >
> > > === Inexperience with Open Source ===
> > >
> > > All of the core developers have experience with open source development.
> > > Jay and Chris have been involved with several open source projects
> > > released by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has
> > > been actively involved with the ASF as a full-time Hadoop committer and
> > > PMC member. Sriram is a contributor to Apache Kafka.
> > >
> > > === Homogeneous Developers ===
> > >
> > > The current core developers are all from LinkedIn. However, we hope to
> > > establish a developer community that includes contributors from several
> > > corporations and we are actively encouraging new contributors via the
> > > mailing lists and public presentations of Samza.
> > >
> > > === Reliance on Salaried Developers ===
> > >
> > > Currently, the developers are paid to do work on Samza. However, once the
> > > project has a community built around it, we expect to get committers,
> > > developers and community from outside the current core developers.
> > > However, because LinkedIn relies on Samza internally, the reliance on
> > > salaried developers is unlikely to change.
> > >
> > > === Relationships with Other Apache Products ===
> > >
> > > Samza is deeply integrated with Apache products. Samza uses Apache Kafka
> > > as its underlying message passing system. Samza also uses Apache YARN for
> > > task scheduling. Both YARN and Kafka, in turn, rely on Apache ZooKeeper
> > > for coordination. In addition, we hope to integrate with Apache HDFS in
> > > the near future.
> > >
> > > === An Excessive Fascination with the Apache Brand ===
> > >
> > > While we respect the reputation of the Apache brand and have no doubts
> > > that it will attract contributors and users, our interest is primarily to
> > > give Samza a solid home as an open source project following an
> > > established development model. We have also given reasons in the
> > > Rationale and Alignment sections.
> > >
> > > == Documentation ==
> > >
> > > http://wiki.apache.org/incubator/SamzaProposal
> > >
> > > == Initial Source ==
> > >
> > > Available upon request.
> > >
> > > == External Dependencies ==
> > >
> > > The dependencies all have Apache compatible licenses.
> > >
> > >  * metrics (Apache 2.0)
> > >  * zkclient (Apache 2.0)
> > >  * zookeeper (Apache 2.0)
> > >  * jetty (Apache 2.0)
> > >  * jackson (Apache 2.0)
> > >  * commons-httpclient (Apache 2.0)
> > >  * slf4j (MIT)
> > >  * avro (Apache 2.0)
> > >  * hadoop (Apache 2.0)
> > >  * junit (Common Public License)
> > >  * grizzled-slf4j (BSD)
> > >  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
> > >  * scala (http://www.scala-lang.org/node/146)
> > >  * joptsimple (MIT)
> > >  * kafka (Apache 2.0)
> > >  * scalate (Apache 2.0)
> > >  * leveldb jni (BSD)
> > >
> > > == Cryptography ==
> > >
> > > Samza will depend on secure Hadoop, which can optionally use Kerberos.
> > >
> > > == Required Resources ==
> > >
> > > === Mailing Lists ===
> > >
> > > samza-private for private PMC discussions (with moderated subscriptions)
> > > samza-dev
> > > samza-commits
> > > samza-user
> > >
> > > === Subversion Directory ===
> > >
> > > Git is the preferred source control system: git://git.apache.org/samza
> > >
> > > === Issue Tracking ===
> > >
> > > JIRA Samza (SAMZA)
> > >
> > > === Other Resources ===
> > >
> > > The existing code already has unit tests, so we would like a Hudson
> > > instance to run them whenever a new patch is submitted. This can be
> > > added after project creation.
> > >
> > > == Initial Committers ==
> > >
> > >  * Jay Kreps
> > >  * Jakob Homan
> > >  * Chris Riccomini
> > >  * Sriram Subramanian
> > >
> > > == Affiliations ==
> > >
> > >  * Jay Kreps (LinkedIn)
> > >  * Jakob Homan (LinkedIn)
> > >  * Chris Riccomini (LinkedIn)
> > >  * Sriram Subramanian (LinkedIn)
> > >
> > > == Sponsors ==
> > >
> > > === Champion ===
> > >
> > > Jakob Homan (Apache Member)
> > >
> > > === Nominated Mentors ===
> > >
> > >  * Arun C Murthy <acmurthy at apache dot org>
> > >  * Chris Douglas <cdouglas at apache dot org>
> > >  * Roman Shaposhnik <rvs at apache dot org>
> > >
> > > === Sponsoring Entity ===
> > >
> > > We are requesting the Incubator to sponsor this project.
> > >
> >
>
