Re: [PROPOSAL] Samza Proposal

Chris Riccomini Tue, 23 Jul 2013 19:18:39 -0700

Hey Henry and Debo,

Thanks for calling this out. Samza's feature set includes:


   - *Simple API:* Unlike most low-level messaging system APIs, Samza
   provides a very simple call-back based "process message" API that should be
   familiar to anyone who has used Map/Reduce.
   - *Managed state:* Samza manages snapshotting and restoration of a
   stream processor's state. Samza will restore a stream processor's state to
   a snapshot consistent with the processor's last read messages when the
   processor is restarted.
   - *Fault tolerance:* Samza will work with YARN to restart your stream
   processor if there is a machine or processor failure.
   - *Durability:* Samza uses Kafka to guarantee that no messages will ever
   be lost.
   - *Scalability:* Samza is partitioned and distributed at every level.
   Kafka provides ordered, partitioned, replayable, fault-tolerant streams.
   YARN provides a distributed environment for Samza containers to run in.
   - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
   Samza provides a pluggable API that lets you run Samza with other messaging
   systems and execution environments.
   - *Processor isolation:* Samza works with Apache YARN, which supports
   processor security through Hadoop's security model, and resource isolation
   through Linux CGroups.
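
To make the call-back style concrete, here is a minimal sketch in Python. It
is not Samza's actual API: the names `WordCountTask`, `process`, `emit`, and
`run` are hypothetical, and `run` only stands in for the loop a framework
would drive on your behalf.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


class WordCountTask:
    """Hypothetical task; name and signature are illustrative, not Samza's API."""

    def __init__(self) -> None:
        # Local state kept by the task (the kind of state a framework could manage).
        self.counts: Counter = Counter()

    def process(self, message: str, emit: Callable[[str, int], None]) -> None:
        # Called once per incoming message, much like map() in Map/Reduce.
        for word in message.split():
            self.counts[word] += 1
            emit(word, self.counts[word])


def run(task: WordCountTask, messages: Iterable[str]) -> List[Tuple[str, int]]:
    """Stand-in for the framework loop that feeds messages to the task."""
    output: List[Tuple[str, int]] = []
    for message in messages:
        task.process(message, lambda key, value: output.append((key, value)))
    return output


if __name__ == "__main__":
    print(run(WordCountTask(), ["a b", "a"]))
```

The task only implements `process`; everything else (reading input, handing
over messages, collecting output) belongs to the caller, which is the point
of a call-back API.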

Some of these features are available in S4, and some are not. The same holds
true for Storm.

The open source stream processing systems that are available are actually
quite young, and no single system offers a complete solution. Problems like
how a stream processor's state (e.g. counts) should be managed, whether a
stream should be buffered remotely on disk or not, what to do when
duplicate messages are received or messages are lost, and how to model
underlying messaging systems are all pretty new.

Samza's main differentiators are:

   - State is modeled as a stream. When a processor fails and is restarted,
   the state stream is entirely replayed to restore it.
   - Streams are ordered, partitioned, replayable, and fault tolerant.
   - YARN is used for processor isolation, security, and fault tolerance.
   - All streams are materialized to Kafka.
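
The first point, state modeled as a stream, can be sketched roughly as
follows. This is an illustration under assumptions, not Samza's
implementation: the changelog is a plain in-memory list standing in for a
replayable stream, and `CountingProcessor` and `restore` are hypothetical
names.

```python
from typing import Dict, List, Tuple

# Each entry records the new value written for a key.
Changelog = List[Tuple[str, int]]


class CountingProcessor:
    """Hypothetical processor whose state changes are also written to a stream."""

    def __init__(self, changelog: Changelog) -> None:
        self.changelog = changelog
        self.state: Dict[str, int] = {}

    def process(self, key: str) -> None:
        # Update local state and append the change to the state stream.
        new_value = self.state.get(key, 0) + 1
        self.state[key] = new_value
        self.changelog.append((key, new_value))

    @classmethod
    def restore(cls, changelog: Changelog) -> "CountingProcessor":
        # Replay the whole state stream in order; later entries overwrite earlier ones.
        processor = cls(changelog)
        for key, value in changelog:
            processor.state[key] = value
        return processor


if __name__ == "__main__":
    log: Changelog = []
    original = CountingProcessor(log)
    for key in ["a", "b", "a"]:
        original.process(key)
    restarted = CountingProcessor.restore(log)
    print(restarted.state == original.state)
```

Because every state change is appended to the stream, a restarted processor
needs nothing but the stream itself to rebuild its state.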

If you guys are interested, I have much more in-depth documents comparing
and contrasting Samza with MUPD8 and Storm.

Cheers,
Chris


On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:

> Looks like this is similar to S4 (http://incubator.apache.org/s4/), which
> allows stream and real-time data processing via a DAG?
>
>
> - Henry
>
>
> On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini....@gmail.com> wrote:
>
> > Hey All,
> >
> > Sending along an incubator proposal for Samza.
> >
> > Thanks!
> > Chris
> >
> > https://wiki.apache.org/incubator/SamzaProposal
> >
> > --------------------------------------------
> >
> > == Abstract ==
> >
> > Samza is a stream processing system for running continuous computation on
> > infinite streams of data.
> >
> > == Proposal ==
> >
> > Samza provides a system for processing stream data from publish-subscribe
> > systems such as Apache Kafka. The developer writes a stream processing
> > task, and executes it as a Samza job. Samza then routes messages between
> > stream processing tasks and the publish-subscribe systems that the
> > messages are addressed to.
> >
> > == Background ==
> >
> > Samza was developed at LinkedIn to enable easier processing of streaming
> > data on top of Apache Kafka. Current use cases include content processing
> > pipelines, aggregating operational log data, data ingestion into
> > distributed database infrastructure, and measuring user activity across
> > different aggregation types.
> >
> > Samza is focused on providing an easy-to-use framework to process
> > streams. It uses Apache YARN to provide a mechanism for deploying stream
> > processing tasks in a distributed cluster. Samza also takes advantage of
> > YARN to make decisions about stream processor locality and
> > co-partitioning of streams, and to provide security. Apache Kafka is
> > used to pass messages from one stream processor to the next, and to help
> > manage a stream processor's state so that it can be recovered in the
> > event of a failure.
> >
> > Samza is written in Scala. It was developed internally at LinkedIn to
> > meet our particular use cases, but will be useful to many organizations
> > facing a similar need to reliably process large amounts of streaming
> > data. Therefore, we would like to share it with the ASF and begin
> > developing a community of developers and users within Apache.
> >
> > == Rationale ==
> >
> > Many organizations can benefit from a reliable stream processing system
> > such as Samza. While our use case of processing events from a large
> > website like LinkedIn has driven the design of Samza, its uses are varied
> > and we expect many new use cases to emerge. Samza provides a generic API
> > to process messages from streaming infrastructure and will appeal to many
> > users.
> >
> > == Current Status ==
> >
> > === Meritocracy ===
> >
> > Our intent with this incubator proposal is to start building a diverse
> > developer community around Samza following the Apache meritocracy model.
> > Since Samza was initially developed in late 2011, we have had fast
> > adoption and contributions by multiple teams at LinkedIn. We plan to
> > continue support for new contributors and work with those who contribute
> > significantly to the project to make them committers.
> >
> > === Community ===
> >
> > Samza is currently being used internally at LinkedIn. We hope to extend
> > our contributor base significantly and invite all those who are
> > interested in building large-scale distributed systems to participate.
> >
> > === Core Developers ===
> >
> > Samza is currently being developed by four engineers at LinkedIn: Jay
> > Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
> > ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka
> > and Giraph. Jay is a member of the Apache Kafka PMC and contributor to
> > various Apache projects. Chris has been an active contributor for several
> > projects including Apache Kafka and Apache YARN. Sriram has contributed
> > to Samza, as well as Apache Kafka.
> >
> > === Alignment ===
> >
> > The ASF is the natural choice to host the Samza project as its goal of
> > encouraging community-driven open-source projects fits with our vision
> > for Samza. Additionally, many other projects that we are familiar with
> > and expect Samza to integrate with, such as Apache ZooKeeper, YARN, HDFS
> > and log4j, are hosted by the ASF, and we will benefit from and provide
> > benefit through close proximity to them.
> >
> > == Known Risks ==
> >
> > === Orphaned Products ===
> >
> > The core developers plan to work full time on the project. There is very
> > little risk of Samza being abandoned as it is part of LinkedIn's internal
> > infrastructure.
> >
> > === Inexperience with Open Source ===
> >
> > All of the core developers have experience with open source development.
> > Jay and Chris have been involved with several open source projects
> > released by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has
> > been actively involved with the ASF as a full-time Hadoop committer and
> > PMC member. Sriram is a contributor to Apache Kafka.
> >
> > === Homogeneous Developers ===
> >
> > The current core developers are all from LinkedIn. However, we hope to
> > establish a developer community that includes contributors from several
> > corporations, and we are actively encouraging new contributors via the
> > mailing lists and public presentations of Samza.
> >
> > === Reliance on Salaried Developers ===
> >
> > Currently, the developers are paid to do work on Samza. Once the project
> > has a community built around it, we expect to get committers, developers,
> > and community members from outside the current core developers. Because
> > LinkedIn relies on Samza internally, however, the reliance on salaried
> > developers is unlikely to change.
> >
> > === Relationships with Other Apache Products ===
> >
> > Samza is deeply integrated with Apache products. Samza uses Apache Kafka
> > as its underlying message passing system. Samza also uses Apache YARN for
> > task scheduling. Both YARN and Kafka, in turn, rely on Apache ZooKeeper
> > for coordination. In addition, we hope to integrate with Apache HDFS in
> > the near future.
> >
> > === An Excessive Fascination with the Apache Brand ===
> >
> > While we respect the reputation of the Apache brand and have no doubts
> > that it will attract contributors and users, our interest is primarily to
> > give Samza a solid home as an open source project following an
> > established development model. We have also given reasons in the
> > Rationale and Alignment sections.
> >
> > == Documentation ==
> >
> > http://wiki.apache.org/incubator/SamzaProposal
> >
> > == Initial Source ==
> >
> > Available upon request.
> >
> > == External Dependencies ==
> >
> > The dependencies all have Apache compatible licenses.
> >
> >  * metrics (Apache 2.0)
> >  * zkclient (Apache 2.0)
> >  * zookeeper (Apache 2.0)
> >  * jetty (Apache 2.0)
> >  * jackson (Apache 2.0)
> >  * commons-httpclient (Apache 2.0)
> >  * slf4j (MIT)
> >  * avro (Apache 2.0)
> >  * hadoop (Apache 2.0)
> >  * junit (Common Public License)
> >  * grizzled-slf4j (BSD)
> >  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
> >  * scala (http://www.scala-lang.org/node/146)
> >  * joptsimple (MIT)
> >  * kafka (Apache 2.0)
> >  * scalate (Apache 2.0)
> >  * leveldb jni (BSD)
> >
> > == Cryptography ==
> >
> > Samza will depend on secure Hadoop, which can optionally use Kerberos.
> >
> > == Required Resources ==
> >
> > === Mailing Lists ===
> >
> > samza-private for private PMC discussions (with moderated subscriptions)
> > samza-dev
> > samza-commits
> > samza-user
> >
> > === Subversion Directory ===
> >
> > Git is the preferred source control system: git://git.apache.org/samza
> >
> > === Issue Tracking ===
> >
> > JIRA Samza (SAMZA)
> >
> > === Other Resources ===
> >
> > The existing code already has unit tests, so we would like a Hudson
> > instance to run them whenever a new patch is submitted. This can be added
> > after project creation.
> >
> > == Initial Committers ==
> >
> >  * Jay Kreps
> >  * Jakob Homan
> >  * Chris Riccomini
> >  * Sriram Subramanian
> >
> > == Affiliations ==
> >
> >  * Jay Kreps (LinkedIn)
> >  * Jakob Homan (LinkedIn)
> >  * Chris Riccomini (LinkedIn)
> >  * Sriram Subramanian (LinkedIn)
> >
> > == Sponsors ==
> >
> > === Champion ===
> >
> > Jakob Homan (Apache Member)
> >
> > === Nominated Mentors ===
> >
> >  * Arun C Murthy <acmurthy at apache dot org>
> >  * Chris Douglas <cdouglas at apache dot org>
> >  * Roman Shaposhnik <rvs at apache dot org>
> >
> > === Sponsoring Entity ===
> >
> > We are requesting the Incubator to sponsor this project.
> >
>
