Re: [PROPOSAL] Samza Proposal

Phillip Rhodes Wed, 31 Jul 2013 13:31:30 -0700

Same here.  Not that it matters as far as admission to the incubator
(that vote is over now anyway), but I think a lot of people (including
potential users of Samza) would like to see more about how it compares
& contrasts with other stream oriented systems.



Phil
This message optimized for indexing by NSA PRISM


On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu <akaras...@apache.org> wrote:
> +1
>
> I would love to see the "documents comparing and contrasting Samza with
> MUPD8 and Storm."
>
>
> On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar <e...@apache.org> wrote:
>
>> +1 on incubation.
>>
>> Enis
>>
>>
>> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
>> <criccomini....@gmail.com>wrote:
>>
>> > Hey Henry and Debo,
>> >
>> > Thanks for calling this out. Samza's feature set includes:
>> >
>> >    - *Simple API:* Unlike most low-level messaging system APIs, Samza
>> >    provides a very simple call-back based "process message" API that
>> >    should be familiar to anyone who has used Map/Reduce.
>> >    - *Managed state:* Samza manages snapshotting and restoration of a
>> >    stream processor's state. Samza will restore a stream processor's
>> >    state to a snapshot consistent with the processor's last read
>> >    messages when the processor is restarted.
>> >    - *Fault tolerance:* Samza will work with YARN to restart your stream
>> >    processor if there is a machine or processor failure.
>> >    - *Durability:* Samza uses Kafka to guarantee that no messages will
>> >    ever be lost.
>> >    - *Scalability:* Samza is partitioned and distributed at every level.
>> >    Kafka provides ordered, partitioned, replayable, fault-tolerant
>> >    streams. YARN provides a distributed environment for Samza
>> >    containers to run in.
>> >    - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
>> >    Samza provides a pluggable API that lets you run Samza with other
>> >    messaging systems and execution environments.
>> >    - *Processor isolation:* Samza works with Apache YARN, which supports
>> >    processor security through Hadoop's security model, and resource
>> >    isolation through Linux CGroups.
>> >
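The call-back style described in the feature list above can be illustrated with a small sketch. It is written in Python for brevity and is not Samza's actual API: the names Message, PageViewCounter, process, emit and run are all hypothetical, and the run function merely stands in for whatever event loop a framework would provide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Message:
    """One message from a partitioned stream (hypothetical shape)."""
    stream: str
    partition: int
    key: str
    value: Any


class PageViewCounter:
    """Hypothetical task: the framework calls process() once per message."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = defaultdict(int)

    def process(self, message: Message, emit: Callable[[Message], None]) -> None:
        # Update local state, then emit the running count downstream.
        self.counts[message.key] += 1
        emit(Message("page-view-counts", message.partition,
                     message.key, self.counts[message.key]))


def run(task: PageViewCounter, messages: Iterable[Message]) -> List[Message]:
    """Stand-in for a framework event loop: feed messages to the task in order."""
    output: List[Message] = []
    for message in messages:
        task.process(message, output.append)
    return output


views = [Message("page-views", 0, "home", None),
         Message("page-views", 0, "jobs", None),
         Message("page-views", 0, "home", None)]
results = run(PageViewCounter(), views)
```

The task author only writes process(); reading from the input stream, ordering within a partition, and delivering emitted messages are left to the surrounding system, which is the part the sketch fakes with a plain loop.
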
>> > Some of these features are available in S4, and some are not. The same
>> > holds true for Storm.
>> >
>> > The open source stream processing systems that are available are actually
>> > quite young, and no single system offers a complete solution. Problems
>> > like how a stream processor's state (e.g. counts) should be managed,
>> > whether a stream should be buffered remotely on disk or not, what to do
>> > when duplicate messages are received or messages are lost, and how to
>> > model underlying messaging systems are all pretty new.
>> >
>> > Samza's main differentiators are:
>> >
>> >    - State is modeled as a stream. When a processor fails and is
>> >    restarted, the state stream is entirely replayed to restore it.
>> >    - Streams are ordered, partitioned, replayable, and fault tolerant.
>> >    - YARN is used for processor isolation, security, and fault tolerance.
>> >    - All streams are materialized to Kafka.
>> >
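The first differentiator above, state modeled as a stream that is replayed on restart, can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not Samza's implementation: CountingProcessor, restore and the in-memory changelog list are hypothetical, and the list stands in for a durable, ordered stream such as a Kafka topic.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, int]  # (key, new value) entry in the state stream


class CountingProcessor:
    """Hypothetical processor whose state changes are also written to a stream."""

    def __init__(self, changelog: List[Change]) -> None:
        self.changelog = changelog  # stand-in for a durable, ordered stream
        self.state: Dict[str, int] = {}

    def restore(self) -> None:
        # Replay the whole state stream in order; later entries overwrite earlier ones.
        self.state = {}
        for key, value in self.changelog:
            self.state[key] = value

    def process(self, key: str) -> None:
        new_value = self.state.get(key, 0) + 1
        self.state[key] = new_value
        # Record the change in the state stream so it survives a failure.
        self.changelog.append((key, new_value))


changelog: List[Change] = []
first = CountingProcessor(changelog)
for key in ["a", "b", "a"]:
    first.process(key)

# Simulate a failure: the in-memory state is gone, the stream survives.
restarted = CountingProcessor(changelog)
restarted.restore()
```

Because every change is appended in order, replaying the stream from the beginning rebuilds the same key/value state the failed processor held, without any separate snapshot store.
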
>> > If you guys are interested, I have much more in-depth documents comparing
>> > and contrasting Samza with MUPD8 and Storm.
>> >
>> > Cheers,
>> > Chris
>> >
>> >
>> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com
>> > >wrote:
>> >
>> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/),
>> > > which allows stream and real time data processing via DAG?
>> > >
>> > >
>> > > - Henry
>> > >
>> > >
>> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <
>> criccomini....@gmail.com
>> > > >wrote:
>> > >
>> > > > Hey All,
>> > > >
>> > > > Sending along an incubator proposal for Samza.
>> > > >
>> > > > Thanks!
>> > > > Chris
>> > > >
>> > > > https://wiki.apache.org/incubator/SamzaProposal
>> > > >
>> > > > --------------------------------------------
>> > > >
>> > > > == Abstract ==
>> > > >
>> > > > Samza is a stream processing system for running continuous
>> > > > computation on infinite streams of data.
>> > > >
>> > > > == Proposal ==
>> > > >
>> > > > Samza provides a system for processing stream data from
>> > > > publish-subscribe systems such as Apache Kafka. The developer writes
>> > > > a stream processing task, and executes it as a Samza job. Samza then
>> > > > routes messages between stream processing tasks and the
>> > > > publish-subscribe systems that the messages are addressed to.
>> > > >
>> > > > == Background ==
>> > > >
>> > > > Samza was developed at LinkedIn to enable easier processing of
>> > > > streaming data on top of Apache Kafka. Current use cases include
>> > > > content processing pipelines, aggregating operational log data, data
>> > > > ingestion into distributed database infrastructure, and measuring
>> > > > user activity across different aggregation types.
>> > > >
>> > > > Samza is focused on providing an easy to use framework to process
>> > > > streams. It uses Apache YARN to provide a mechanism for deploying
>> > > > stream processing tasks in a distributed cluster. Samza also takes
>> > > > advantage of YARN to make decisions about stream processor locality
>> > > > and co-partitioning of streams, and to provide security. Apache Kafka
>> > > > is leveraged to provide a mechanism to pass messages from one stream
>> > > > processor to the next. Apache Kafka is also used to help manage a
>> > > > stream processor's state, so that it can be recovered in the event of
>> > > > a failure.
>> > > >
>> > > > Samza is written in Scala. It was developed internally at LinkedIn to
>> > > > meet our particular use cases, but will be useful to many
>> > > > organizations facing a similar need to reliably process large amounts
>> > > > of streaming data. Therefore, we would like to share it with the ASF
>> > > > and begin developing a community of developers and users within
>> > > > Apache.
>> > > >
>> > > > == Rationale ==
>> > > >
>> > > > Many organizations can benefit from a reliable stream processing
>> > > > system such as Samza. While our use case of processing events from a
>> > > > large website like LinkedIn has driven the design of Samza, its uses
>> > > > are varied and we expect many new use cases to emerge. Samza provides
>> > > > a generic API to process messages from streaming infrastructure and
>> > > > will appeal to many users.
>> > > >
>> > > > == Current Status ==
>> > > >
>> > > > === Meritocracy ===
>> > > >
>> > > > Our intent with this incubator proposal is to start building a
>> > > > diverse developer community around Samza following the Apache
>> > > > meritocracy model. Since Samza was initially developed in late 2011,
>> > > > we have had fast adoption and contributions by multiple teams at
>> > > > LinkedIn. We plan to continue support for new contributors and work
>> > > > with those who contribute significantly to the project to make them
>> > > > committers.
>> > > >
>> > > > === Community ===
>> > > >
>> > > > Samza is currently being used internally at LinkedIn. We hope to
>> > > > extend our contributor base significantly and invite all those who
>> > > > are interested in building large-scale distributed systems to
>> > > > participate.
>> > > >
>> > > > === Core Developers ===
>> > > >
>> > > > Samza is currently being developed by four engineers at LinkedIn: Jay
>> > > > Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is
>> > > > an ASF Member, Incubator PMC member and PMC member on Apache Hadoop,
>> > > > Kafka and Giraph. Jay is a member of the Apache Kafka PMC and
>> > > > contributor to various Apache projects. Chris has been an active
>> > > > contributor for several projects including Apache Kafka and Apache
>> > > > YARN. Sriram has contributed to Samza, as well as Apache Kafka.
>> > > >
>> > > > === Alignment ===
>> > > >
>> > > > The ASF is the natural choice to host the Samza project as its goal
>> > > > of encouraging community-driven open-source projects fits with our
>> > > > vision for Samza. Additionally, many other projects with which we are
>> > > > familiar, and with which we expect Samza to integrate, such as Apache
>> > > > ZooKeeper, YARN, HDFS and log4j, are hosted by the ASF, and we will
>> > > > benefit and provide benefit by close proximity to them.
>> > > >
>> > > > == Known Risks ==
>> > > >
>> > > > === Orphaned Products ===
>> > > >
>> > > > The core developers plan to work full time on the project. There is
>> > > > very little risk of Samza being abandoned as it is part of LinkedIn's
>> > > > internal infrastructure.
>> > > >
>> > > > === Inexperience with Open Source ===
>> > > >
>> > > > All of the core developers have experience with open source
>> > > > development. Jay and Chris have been involved with several open
>> > > > source projects released by LinkedIn, and Jay is a committer on
>> > > > Apache Kafka. Jakob has been actively involved with the ASF as a
>> > > > full-time Hadoop committer and PMC member. Sriram is a contributor to
>> > > > Apache Kafka.
>> > > >
>> > > > === Homogeneous Developers ===
>> > > >
>> > > > The current core developers are all from LinkedIn. However, we hope
>> > > > to establish a developer community that includes contributors from
>> > > > several corporations, and we are actively encouraging new
>> > > > contributors via the mailing lists and public presentations of Samza.
>> > > >
>> > > > === Reliance on Salaried Developers ===
>> > > >
>> > > > Currently, the developers are paid to do work on Samza. However, once
>> > > > the project has a community built around it, we expect to get
>> > > > committers, developers and community from outside the current core
>> > > > developers. However, because LinkedIn relies on Samza internally, the
>> > > > reliance on salaried developers is unlikely to change.
>> > > >
>> > > > === Relationships with Other Apache Products ===
>> > > >
>> > > > Samza is deeply integrated with Apache products. Samza uses Apache
>> > > > Kafka as its underlying message passing system. Samza also uses
>> > > > Apache YARN for task scheduling. Both YARN and Kafka, in turn, rely
>> > > > on Apache ZooKeeper for coordination. In addition, we hope to
>> > > > integrate with Apache HDFS in the near future.
>> > > >
>> > > > === An Excessive Fascination with the Apache Brand ===
>> > > >
>> > > > While we respect the reputation of the Apache brand and have no
>> > > > doubts that it will attract contributors and users, our interest is
>> > > > primarily to give Samza a solid home as an open source project
>> > > > following an established development model. We have also given
>> > > > reasons in the Rationale and Alignment sections.
>> > > >
>> > > > == Documentation ==
>> > > >
>> > > > http://wiki.apache.org/incubator/SamzaProposal
>> > > >
>> > > > == Initial Source ==
>> > > >
>> > > > Available upon request.
>> > > >
>> > > > == External Dependencies ==
>> > > >
>> > > > The dependencies all have Apache compatible licenses.
>> > > >
>> > > >  * metrics (Apache 2.0)
>> > > >  * zkclient (Apache 2.0)
>> > > >  * zookeeper (Apache 2.0)
>> > > >  * jetty (Apache 2.0)
>> > > >  * jackson (Apache 2.0)
>> > > >  * commons-httpclient (Apache 2.0)
>> > > >  * slf4j (MIT)
>> > > >  * avro (Apache 2.0)
>> > > >  * hadoop (Apache 2.0)
>> > > >  * junit (Common Public License)
>> > > >  * grizzled-slf4j (BSD)
>> > > >  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
>> > > >  * scala (http://www.scala-lang.org/node/146)
>> > > >  * joptsimple (MIT)
>> > > >  * kafka (Apache 2.0)
>> > > >  * scalate (Apache 2.0)
>> > > >  * leveldb jni (BSD)
>> > > >
>> > > > == Cryptography ==
>> > > >
>> > > > Samza will depend on secure Hadoop, which can optionally use
>> > > > Kerberos.
>> > > >
>> > > > == Required Resources ==
>> > > >
>> > > > === Mailing Lists ===
>> > > >
>> > > > samza-private for private PMC discussions (with moderated subscriptions)
>> > > > samza-dev
>> > > > samza-commits
>> > > > samza-user
>> > > >
>> > > > === Subversion Directory ===
>> > > >
>> > > > Git is the preferred source control system: git://git.apache.org/samza
>> > > >
>> > > > === Issue Tracking ===
>> > > >
>> > > > JIRA Samza (SAMZA)
>> > > >
>> > > > === Other Resources ===
>> > > >
>> > > > The existing code already has unit tests, so we would like a Hudson
>> > > > instance to run them whenever a new patch is submitted. This can be
>> > > > added after project creation.
>> > > >
>> > > > == Initial Committers ==
>> > > >
>> > > >  * Jay Kreps
>> > > >  * Jakob Homan
>> > > >  * Chris Riccomini
>> > > >  * Sriram Subramanian
>> > > >
>> > > > == Affiliations ==
>> > > >
>> > > >  * Jay Kreps (LinkedIn)
>> > > >  * Jakob Homan (LinkedIn)
>> > > >  * Chris Riccomini (LinkedIn)
>> > > >  * Sriram Subramanian (LinkedIn)
>> > > >
>> > > > == Sponsors ==
>> > > >
>> > > > === Champion ===
>> > > >
>> > > > Jakob Homan (Apache Member)
>> > > >
>> > > > === Nominated Mentors ===
>> > > >
>> > > >  * Arun C Murthy <acmurthy at apache dot org>
>> > > >  * Chris Douglas <cdouglas at apache dot org>
>> > > >  * Roman Shaposhnik <rvs at apache dot org>
>> > > >
>> > > > === Sponsoring Entity ===
>> > > >
>> > > > We are requesting the Incubator to sponsor this project.
>> > > >
>> > >
>> >
>>
>
>
>
> --
> Best Regards,
> -- Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
