Re: [PROPOSAL] Samza Proposal

Henry Saputra Tue, 23 Jul 2013 18:49:36 -0700

Looks like this is similar to S4 (http://incubator.apache.org/s4/) which
allow stream and real time data processing via DAG?



- Henry


On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini....@gmail.com>wrote:

> Hey All,
>
> Sending along an incubator proposal for Samza.
>
> Thanks!
> Chris
>
> https://wiki.apache.org/incubator/SamzaProposal
>
> --------------------------------------------
>
> == Abstract ==
>
> Samza is a stream processing system for running continuous computation on
> infinite streams of data.
>
> == Proposal ==
>
> Samza provides a system for processing stream data from publish-subscribe
> systems such as Apache Kafka. The developer writes a stream processing
> task, and executes it as a Samza job. Samza then routes messages between
> stream processing tasks and the publish-subscribe systems that the messages
> are addressed to.
>
> == Background ==
>
> Samza was developed at LinkedIn to enable easier processing of streaming
> data on top of Apache Kafka. Current use cases include content processing
> pipelines, aggregating operational log data, data ingestion into
> distributed database infrastructure, and measuring user activity across
> different aggregation types.
>
> Samza is focused on providing an easy to use framework to process streams.
> It uses Apache YARN to provide a mechanism for deploying stream processing
> tasks in a distributed cluster. Samza also takes advantage of YARN to make
> decisions about stream processor locality, co-partition of streams, and
> provide security. Apache Kafka is also leveraged to provide a mechanism to
> pass messages from one stream processor to the next. Apache Kafka is also
> used to help manage a stream processor's state, so that it can be recovered
> in the event of a failure.
>
> Samza is written in Scala. It was developed internally at LinkedIn to meet
> our particular use cases, but will be useful to many organizations facing a
> similar need to reliably process large amounts of streaming data.
> Therefore, we would like to share it the ASF and begin developing a
> community of developers and users within Apache.
>
> == Rationale ==
>
> Many organizations can benefit from a reliable stream processing system
> such as Samza. While our use case of processing events from a large website
> like LinkedIn has driven the design of Samza, its uses are varied and we
> expect many new use cases to emerge. Samza provides a generic API to
> process messages from streaming infrastructure and will appeal to many
> users.
>
> == Current Status ==
>
> === Meritocracy ===
>
> Our intent with this incubator proposal is to start building a diverse
> developer community around Samza following the Apache meritocracy model.
> Since Samza was initially developed in late 2011, we have had fast adoption
> and contributions by multiple teams at LinkedIn. We plan to continue
> support for new contributors and work with those who contribute
> significantly to the project to make them committers.
>
> === Community ===
>
> Samza is currently being used internally at LinkedIn. We hope to extend our
> contributor base significantly and invite all those who are interested in
> building large-scale distributed systems to participate.
>
> === Core Developers ===
>
> Samza is currently being developed by four engineers at LinkedIn: Jay
> Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
> ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka and
> Giraph. Jay is a member of the Apache Kafka PMC and contributor to various
> Apache projects. Chris has been an active contributor for several projects
> including Apache Kafka and Apache YARN. Sriram has contributed to Samza, as
> well as Apache Kafka.
>
> === Alignment ===
>
> The ASF is the natural choice to host the Samza project as its goal of
> encouraging community-driven open-source projects fits with our vision for
> Samza. Additionally, many other projects with which we are familiar with
> and expect Samza to integrate with, such as Apache ZooKeeper, YARN, HDFS
> and log4j are hosted by the ASF and we will benefit and provide benefit by
> close proximity to them.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The core developers plan to work full time on the project. There is very
> little risk of Samza being abandoned as it is part of LinkedIn's internal
> infrastructure.
>
> === Inexperience with Open Source ===
>
> All of the core developers have experience with open source development.
> Jay and Chris has been involved with several open source projects released
> by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has been
> actively involved with the ASF as a full-time Hadoop committer and PMC
> member. Sriram is a contributor to Apache Kafka.
>
> === Homogeneous Developers ===
>
> The current core developers are all from LinkedIn. However, we hope to
> establish a developer community that includes contributors from several
> corporations and we actively encouraging new contributors via the mailing
> lists and public presentations of Samza.
>
> === Reliance on Salaried Developers ===
>
> Currently, the developers are paid to do work on Samza. However, once the
> project has a community built around it, we expect to get committers,
> developers and community from outside the current core developers. However,
> because LinkedIn relies on Samza internally, the reliance on salaried
> developers is unlikely to change.
>
> === Relationships with Other Apache Products ===
>
> Samza is deeply integrated with Apache products. Samza uses Apache Kafka as
> its underlying message passing system. Samza also uses Apache YARN for task
> scheduling. Both YARN and Kafka, in turn, rely on Apache ZooKeeper for
> coordination. In addition, we hope to integrate with Apache HDFS in the
> near future.
>
> === An Excessive Fascination with the Apache Brand ===
>
> While we respect the reputation of the Apache brand and have no doubts that
> it will attract contributors and users, our interest is primarily to give
> Samza a solid home as an open source project following an established
> development model. We have also given reasons in the Rationale and
> Alignment sections.
>
> == Documentation ==
>
> http://wiki.apache.org/incubator/SamzaProposal
>
> == Initial Source ==
>
> Available upon request.
>
> == External Dependencies ==
>
> The dependencies all have Apache compatible licenses.
>
>  * metrics (Apache 2.0)
>  * zkclient (Apache 2.0)
>  * zookeeper (Apache 2.0)
>  * jetty (Apache 2.0)
>  * jackson (Apache 2.0)
>  * commons-httpclient (Apache 2.0)
>  * slf4j (MIT)
>  * avro (Apache 2.0)
>  * hadoop (Apache 2.0)
>  * junit (Common Public License)
>  * grizzled-slf4j (BSD)
>  * scalatra (https://github.com/scalatra/scalatra/blob/develop/LICENSE)
>  * scala (http://www.scala-lang.org/node/146)
>  * joptsimple (MIT)
>  * kafka (Apache 2.0)
>  * scalate (Apache 2.0)
>  * leveldb jni (BSD)
>
> == Cryptography ==
>
> Samza will depend on secure Hadoop, which can optionally use Kerberos.
>
> == Required Resources ==
>
> === Mailing Lists ===
>
> samza-private for private PMC discussions (with moderated subscriptions)
> samza-dev
> samza-commits
> samza-user
>
> === Subversion Directory ===
>
> Git is the preferred source control system: git://git.apache.org/samza
>
> === Issue Tracking ===
>
> JIRA Samza (SAMZA)
>
> === Other Resources ===
>
> The existing code already has unit tests, so we would like a Hudson
> instance to run them whenever a new patch is submitted. This can be added
> after project creation.
>
> == Initial Committers ==
>
>  * Jay Kreps
>  * Jakob Homan
>  * Chris Riccomini
>  * Sriram Subramanian
>
> == Affiliations ==
>
>  * Jay Kreps (LinkedIn)
>  * Jakob Homan (LinkedIn)
>  * Chris Riccomini (LinkedIn)
>  * Sriram Subramanian (LinkedIn)
>
> == Sponsors ==
>
> === Champion ===
>
> Jakob Homan (Apache Member)
>
> === Nominated Mentors ===
>
>  * Arun C Murthy <acmurthy at apache dot org>
>  * Chris Douglas <cdouglas at apache dot org>
>  * Roman Shaposhnik <rvs at apache dot org>
>
> === Sponsoring Entity ===
>
> We are requesting the Incubator to sponsor this project.
>

Re: [PROPOSAL] Samza Proposal

Reply via email to