[PROPOSAL] Samza Proposal

2013-07-23 Thread Chris Riccomini
Hey All,

Sending along an incubator proposal for Samza.

Thanks!
Chris

https://wiki.apache.org/incubator/SamzaProposal



== Abstract ==

Samza is a stream processing system for running continuous computation on
infinite streams of data.

== Proposal ==

Samza provides a system for processing stream data from publish-subscribe
systems such as Apache Kafka. The developer writes a stream processing
task, and executes it as a Samza job. Samza then routes messages between
stream processing tasks and the publish-subscribe systems that the messages
are addressed to.
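
As a rough illustration of the model described above, the following is a minimal in-memory sketch in Python. It is not Samza's API: the names `WordLengthTask`, `run_job`, and `emit`, and the stream names, are hypothetical and chosen only for this example. In-memory queues stand in for the publish-subscribe system. The task sees one message at a time and addresses each output message to a named stream; the runner does the routing.

```python
from collections import defaultdict, deque


class WordLengthTask:
    """Hypothetical task: reads words and emits their lengths."""

    def process(self, message, emit):
        # Called once per incoming message; emit() addresses an output stream.
        emit("word-lengths", len(message))


def run_job(task, streams, input_stream):
    """Feed every message on input_stream to the task and route its output."""

    def emit(stream_name, message):
        # Routing step: append the message to the stream it is addressed to.
        streams[stream_name].append(message)

    while streams[input_stream]:
        task.process(streams[input_stream].popleft(), emit)


# In-memory stand-in for named publish-subscribe streams.
streams = defaultdict(deque)
streams["words"].extend(["stream", "processing"])
run_job(WordLengthTask(), streams, "words")
print(list(streams["word-lengths"]))  # [6, 10]
```

The task never touches the transport directly, so the same task logic could in principle run against a different messaging system.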

== Background ==

Samza was developed at LinkedIn to enable easier processing of streaming
data on top of Apache Kafka. Current use cases include content processing
pipelines, aggregating operational log data, data ingestion into
distributed database infrastructure, and measuring user activity across
different aggregation types.

Samza is focused on providing an easy-to-use framework to process streams.
It uses Apache YARN to provide a mechanism for deploying stream processing
tasks in a distributed cluster. Samza also takes advantage of YARN to make
decisions about stream processor locality and co-partitioning of streams, and
to provide security. Apache Kafka is leveraged to provide a mechanism to
pass messages from one stream processor to the next. Apache Kafka is also
used to help manage a stream processor's state, so that it can be recovered
in the event of a failure.

Samza is written in Scala. It was developed internally at LinkedIn to meet
our particular use cases, but will be useful to many organizations facing a
similar need to reliably process large amounts of streaming data.
Therefore, we would like to share it with the ASF and begin developing a
community of developers and users within Apache.

== Rationale ==

Many organizations can benefit from a reliable stream processing system
such as Samza. While our use case of processing events from a large website
like LinkedIn has driven the design of Samza, its uses are varied and we
expect many new use cases to emerge. Samza provides a generic API to
process messages from streaming infrastructure and will appeal to many
users.

== Current Status ==

=== Meritocracy ===

Our intent with this incubator proposal is to start building a diverse
developer community around Samza following the Apache meritocracy model.
Since Samza was initially developed in late 2011, we have had fast adoption
and contributions by multiple teams at LinkedIn. We plan to continue
support for new contributors and work with those who contribute
significantly to the project to make them committers.

=== Community ===

Samza is currently being used internally at LinkedIn. We hope to extend our
contributor base significantly and invite all those who are interested in
building large-scale distributed systems to participate.

=== Core Developers ===

Samza is currently being developed by four engineers at LinkedIn: Jay
Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka and
Giraph. Jay is a member of the Apache Kafka PMC and contributor to various
Apache projects. Chris has been an active contributor to several projects
including Apache Kafka and Apache YARN. Sriram has contributed to Samza, as
well as Apache Kafka.

=== Alignment ===

The ASF is the natural choice to host the Samza project as its goal of
encouraging community-driven open-source projects fits with our vision for
Samza. Additionally, many other projects with which we are familiar, and
with which we expect Samza to integrate, such as Apache ZooKeeper, YARN,
HDFS, and log4j, are hosted by the ASF, and we will both benefit from and
provide benefit through close proximity to them.

== Known Risks ==

=== Orphaned Products ===

The core developers plan to work full time on the project. There is very
little risk of Samza being abandoned as it is part of LinkedIn's internal
infrastructure.

=== Inexperience with Open Source ===

All of the core developers have experience with open source development.
Jay and Chris have been involved with several open source projects released
by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has been
actively involved with the ASF as a full-time Hadoop committer and PMC
member. Sriram is a contributor to Apache Kafka.

=== Homogeneous Developers ===

The current core developers are all from LinkedIn. However, we hope to
establish a developer community that includes contributors from several
corporations, and we are actively encouraging new contributors via the mailing
lists and public presentations of Samza.

=== Reliance on Salaried Developers ===

Currently, the developers are paid to do work on Samza. However, once the
project has a community built around it, we expect to get committers,
developers and community from outside the current core developers. However,
because LinkedIn relies on Samza internally, the reliance on 

Re: [PROPOSAL] Samza Proposal

2013-07-23 Thread Henry Saputra
Looks like this is similar to S4 (http://incubator.apache.org/s4/), which
allows stream and real-time data processing via a DAG?


- Henry



Re: [PROPOSAL] Samza Proposal

2013-07-23 Thread Debo Dutta (dedutta)
Also add Storm to the mix. Storm also allows you to do back edges.

debo


Re: [PROPOSAL] Samza Proposal

2013-07-23 Thread Chris Riccomini
Hey Henry and Debo,

Thanks for calling this out. Samza's feature set includes:

   - *Simple API:* Unlike most low-level messaging system APIs, Samza
   provides a very simple callback-based process message API that should be
   familiar to anyone who has used Map/Reduce.
   - *Managed state:* Samza manages snapshotting and restoration of a
   stream processor's state. Samza will restore a stream processor's state to
   a snapshot consistent with the processor's last read messages when the
   processor is restarted.
   - *Fault tolerance:* Samza will work with YARN to restart your stream
   processor if there is a machine or processor failure.
   - *Durability:* Samza uses Kafka to guarantee that no messages will ever
   be lost.
   - *Scalability:* Samza is partitioned and distributed at every level.
   Kafka provides ordered, partitioned, replayable, fault-tolerant streams.
   YARN provides a distributed environment for Samza containers to run in.
   - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
   Samza provides a pluggable API that lets you run Samza with other messaging
   systems and execution environments.
   - *Processor isolation:* Samza works with Apache YARN, which supports
   processor security through Hadoop's security model, and resource isolation
   through Linux CGroups.
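
To make the partitioning point in the list above concrete, here is a small Python sketch of key-based partitioning. It is an illustration under stated assumptions, not Kafka's or Samza's actual partitioner: the function names and the CRC32-modulo scheme are hypothetical choices for this example. It shows that when every message with a given key goes to the same partition, per-key order survives even though the stream is split up.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stable hash (assumed scheme): the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(messages, num_partitions):
    """Split (key, value) messages into partitions, keeping arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [("alice", 1), ("bob", 1), ("alice", 2), ("bob", 2), ("alice", 3)]
partitions = partition_stream(events, num_partitions=4)

# Every "alice" event sits in one partition, in its original order.
alice_partition = partitions[partition_for("alice", 4)]
print([value for key, value in alice_partition if key == "alice"])  # [1, 2, 3]
```

Each partition can then be consumed by a separate processor instance, which is what lets the work spread across machines without reordering the events for any one key.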

Some of these features are available in S4, and some are not. The same holds
true for Storm.

The open source stream processing systems that are available are actually
quite young, and no single system offers a complete solution. Problems like
how a stream processor's state (e.g. counts) should be managed, whether a
stream should be buffered remotely on disk or not, what to do when
duplicate messages are received or messages are lost, and how to model
underlying messaging systems are all pretty new.

Samza's main differentiators are:

   - State is modeled as a stream. When a processor fails and is restarted,
   the state stream is entirely replayed to restore it.
   - Streams are ordered, partitioned, replayable, and fault tolerant.
   - YARN is used for processor isolation, security, and fault tolerance.
   - All streams are materialized to Kafka.
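
The first differentiator, state modeled as a stream, can be sketched in a few lines of Python. This is a simplified illustration, not Samza code: `CountingProcessor`, its methods, and the plain list standing in for a durable Kafka topic are all assumptions made for the example. Every state change is appended to a changelog, and a restarted processor rebuilds its state by replaying that changelog from the beginning.

```python
class CountingProcessor:
    """Hypothetical processor that counts messages per key."""

    def __init__(self, changelog):
        # Append-only list standing in for a durable, replayable stream.
        self.changelog = changelog
        self.counts = {}

    def process(self, key):
        new_count = self.counts.get(key, 0) + 1
        self.counts[key] = new_count
        # Record the state change so it can be replayed after a failure.
        self.changelog.append((key, new_count))

    @classmethod
    def restore(cls, changelog):
        """Rebuild state by replaying the whole changelog in order."""
        processor = cls(changelog)
        for key, value in changelog:
            processor.counts[key] = value  # later entries overwrite earlier ones
        return processor


changelog = []
original = CountingProcessor(changelog)
for page in ["page-a", "page-b", "page-a"]:
    original.process(page)

# Simulate a failure: discard the processor, keep only the changelog.
restored = CountingProcessor.restore(changelog)
print(restored.counts)  # {'page-a': 2, 'page-b': 1}
```

Because the changelog is itself just a stream, the same ordering, partitioning, and replay guarantees that apply to input streams apply to state recovery as well.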

If you guys are interested, I have much more in-depth documents comparing
and contrasting Samza with MUPD8 and Storm.

Cheers,
Chris

