Re: [PROPOSAL] Samza Proposal

2013-08-02 Thread Jakob Homan
For those interested in more detail, the site is now up:
http://samza.incubator.apache.org/  As is the dev list,
http://samza.incubator.apache.org/community/mailing-lists.html, for further
discussion.

Thanks.
-Jakob

On Wed, Jul 31, 2013 at 3:15 PM, Henry Saputra wrote:

> NP
>
> Good luck and congrats
>
> - Henry
>
> On Wednesday, July 31, 2013, Chris Riccomini wrote:
>
> > Hey Guys,
> >
> > Jakob (the project Champion) is in the process of getting all of the
> > resources requested in our proposal (JIRA, Hudson, webspace, etc).
> >
> > As soon as we have webspace allocated, we'll put the Samza site up, which
> > has all of these docs on it. Henry, as you said, I'll follow up with this
> > thread when they're up.
> >
> > Cheers,
> > Chris
> >
> >
> > On Wed, Jul 31, 2013 at 2:04 PM, Henry Saputra wrote:
> >
> > > Well, usually VOTE is conducted after discussion had calmed down. Looks
> > > like this time the VOTE starts even though there were some questions
> > > about the proposal.
> > >
> > > Would be great to actually add links to the comparisons in the thread
> > > even though the VOTE had concluded.
> > >
> > > - Henry
> > >
> > >
> > > On Wed, Jul 31, 2013 at 1:29 PM, Phillip Rhodes
> > > wrote:
> > >
> > > > > Same here.  Not that it matters as far as admission to the incubator
> > > > > (that vote is over now anyway), but I think a lot of people
> > > > > (including potential users of Samza) would like to see more about
> > > > > how it compares & contrasts with other stream oriented systems.
> > > >
> > > >
> > > > Phil
> > > > This message optimized for indexing by NSA PRISM
> > > >
> > > >
> > > > > On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> > > > > > +1
> > > > > >
> > > > > > I would love to see the "documents comparing and contrasting Samza
> > > > > > with MUPD8 and Storm."
> > > > > >
> > > > > >
> > > > > > On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
> > > > >
> > > > >> +1 on incubation.
> > > > >>
> > > > >> Enis
> > > > >>
> > > > >>
> > > > >> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
> > > > >> wrote:
> > > > >>
> > > > >> > Hey Henry and Debo,
> > > > >> >
> > > > >> > Thanks for calling this out. Samza's feature set includes:
> > > > >> >
> > > > > >> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
> > > > > >> >   provides a very simple call-back based "process message" API that
> > > > > >> >   should be familiar to anyone that's used Map/Reduce.
> > > > > >> > - *Managed state:* Samza manages snapshotting and restoration of a
> > > > > >> >   stream processor's state. Samza will restore a stream processor's
> > > > > >> >   state to a snapshot consistent with the processor's last read
> > > > > >> >   messages when the processor is restarted.
> > > > > >> > - *Fault tolerance:* Samza will work with YARN to restart your stream
> > > > > >> >   processor if there is a machine or processor failure.
> > > > > >> > - *Durability:* Samza uses Kafka to guarantee that no messages will
> > > > > >> >   ever be lost.
> > > > > >> > - *Scalability:* Samza is partitioned and distributed at every level.
> > > > > >> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
> > > > > >> >   streams. YARN provides a distributed environment for Samza
> > > > > >> >   containers to run in.
> > > > > >> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> > > > > >> >   Samza provides a pluggable API that lets you run Samza with other
> > > > > >> >   messaging systems and execution environments.
> > > > > >> > - *Processor isolation:* Samza works with Apache YARN, which supports
> > > > > >> >   processor security through Hadoop's security model, and resource
> > > > > >> >   isolation through Linux CGroups.
> > > > > >> >
> > > > > >> > Some of these features are available in S4, and some are not. The same
> > > > > >> > holds true for Storm.
> > > > > >> >
> > > > > >> > The open source stream processing systems that are available are
> > > > > >> > actually quite young, and no single system offers a c
>


Re: [PROPOSAL] Samza Proposal

2013-07-31 Thread Henry Saputra
NP

Good luck and congrats

- Henry

On Wednesday, July 31, 2013, Chris Riccomini wrote:

> Hey Guys,
>
> Jakob (the project Champion) is in the process of getting all of the
> resources requested in our proposal (JIRA, Hudson, webspace, etc).
>
> As soon as we have webspace allocated, we'll put the Samza site up, which
> has all of these docs on it. Henry, as you said, I'll follow up with this
> thread when they're up.
>
> Cheers,
> Chris
>
>
> On Wed, Jul 31, 2013 at 2:04 PM, Henry Saputra wrote:
>
> > Well, usually VOTE is conducted after discussion had calmed down. Looks
> > like this time the VOTE starts even though there were some questions about
> > the proposal.
> >
> > Would be great to actually add links to the comparisons in the thread
> > even though the VOTE had concluded.
> >
> > - Henry
> >
> >
> > On Wed, Jul 31, 2013 at 1:29 PM, Phillip Rhodes
> > wrote:
> >
> > > Same here.  Not that it matters as far as admission to the incubator
> > > (that vote is over now anyway), but I think a lot of people (including
> > > potential users of Samza) would like to see more about how it compares
> > > & contrasts with other stream oriented systems.
> > >
> > >
> > > Phil
> > > This message optimized for indexing by NSA PRISM
> > >
> > >
> > > On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> > > > +1
> > > >
> > > > I would love to see the "documents comparing and contrasting Samza
> > > > with MUPD8 and Storm."
> > > >
> > > >
> > > > On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
> > > >
> > > >> +1 on incubation.
> > > >>
> > > >> Enis
> > > >>
> > > >>
> > > >> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
> > > >> wrote:
> > > >>
> > > >> > Hey Henry and Debo,
> > > >> >
> > > >> > Thanks for calling this out. Samza's feature set includes:
> > > >> >
> > > > >> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
> > > > >> >   provides a very simple call-back based "process message" API that
> > > > >> >   should be familiar to anyone that's used Map/Reduce.
> > > > >> > - *Managed state:* Samza manages snapshotting and restoration of a
> > > > >> >   stream processor's state. Samza will restore a stream processor's
> > > > >> >   state to a snapshot consistent with the processor's last read
> > > > >> >   messages when the processor is restarted.
> > > > >> > - *Fault tolerance:* Samza will work with YARN to restart your stream
> > > > >> >   processor if there is a machine or processor failure.
> > > > >> > - *Durability:* Samza uses Kafka to guarantee that no messages will
> > > > >> >   ever be lost.
> > > > >> > - *Scalability:* Samza is partitioned and distributed at every level.
> > > > >> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
> > > > >> >   streams. YARN provides a distributed environment for Samza
> > > > >> >   containers to run in.
> > > > >> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> > > > >> >   Samza provides a pluggable API that lets you run Samza with other
> > > > >> >   messaging systems and execution environments.
> > > > >> > - *Processor isolation:* Samza works with Apache YARN, which supports
> > > > >> >   processor security through Hadoop's security model, and resource
> > > > >> >   isolation through Linux CGroups.
> > > > >> >
> > > > >> > Some of these features are available in S4, and some are not. The same
> > > > >> > holds true for Storm.
> > > > >> >
> > > > >> > The open source stream processing systems that are available are
> > > > >> > actually quite young, and no single system offers a c


Re: [PROPOSAL] Samza Proposal

2013-07-31 Thread Chris Riccomini
Hey Guys,

Jakob (the project Champion) is in the process of getting all of the
resources requested in our proposal (JIRA, Hudson, webspace, etc).

As soon as we have webspace allocated, we'll put the Samza site up, which
has all of these docs on it. Henry, as you said, I'll follow up with this
thread when they're up.

Cheers,
Chris


On Wed, Jul 31, 2013 at 2:04 PM, Henry Saputra wrote:

> Well, usually VOTE is conducted after discussion had calmed down. Looks
> like this time the VOTE starts even though there were some questions about
> the proposal.
>
> Would be great to actually add links to the comparisons in the thread even
> though the VOTE had concluded.
>
> - Henry
>
>
> On Wed, Jul 31, 2013 at 1:29 PM, Phillip Rhodes
> wrote:
>
> > Same here.  Not that it matters as far as admission to the incubator
> > (that vote is over now anyway), but I think a lot of people (including
> > potential users of Samza) would like to see more about how it compares
> > & contrasts with other stream oriented systems.
> >
> >
> > Phil
> > This message optimized for indexing by NSA PRISM
> >
> >
> > On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> > > +1
> > >
> > > I would love to see the "documents comparing and contrasting Samza with
> > > MUPD8 and Storm."
> > >
> > >
> > > On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
> > >
> > >> +1 on incubation.
> > >>
> > >> Enis
> > >>
> > >>
> > >> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
> > >> wrote:
> > >>
> > >> > Hey Henry and Debo,
> > >> >
> > >> > Thanks for calling this out. Samza's feature set includes:
> > >> >
> > >> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
> > >> >   provides a very simple call-back based "process message" API that
> > >> >   should be familiar to anyone that's used Map/Reduce.
> > >> > - *Managed state:* Samza manages snapshotting and restoration of a
> > >> >   stream processor's state. Samza will restore a stream processor's
> > >> >   state to a snapshot consistent with the processor's last read
> > >> >   messages when the processor is restarted.
> > >> > - *Fault tolerance:* Samza will work with YARN to restart your stream
> > >> >   processor if there is a machine or processor failure.
> > >> > - *Durability:* Samza uses Kafka to guarantee that no messages will
> > >> >   ever be lost.
> > >> > - *Scalability:* Samza is partitioned and distributed at every level.
> > >> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
> > >> >   streams. YARN provides a distributed environment for Samza
> > >> >   containers to run in.
> > >> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> > >> >   Samza provides a pluggable API that lets you run Samza with other
> > >> >   messaging systems and execution environments.
> > >> > - *Processor isolation:* Samza works with Apache YARN, which supports
> > >> >   processor security through Hadoop's security model, and resource
> > >> >   isolation through Linux CGroups.
> > >> >
> > >> > Some of these features are available in S4, and some are not. The same
> > >> > holds true for Storm.
> > >> >
> > >> > The open source stream processing systems that are available are
> > >> > actually quite young, and no single system offers a complete solution.
> > >> > Problems like how a stream processor's state (e.g. counts) should be
> > >> > managed, whether a stream should be buffered remotely on disk or not,
> > >> > what to do when duplicate messages are received or messages are lost,
> > >> > and how to model underlying messaging systems are all pretty new.
> > >> >
> > >> > Samza's main differentiators are:
> > >> >
> > >> > - State is modeled as a stream. When a processor fails and is
> > >> >   restarted, the state stream is entirely replayed to restore it.
> > >> > - Streams are ordered, partitioned, replayable, and fault tolerant.
> > >> > - YARN is used for processor isolation, security, and fault tolerance.
> > >> > - All streams are materialized to Kafka.
> > >> >
> > >> > If you guys are interested, I have much more in-depth documents
> > >> > comparing and contrasting Samza with MUPD8 and Storm.
> > >> >
> > >> > Cheers,
> > >> > Chris
> > >> >
> > >> >
> > >> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> > >> >
> > >> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/)
> > >> > > which allows stream and real time data processing via DAG?
> > >> > >
> > >> > >
> > >> > > - Henry
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini@gmail.com> wrote:
> > >> > >
> > >> > > > Hey All,
> > >> > > >
> > >> > > > Sending along an incubator proposal for Samza.
> > >> > > >
> > >> > > > Thanks!
> > >> > > > Chris

Re: [PROPOSAL] Samza Proposal

2013-07-31 Thread Jakob Homan
Sorry, Henry, but no, there was no question about the proposal's
suitability for incubation, just interest in the technology itself.  Phil
made this clear above, and it should be on its own.  The documents Chris
referred to are embedded as part of the web site which will be up shortly.
 It's great to see such interest - I'm getting the infrastructure taken
care of now.  Site should be up in a day or two.  Thanks everybody.
-jg


On Wed, Jul 31, 2013 at 2:04 PM, Henry Saputra wrote:

> Well, usually VOTE is conducted after discussion had calmed down. Looks
> like this time the VOTE starts even though there were some questions about
> the proposal.
>
> Would be great to actually add links to the comparisons in the thread even
> though the VOTE had concluded.
>
> - Henry
>
>
> On Wed, Jul 31, 2013 at 1:29 PM, Phillip Rhodes
> wrote:
>
> > Same here.  Not that it matters as far as admission to the incubator
> > (that vote is over now anyway), but I think a lot of people (including
> > potential users of Samza) would like to see more about how it compares
> > & contrasts with other stream oriented systems.
> >
> >
> > Phil
> > This message optimized for indexing by NSA PRISM
> >
> >
> > On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> > > +1
> > >
> > > I would love to see the "documents comparing and contrasting Samza with
> > > MUPD8 and Storm."
> > >
> > >
> > > On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
> > >
> > >> +1 on incubation.
> > >>
> > >> Enis
> > >>
> > >>
> > >> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
> > >> wrote:
> > >>
> > >> > Hey Henry and Debo,
> > >> >
> > >> > Thanks for calling this out. Samza's feature set includes:
> > >> >
> > >> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
> > >> >   provides a very simple call-back based "process message" API that
> > >> >   should be familiar to anyone that's used Map/Reduce.
> > >> > - *Managed state:* Samza manages snapshotting and restoration of a
> > >> >   stream processor's state. Samza will restore a stream processor's
> > >> >   state to a snapshot consistent with the processor's last read
> > >> >   messages when the processor is restarted.
> > >> > - *Fault tolerance:* Samza will work with YARN to restart your stream
> > >> >   processor if there is a machine or processor failure.
> > >> > - *Durability:* Samza uses Kafka to guarantee that no messages will
> > >> >   ever be lost.
> > >> > - *Scalability:* Samza is partitioned and distributed at every level.
> > >> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
> > >> >   streams. YARN provides a distributed environment for Samza
> > >> >   containers to run in.
> > >> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> > >> >   Samza provides a pluggable API that lets you run Samza with other
> > >> >   messaging systems and execution environments.
> > >> > - *Processor isolation:* Samza works with Apache YARN, which supports
> > >> >   processor security through Hadoop's security model, and resource
> > >> >   isolation through Linux CGroups.
> > >> >
> > >> > Some of these features are available in S4, and some are not. The same
> > >> > holds true for Storm.
> > >> >
> > >> > The open source stream processing systems that are available are
> > >> > actually quite young, and no single system offers a complete solution.
> > >> > Problems like how a stream processor's state (e.g. counts) should be
> > >> > managed, whether a stream should be buffered remotely on disk or not,
> > >> > what to do when duplicate messages are received or messages are lost,
> > >> > and how to model underlying messaging systems are all pretty new.
> > >> >
> > >> > Samza's main differentiators are:
> > >> >
> > >> > - State is modeled as a stream. When a processor fails and is
> > >> >   restarted, the state stream is entirely replayed to restore it.
> > >> > - Streams are ordered, partitioned, replayable, and fault tolerant.
> > >> > - YARN is used for processor isolation, security, and fault tolerance.
> > >> > - All streams are materialized to Kafka.
> > >> >
> > >> > If you guys are interested, I have much more in-depth documents
> > >> > comparing and contrasting Samza with MUPD8 and Storm.
> > >> >
> > >> > Cheers,
> > >> > Chris
> > >> >
> > >> >
> > >> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> > >> >
> > >> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/)
> > >> > > which allows stream and real time data processing via DAG?
> > >> > >
> > >> > >
> > >> > > - Henry
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini@gmail.com> wrote:
> > >> > >
> > >> > > > Hey All,
> > >> > > >
> > >> > > > Send

Re: [PROPOSAL] Samza Proposal

2013-07-31 Thread Henry Saputra
Well, usually VOTE is conducted after discussion had calmed down. Looks
like this time the VOTE starts even though there were some questions about
the proposal.

Would be great to actually add links to the comparisons in the thread even
though the VOTE had concluded.

- Henry


On Wed, Jul 31, 2013 at 1:29 PM, Phillip Rhodes
wrote:

> Same here.  Not that it matters as far as admission to the incubator
> (that vote is over now anyway), but I think a lot of people (including
> potential users of Samza) would like to see more about how it compares
> & contrasts with other stream oriented systems.
>
>
> Phil
> This message optimized for indexing by NSA PRISM
>
>
> On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> > +1
> >
> > I would love to see the "documents comparing and contrasting Samza with
> > MUPD8 and Storm."
> >
> >
> > On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
> >
> >> +1 on incubation.
> >>
> >> Enis
> >>
> >>
> >> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
> >> wrote:
> >>
> >> > Hey Henry and Debo,
> >> >
> >> > Thanks for calling this out. Samza's feature set includes:
> >> >
> >> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
> >> >   provides a very simple call-back based "process message" API that
> >> >   should be familiar to anyone that's used Map/Reduce.
> >> > - *Managed state:* Samza manages snapshotting and restoration of a
> >> >   stream processor's state. Samza will restore a stream processor's
> >> >   state to a snapshot consistent with the processor's last read
> >> >   messages when the processor is restarted.
> >> > - *Fault tolerance:* Samza will work with YARN to restart your stream
> >> >   processor if there is a machine or processor failure.
> >> > - *Durability:* Samza uses Kafka to guarantee that no messages will
> >> >   ever be lost.
> >> > - *Scalability:* Samza is partitioned and distributed at every level.
> >> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
> >> >   streams. YARN provides a distributed environment for Samza
> >> >   containers to run in.
> >> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> >> >   Samza provides a pluggable API that lets you run Samza with other
> >> >   messaging systems and execution environments.
> >> > - *Processor isolation:* Samza works with Apache YARN, which supports
> >> >   processor security through Hadoop's security model, and resource
> >> >   isolation through Linux CGroups.
> >> >
> >> > Some of these features are available in S4, and some are not. The same
> >> > holds true for Storm.
> >> >
> >> > The open source stream processing systems that are available are
> >> > actually quite young, and no single system offers a complete solution.
> >> > Problems like how a stream processor's state (e.g. counts) should be
> >> > managed, whether a stream should be buffered remotely on disk or not,
> >> > what to do when duplicate messages are received or messages are lost,
> >> > and how to model underlying messaging systems are all pretty new.
> >> >
> >> > Samza's main differentiators are:
> >> >
> >> > - State is modeled as a stream. When a processor fails and is
> >> >   restarted, the state stream is entirely replayed to restore it.
> >> > - Streams are ordered, partitioned, replayable, and fault tolerant.
> >> > - YARN is used for processor isolation, security, and fault tolerance.
> >> > - All streams are materialized to Kafka.
> >> >
> >> > If you guys are interested, I have much more in-depth documents
> >> > comparing and contrasting Samza with MUPD8 and Storm.
> >> >
> >> > Cheers,
> >> > Chris
> >> >
> >> >
> >> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> >> >
> >> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/)
> >> > > which allows stream and real time data processing via DAG?
> >> > >
> >> > >
> >> > > - Henry
> >> > >
> >> > >
> >> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini@gmail.com> wrote:
> >> > >
> >> > > > Hey All,
> >> > > >
> >> > > > Sending along an incubator proposal for Samza.
> >> > > >
> >> > > > Thanks!
> >> > > > Chris
> >> > > >
> >> > > > https://wiki.apache.org/incubator/SamzaProposal
> >> > > >
> >> > > > 
> >> > > >
> >> > > > == Abstract ==
> >> > > >
> >> > > > Samza is a stream processing system for running continuous computation
> >> > > > on infinite streams of data.
> >> > > >
> >> > > > == Proposal ==
> >> > > >
> >> > > > Samza provides a system for processing stream data from
> >> > > > publish-subscribe systems such as Apache Kafka. The developer writes a
> >> > > > stream processing task, and executes it as a Samza job. Samza then
> >> > > > routes messages between stream processing tasks

Re: [PROPOSAL] Samza Proposal

2013-07-31 Thread Phillip Rhodes
Same here.  Not that it matters as far as admission to the incubator
(that vote is over now anyway), but I think a lot of people (including
potential users of Samza) would like to see more about how it compares
& contrasts with other stream oriented systems.


Phil
This message optimized for indexing by NSA PRISM


On Fri, Jul 26, 2013 at 8:27 PM, Alex Karasulu wrote:
> +1
>
> I would love to see the "documents comparing and contrasting Samza with
> MUPD8 and Storm."
>
>
> On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:
>
>> +1 on incubation.
>>
>> Enis
>>
>>
>> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
>> wrote:
>>
>> > Hey Henry and Debo,
>> >
>> > Thanks for calling this out. Samza's feature set includes:
>> >
>> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
>> >   provides a very simple call-back based "process message" API that
>> >   should be familiar to anyone that's used Map/Reduce.
>> > - *Managed state:* Samza manages snapshotting and restoration of a
>> >   stream processor's state. Samza will restore a stream processor's
>> >   state to a snapshot consistent with the processor's last read
>> >   messages when the processor is restarted.
>> > - *Fault tolerance:* Samza will work with YARN to restart your stream
>> >   processor if there is a machine or processor failure.
>> > - *Durability:* Samza uses Kafka to guarantee that no messages will
>> >   ever be lost.
>> > - *Scalability:* Samza is partitioned and distributed at every level.
>> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
>> >   streams. YARN provides a distributed environment for Samza
>> >   containers to run in.
>> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
>> >   Samza provides a pluggable API that lets you run Samza with other
>> >   messaging systems and execution environments.
>> > - *Processor isolation:* Samza works with Apache YARN, which supports
>> >   processor security through Hadoop's security model, and resource
>> >   isolation through Linux CGroups.
>> >
>> > Some of these features are available in S4, and some are not. The same
>> > holds true for Storm.
>> >
>> > The open source stream processing systems that are available are
>> > actually quite young, and no single system offers a complete solution.
>> > Problems like how a stream processor's state (e.g. counts) should be
>> > managed, whether a stream should be buffered remotely on disk or not,
>> > what to do when duplicate messages are received or messages are lost,
>> > and how to model underlying messaging systems are all pretty new.
>> >
>> > Samza's main differentiators are:
>> >
>> > - State is modeled as a stream. When a processor fails and is
>> >   restarted, the state stream is entirely replayed to restore it.
>> > - Streams are ordered, partitioned, replayable, and fault tolerant.
>> > - YARN is used for processor isolation, security, and fault tolerance.
>> > - All streams are materialized to Kafka.
>> >
>> > If you guys are interested, I have much more in-depth documents
>> > comparing and contrasting Samza with MUPD8 and Storm.
>> >
>> > Cheers,
>> > Chris
>> >
>> >
>> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra wrote:
>> >
>> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/)
>> > > which allows stream and real time data processing via DAG?
>> > >
>> > >
>> > > - Henry
>> > >
>> > >
>> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini@gmail.com> wrote:
>> > >
>> > > > Hey All,
>> > > >
>> > > > Sending along an incubator proposal for Samza.
>> > > >
>> > > > Thanks!
>> > > > Chris
>> > > >
>> > > > https://wiki.apache.org/incubator/SamzaProposal
>> > > >
>> > > > 
>> > > >
>> > > > == Abstract ==
>> > > >
>> > > > Samza is a stream processing system for running continuous computation
>> > > > on infinite streams of data.
>> > > >
>> > > > == Proposal ==
>> > > >
>> > > > Samza provides a system for processing stream data from
>> > > > publish-subscribe systems such as Apache Kafka. The developer writes a
>> > > > stream processing task, and executes it as a Samza job. Samza then
>> > > > routes messages between stream processing tasks and the
>> > > > publish-subscribe systems that the messages are addressed to.
>> > > >
>> > > > == Background ==
>> > > >
>> > > > Samza was developed at LinkedIn to enable easier processing of
>> > > > streaming data on top of Apache Kafka. Current use cases include
>> > > > content processing pipelines, aggregating operational log data, data
>> > > > ingestion into distributed database infrastructure, and measuring user
>> > > > activity across different aggregation types.
>> > > >
>> > > > Samza is focused on providing an easy to use framework to process
>> > > > streams. It uses Apache YARN to provid

Re: [PROPOSAL] Samza Proposal

2013-07-26 Thread Alex Karasulu
+1

I would love to see the "documents comparing and contrasting Samza with
MUPD8 and Storm."


On Sat, Jul 27, 2013 at 2:53 AM, Enis Söztutar wrote:

> +1 on incubation.
>
> Enis
>
>
> On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
> wrote:
>
> > Hey Henry and Debo,
> >
> > Thanks for calling this out. Samza's feature set includes:
> >
> > - *Simple API:* Unlike most low-level messaging system APIs, Samza
> >   provides a very simple call-back based "process message" API that
> >   should be familiar to anyone that's used Map/Reduce.
> > - *Managed state:* Samza manages snapshotting and restoration of a
> >   stream processor's state. Samza will restore a stream processor's
> >   state to a snapshot consistent with the processor's last read
> >   messages when the processor is restarted.
> > - *Fault tolerance:* Samza will work with YARN to restart your stream
> >   processor if there is a machine or processor failure.
> > - *Durability:* Samza uses Kafka to guarantee that no messages will
> >   ever be lost.
> > - *Scalability:* Samza is partitioned and distributed at every level.
> >   Kafka provides ordered, partitioned, replayable, fault-tolerant
> >   streams. YARN provides a distributed environment for Samza
> >   containers to run in.
> > - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
> >   Samza provides a pluggable API that lets you run Samza with other
> >   messaging systems and execution environments.
> > - *Processor isolation:* Samza works with Apache YARN, which supports
> >   processor security through Hadoop's security model, and resource
> >   isolation through Linux CGroups.
> >
> > Some of these features are available in S4, and some are not. The same
> > holds true for Storm.
> >
> > The open source stream processing systems that are available are
> > actually quite young, and no single system offers a complete solution.
> > Problems like how a stream processor's state (e.g. counts) should be
> > managed, whether a stream should be buffered remotely on disk or not,
> > what to do when duplicate messages are received or messages are lost,
> > and how to model underlying messaging systems are all pretty new.
> >
> > Samza's main differentiators are:
> >
> > - State is modeled as a stream. When a processor fails and is
> >   restarted, the state stream is entirely replayed to restore it.
> > - Streams are ordered, partitioned, replayable, and fault tolerant.
> > - YARN is used for processor isolation, security, and fault tolerance.
> > - All streams are materialized to Kafka.
> >
> > If you guys are interested, I have much more in-depth documents
> > comparing and contrasting Samza with MUPD8 and Storm.
> >
> > Cheers,
> > Chris
> >
> >
> > On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra wrote:
> >
> > > Looks like this is similar to S4 (http://incubator.apache.org/s4/)
> > > which allows stream and real time data processing via DAG?
> > >
> > >
> > > - Henry
> > >
> > >
> > > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco <criccomini@gmail.com> wrote:
> > >
> > > > Hey All,
> > > >
> > > > Sending along an incubator proposal for Samza.
> > > >
> > > > Thanks!
> > > > Chris
> > > >
> > > > https://wiki.apache.org/incubator/SamzaProposal
> > > >
> > > > 
> > > >
> > > > == Abstract ==
> > > >
> > > > Samza is a stream processing system for running continuous computation
> > > > on infinite streams of data.
> > > >
> > > > == Proposal ==
> > > >
> > > > Samza provides a system for processing stream data from
> > > > publish-subscribe systems such as Apache Kafka. The developer writes a
> > > > stream processing task, and executes it as a Samza job. Samza then
> > > > routes messages between stream processing tasks and the
> > > > publish-subscribe systems that the messages are addressed to.
> > > >
> > > > == Background ==
> > > >
> > > > Samza was developed at LinkedIn to enable easier processing of
> > > > streaming data on top of Apache Kafka. Current use cases include
> > > > content processing pipelines, aggregating operational log data, data
> > > > ingestion into distributed database infrastructure, and measuring user
> > > > activity across different aggregation types.
> > > >
> > > > Samza is focused on providing an easy to use framework to process
> > > > streams. It uses Apache YARN to provide a mechanism for deploying
> > > > stream processing tasks in a distributed cluster. Samza also takes
> > > > advantage of YARN to make decisions about stream processor locality,
> > > > co-partition of streams, and provide security. Apache Kafka is also
> > > > leveraged to provide a mechanism to pass messages from one stream
> > > > processor to the next. Apache Kafka is also used to help manage a
> > > > stream processor's state, so that it can be recovered in the event

Re: [PROPOSAL] Samza Proposal

2013-07-26 Thread Enis Söztutar
+1 on incubation.

Enis


On Tue, Jul 23, 2013 at 7:17 PM, Chris Riccomini
wrote:

> Hey Henry and Debo,
>
> Thanks for calling this out. Samza's feature set includes:
>
>- *Simple API:* Unlike most low-level messaging system APIs, Samza
>provides a very simple call-back based "process message" API that
> should be
>familiar to anyone that's used Map/Reduce.
>- *Managed state:* Samza manages snapshotting and restoration of a
>stream processor's state. Samza will restore a stream processor's state
> to
>a snapshot consistent with the processor's last read messages when the
>processor is restarted.
>- *Fault tolerance:* Samza will work with YARN to restart your stream
>processor if there is a machine or processor failure.
>- *Durability:* Samza uses Kafka to guarantee that no messages will ever
>be lost.
>- *Scalability:* Samza is partitioned and distributed at every level.
>Kafka provides ordered, partitioned, replayable, fault-tolerant streams.
>YARN provides a distributed environment for Samza containers to run in.
>- *Pluggable:* Though Samza works out of the box with Kafka and YARN,
>Samza provides a pluggable API that lets you run Samza with other
> messaging
>systems and execution environments.
>- *Processor isolation:* Samza works with Apache YARN, which supports
>processor security through Hadoop's security model, and resource
> isolation
>through Linux CGroups.
>
> Some of these features are available in S4, and some are not. The same holds
> true for Storm.
>
> The open source stream processing systems that are available are actually
> quite young, and no single system offers a complete solution. Problems like
> how a stream processor's state (e.g. counts) should be managed, whether a
> stream should be buffered remotely on disk or not, what to do when
> duplicate messages are received or messages are lost, and how to model
> underlying messaging systems are all pretty new.
>
> Samza's main differentiators are:
>
>- State is modeled as a stream. When a processor fails and is restarted,
>the state stream is entirely replayed to restore it.
>- Streams are ordered, partitioned, replayable, and fault tolerant.
>- YARN is used for processor isolation, security, and fault tolerance.
>- All streams are materialized to Kafka.
>
> If you guys are interested, I have much more in-depth documents comparing
> and contrasting Samza with MUPD8 and Storm.
>
> Cheers,
> Chris
>
>
> On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra wrote:
>
> > Looks like this is similar to S4 (http://incubator.apache.org/s4/) which
> > allows stream and real-time data processing via DAG?
> >
> >
> > - Henry
> >
> >
> > On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco wrote:
> >
> > > Hey All,
> > >
> > > Sending along an incubator proposal for Samza.
> > >
> > > Thanks!
> > > Chris
> > >
> > > https://wiki.apache.org/incubator/SamzaProposal
> > >
> > > 
> > >
> > > == Abstract ==
> > >
> > > Samza is a stream processing system for running continuous computation
> on
> > > infinite streams of data.
> > >
> > > == Proposal ==
> > >
> > > Samza provides a system for processing stream data from
> publish-subscribe
> > > systems such as Apache Kafka. The developer writes a stream processing
> > > task, and executes it as a Samza job. Samza then routes messages
> between
> > > stream processing tasks and the publish-subscribe systems that the
> > messages
> > > are addressed to.
> > >
> > > == Background ==
> > >
> > > Samza was developed at LinkedIn to enable easier processing of
> streaming
> > > data on top of Apache Kafka. Current use cases include content
> processing
> > > pipelines, aggregating operational log data, data ingestion into
> > > distributed database infrastructure, and measuring user activity across
> > > different aggregation types.
> > >
> > > Samza is focused on providing an easy to use framework to process
> > streams.
> > > It uses Apache YARN to provide a mechanism for deploying stream
> > processing
> > > tasks in a distributed cluster. Samza also takes advantage of YARN to
> > make
> > > decisions about stream processor locality, co-partition of streams, and
> > > provide security. Apache Kafka is also leveraged to provide a mechanism
> > to
> > > pass messages from one stream processor to the next. Apache Kafka is
> also
> > > used to help manage a stream processor's state, so that it can be
> > recovered
> > > in the event of a failure.
> > >
> > > Samza is written in Scala. It was developed internally at LinkedIn to
> > meet
> > > our particular use cases, but will be useful to many organizations
> > facing a
> > > similar need to reliably process large amounts of streaming data.
> > > > Therefore, we would like to share it with the ASF and begin developing a
> > > community of developers and users within Apache.
> > >
> > > == Rationale ==
> > >
> > > Many organizations can b

Re: [PROPOSAL] Samza Proposal

2013-07-26 Thread Chris Riccomini
Hey Marvin,

I think we pretty much agree with everything you've said. :)

We're definitely sensitive to the discuss-in-person issue. We'll pay
attention to it, and try to move conversation to the list when we see it
happening. The same holds true for the risk assessment comment. We're
sensitive to it. We'll try our best.

I talked with Jakob, and we'll hold off on the -user mailing list initially.

Thanks for the feedback!

Cheers,
Chris


On Thu, Jul 25, 2013 at 8:17 AM, Marvin Humphrey wrote:

> Hi,
>
> The current core developers are all from LinkedIn. However, we hope to
> establish a developer community that includes contributors from several
> corporations and we are actively encouraging new contributors via the
> mailing
> lists and public presentations of Samza.
>
> Collective experience in the Incubator suggests that keeping discussions
> on-list will be an important challenge for the prospective Samza podling.
> It's much more efficient to just discuss things in person when everyone
> works
> in the same office -- especially difficult architectural issues requiring a
> high-level perspective on the complete project.  However, given the history
> of both the individual contributors and LinkedIn itself, at least the podling
> ought to start off with a strong understanding of the challenge.
>
> The core developers plan to work full time on the project. There is
> very
> little risk of Samza being abandoned as it is part of LinkedIn's
> internal
> infrastructure.
>
> This risk assessment seems a tad optimistic to me ;) since I only trust
> commercial entities to do what's in the interest of the bottom line and
> that
> can change very rapidly.  (Witness IBM pulling all of its developers off of
> Harmony.)  But I certainly don't see any problem that ought to block entry
> into the Incubator.
>
> * samza-private for private PMC discussions (with moderated
> subscriptions)
> * samza-dev
> * samza-commits
> * samza-user
>
> I'd suggest foregoing the user list for now.  (See
> .)  You want highly
> engaged users who are likely to become developers.  It's best to keep
> everyone
> on the same list until dev traffic levels become burdensome to users; you
> don't have that problem yet.
>
> Nominated Mentors
>
> * Arun C Murthy 
> * Chris Douglas 
> * Roman Shaposhnik 
>
> The core developers may all work for LinkedIn, but that's a good, diverse
> list
> of Mentors in terms of affiliation.  (Hortonworks, Microsoft, and Cloudera,
> respectively.)
>
> Marvin Humphrey
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


Re: [PROPOSAL] Samza Proposal

2013-07-25 Thread Marvin Humphrey
Hi,

The current core developers are all from LinkedIn. However, we hope to
establish a developer community that includes contributors from several
corporations and we are actively encouraging new contributors via the mailing
lists and public presentations of Samza.

Collective experience in the Incubator suggests that keeping discussions
on-list will be an important challenge for the prospective Samza podling.
It's much more efficient to just discuss things in person when everyone works
in the same office -- especially difficult architectural issues requiring a
high-level perspective on the complete project.  However, given the history
of both the individual contributors and LinkedIn itself, at least the podling
ought to start off with a strong understanding of the challenge.

The core developers plan to work full time on the project. There is very
little risk of Samza being abandoned as it is part of LinkedIn's internal
infrastructure.

This risk assessment seems a tad optimistic to me ;) since I only trust
commercial entities to do what's in the interest of the bottom line and that
can change very rapidly.  (Witness IBM pulling all of its developers off of
Harmony.)  But I certainly don't see any problem that ought to block entry
into the Incubator.

* samza-private for private PMC discussions (with moderated subscriptions)
* samza-dev
* samza-commits
* samza-user

I'd suggest foregoing the user list for now.  (See
.)  You want highly
engaged users who are likely to become developers.  It's best to keep everyone
on the same list until dev traffic levels become burdensome to users; you
don't have that problem yet.

Nominated Mentors

* Arun C Murthy 
* Chris Douglas 
* Roman Shaposhnik 

The core developers may all work for LinkedIn, but that's a good, diverse list
of Mentors in terms of affiliation.  (Hortonworks, Microsoft, and Cloudera,
respectively.)

Marvin Humphrey




Re: [PROPOSAL] Samza Proposal

2013-07-24 Thread Lars Francke
Hi,

interesting proposal!

> If you guys are interested, I have much more in-depth documents comparing
> and contrasting Samza with MUPD8 and Storm.

I for one would be very interested in those documents.

Cheers,
Lars




Re: [PROPOSAL] Samza Proposal

2013-07-23 Thread Chris Riccomini
Hey Henry and Debo,

Thanks for calling this out. Samza's feature set includes:

   - *Simple API:* Unlike most low-level messaging system APIs, Samza
   provides a very simple call-back based "process message" API that should be
   familiar to anyone that's used Map/Reduce.
   - *Managed state:* Samza manages snapshotting and restoration of a
   stream processor's state. Samza will restore a stream processor's state to
   a snapshot consistent with the processor's last read messages when the
   processor is restarted.
   - *Fault tolerance:* Samza will work with YARN to restart your stream
   processor if there is a machine or processor failure.
   - *Durability:* Samza uses Kafka to guarantee that no messages will ever
   be lost.
   - *Scalability:* Samza is partitioned and distributed at every level.
   Kafka provides ordered, partitioned, replayable, fault-tolerant streams.
   YARN provides a distributed environment for Samza containers to run in.
   - *Pluggable:* Though Samza works out of the box with Kafka and YARN,
   Samza provides a pluggable API that lets you run Samza with other messaging
   systems and execution environments.
   - *Processor isolation:* Samza works with Apache YARN, which supports
   processor security through Hadoop's security model, and resource isolation
   through Linux CGroups.
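
As a rough illustration of the call-back style in the first bullet, here is a
minimal Python sketch. It is not Samza's actual API: `PageViewCounter`,
`process`, `emit`, and `run` are hypothetical names, and a plain list stands in
for a Kafka stream.

```python
from collections import defaultdict


class PageViewCounter:
    """Hypothetical task that counts messages per page (not Samza's real API)."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, message, emit):
        # Invoked once per incoming message, in stream order.
        page = message["page"]
        self.counts[page] += 1
        # Send the running count downstream.
        emit({"page": page, "count": self.counts[page]})


def run(task, stream):
    """Stand-in for the framework loop that feeds messages to the callback."""
    out = []
    for message in stream:
        task.process(message, out.append)
    return out


output = run(PageViewCounter(), [{"page": "/a"}, {"page": "/b"}, {"page": "/a"}])
```

The task author only writes `process`; reading from the input, delivering
output, and restarting after failure would be the framework's job in this
sketch.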

Some of these features are available in S4, and some are not. The same holds
true for Storm.

The open source stream processing systems that are available are actually
quite young, and no single system offers a complete solution. Problems like
how a stream processor's state (e.g. counts) should be managed, whether a
stream should be buffered remotely on disk or not, what to do when
duplicate messages are received or messages are lost, and how to model
underlying messaging systems are all pretty new.

Samza's main differentiators are:

   - State is modeled as a stream. When a processor fails and is restarted,
   the state stream is entirely replayed to restore it.
   - Streams are ordered, partitioned, replayable, and fault tolerant.
   - YARN is used for processor isolation, security, and fault tolerance.
   - All streams are materialized to Kafka.
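
The first point can be sketched in a few lines of Python. This is an
illustration under assumptions, not Samza's implementation: `ChangelogStore`
and its methods are hypothetical, and an in-memory list stands in for the
durable, replayable state stream.

```python
class ChangelogStore:
    """Hypothetical key-value state whose every write is also logged to a stream."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a durable, replayable stream
        self.data = {}              # local state, lost if the processor dies

    def put(self, key, value):
        # Record the change in the stream, then apply it locally.
        self.changelog.append((key, value))
        self.data[key] = value

    @classmethod
    def restore(cls, changelog):
        # Rebuild local state by replaying the whole stream in order;
        # later writes to the same key overwrite earlier ones.
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value
        return store


changelog = []
store = ChangelogStore(changelog)
store.put("/a", 1)
store.put("/b", 1)
store.put("/a", 2)

# Simulate a restart: only the changelog survives the failure.
restored = ChangelogStore.restore(changelog)
```

Because the stream is ordered, replaying it from the beginning always ends in
the same state the processor had before it failed.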

If you guys are interested, I have much more in-depth documents comparing
and contrasting Samza with MUPD8 and Storm.

Cheers,
Chris


On Tue, Jul 23, 2013 at 6:48 PM, Henry Saputra wrote:

> Looks like this is similar to S4 (http://incubator.apache.org/s4/) which
> allows stream and real-time data processing via DAG?
>
>
> - Henry
>
>
> On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco wrote:
>
> > Hey All,
> >
> > Sending along an incubator proposal for Samza.
> >
> > Thanks!
> > Chris
> >
> > https://wiki.apache.org/incubator/SamzaProposal
> >
> > 
> >
> > == Abstract ==
> >
> > Samza is a stream processing system for running continuous computation on
> > infinite streams of data.
> >
> > == Proposal ==
> >
> > Samza provides a system for processing stream data from publish-subscribe
> > systems such as Apache Kafka. The developer writes a stream processing
> > task, and executes it as a Samza job. Samza then routes messages between
> > stream processing tasks and the publish-subscribe systems that the
> messages
> > are addressed to.
> >
> > == Background ==
> >
> > Samza was developed at LinkedIn to enable easier processing of streaming
> > data on top of Apache Kafka. Current use cases include content processing
> > pipelines, aggregating operational log data, data ingestion into
> > distributed database infrastructure, and measuring user activity across
> > different aggregation types.
> >
> > Samza is focused on providing an easy to use framework to process
> streams.
> > It uses Apache YARN to provide a mechanism for deploying stream
> processing
> > tasks in a distributed cluster. Samza also takes advantage of YARN to
> make
> > decisions about stream processor locality, co-partition of streams, and
> > provide security. Apache Kafka is also leveraged to provide a mechanism
> to
> > pass messages from one stream processor to the next. Apache Kafka is also
> > used to help manage a stream processor's state, so that it can be
> recovered
> > in the event of a failure.
> >
> > Samza is written in Scala. It was developed internally at LinkedIn to
> meet
> > our particular use cases, but will be useful to many organizations
> facing a
> > similar need to reliably process large amounts of streaming data.
> > Therefore, we would like to share it with the ASF and begin developing a
> > community of developers and users within Apache.
> >
> > == Rationale ==
> >
> > Many organizations can benefit from a reliable stream processing system
> > such as Samza. While our use case of processing events from a large
> website
> > like LinkedIn has driven the design of Samza, its uses are varied and we
> > expect many new use cases to emerge. Samza provides a generic API to
> > process messages from streaming infrastructure and

Re: [PROPOSAL] Samza Proposal

2013-07-23 Thread Debo Dutta (dedutta)
Also add storm to the mix. Storm also allows you to do back edges.

debo

On 7/23/13 6:48 PM, "Henry Saputra"  wrote:

>Looks like this is similar to S4 (http://incubator.apache.org/s4/) which
>allows stream and real-time data processing via DAG?
>
>
>- Henry
>
>
>On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco
>wrote:
>
>> Hey All,
>>
>> Sending along an incubator proposal for Samza.
>>
>> Thanks!
>> Chris
>>
>> https://wiki.apache.org/incubator/SamzaProposal
>>
>> 
>>
>> == Abstract ==
>>
>> Samza is a stream processing system for running continuous computation
>>on
>> infinite streams of data.
>>
>> == Proposal ==
>>
>> Samza provides a system for processing stream data from
>>publish-subscribe
>> systems such as Apache Kafka. The developer writes a stream processing
>> task, and executes it as a Samza job. Samza then routes messages between
>> stream processing tasks and the publish-subscribe systems that the
>>messages
>> are addressed to.
>>
>> == Background ==
>>
>> Samza was developed at LinkedIn to enable easier processing of streaming
>> data on top of Apache Kafka. Current use cases include content
>>processing
>> pipelines, aggregating operational log data, data ingestion into
>> distributed database infrastructure, and measuring user activity across
>> different aggregation types.
>>
>> Samza is focused on providing an easy to use framework to process
>>streams.
>> It uses Apache YARN to provide a mechanism for deploying stream
>>processing
>> tasks in a distributed cluster. Samza also takes advantage of YARN to
>>make
>> decisions about stream processor locality, co-partition of streams, and
>> provide security. Apache Kafka is also leveraged to provide a mechanism
>>to
>> pass messages from one stream processor to the next. Apache Kafka is
>>also
>> used to help manage a stream processor's state, so that it can be
>>recovered
>> in the event of a failure.
>>
>> Samza is written in Scala. It was developed internally at LinkedIn to
>>meet
>> our particular use cases, but will be useful to many organizations
>>facing a
>> similar need to reliably process large amounts of streaming data.
>> Therefore, we would like to share it with the ASF and begin developing a
>> community of developers and users within Apache.
>>
>> == Rationale ==
>>
>> Many organizations can benefit from a reliable stream processing system
>> such as Samza. While our use case of processing events from a large
>>website
>> like LinkedIn has driven the design of Samza, its uses are varied and we
>> expect many new use cases to emerge. Samza provides a generic API to
>> process messages from streaming infrastructure and will appeal to many
>> users.
>>
>> == Current Status ==
>>
>> === Meritocracy ===
>>
>> Our intent with this incubator proposal is to start building a diverse
>> developer community around Samza following the Apache meritocracy model.
>> Since Samza was initially developed in late 2011, we have had fast
>>adoption
>> and contributions by multiple teams at LinkedIn. We plan to continue
>> support for new contributors and work with those who contribute
>> significantly to the project to make them committers.
>>
>> === Community ===
>>
>> Samza is currently being used internally at LinkedIn. We hope to extend
>>our
>> contributor base significantly and invite all those who are interested
>>in
>> building large-scale distributed systems to participate.
>>
>> === Core Developers ===
>>
>> Samza is currently being developed by four engineers at LinkedIn: Jay
>> Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
>> ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka
>>and
>> Giraph. Jay is a member of the Apache Kafka PMC and contributor to
>>various
>> Apache projects. Chris has been an active contributor for several
>>projects
>> including Apache Kafka and Apache YARN. Sriram has contributed to
>>Samza, as
>> well as Apache Kafka.
>>
>> === Alignment ===
>>
>> The ASF is the natural choice to host the Samza project as its goal of
>> encouraging community-driven open-source projects fits with our vision
>>for
>> Samza. Additionally, many other projects with which we are familiar
>> and expect Samza to integrate with, such as Apache ZooKeeper, YARN, HDFS
>> and log4j are hosted by the ASF and we will benefit and provide benefit
>>by
>> close proximity to them.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The core developers plan to work full time on the project. There is very
>> little risk of Samza being abandoned as it is part of LinkedIn's
>>internal
>> infrastructure.
>>
>> === Inexperience with Open Source ===
>>
>> All of the core developers have experience with open source development.
>> Jay and Chris have been involved with several open source projects
>>released
>> by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has been
>> actively involved with the ASF as a full-time Hadoop committer and PMC
>> me

Re: [PROPOSAL] Samza Proposal

2013-07-23 Thread Henry Saputra
Looks like this is similar to S4 (http://incubator.apache.org/s4/) which
allows stream and real-time data processing via DAG?


- Henry


On Tue, Jul 23, 2013 at 10:47 AM, Chris Ricco wrote:

> Hey All,
>
> Sending along an incubator proposal for Samza.
>
> Thanks!
> Chris
>
> https://wiki.apache.org/incubator/SamzaProposal
>
> 
>
> == Abstract ==
>
> Samza is a stream processing system for running continuous computation on
> infinite streams of data.
>
> == Proposal ==
>
> Samza provides a system for processing stream data from publish-subscribe
> systems such as Apache Kafka. The developer writes a stream processing
> task, and executes it as a Samza job. Samza then routes messages between
> stream processing tasks and the publish-subscribe systems that the messages
> are addressed to.
>
> == Background ==
>
> Samza was developed at LinkedIn to enable easier processing of streaming
> data on top of Apache Kafka. Current use cases include content processing
> pipelines, aggregating operational log data, data ingestion into
> distributed database infrastructure, and measuring user activity across
> different aggregation types.
>
> Samza is focused on providing an easy to use framework to process streams.
> It uses Apache YARN to provide a mechanism for deploying stream processing
> tasks in a distributed cluster. Samza also takes advantage of YARN to make
> decisions about stream processor locality, co-partition of streams, and
> provide security. Apache Kafka is also leveraged to provide a mechanism to
> pass messages from one stream processor to the next. Apache Kafka is also
> used to help manage a stream processor's state, so that it can be recovered
> in the event of a failure.
>
> Samza is written in Scala. It was developed internally at LinkedIn to meet
> our particular use cases, but will be useful to many organizations facing a
> similar need to reliably process large amounts of streaming data.
> Therefore, we would like to share it with the ASF and begin developing a
> community of developers and users within Apache.
>
> == Rationale ==
>
> Many organizations can benefit from a reliable stream processing system
> such as Samza. While our use case of processing events from a large website
> like LinkedIn has driven the design of Samza, its uses are varied and we
> expect many new use cases to emerge. Samza provides a generic API to
> process messages from streaming infrastructure and will appeal to many
> users.
>
> == Current Status ==
>
> === Meritocracy ===
>
> Our intent with this incubator proposal is to start building a diverse
> developer community around Samza following the Apache meritocracy model.
> Since Samza was initially developed in late 2011, we have had fast adoption
> and contributions by multiple teams at LinkedIn. We plan to continue
> support for new contributors and work with those who contribute
> significantly to the project to make them committers.
>
> === Community ===
>
> Samza is currently being used internally at LinkedIn. We hope to extend our
> contributor base significantly and invite all those who are interested in
> building large-scale distributed systems to participate.
>
> === Core Developers ===
>
> Samza is currently being developed by four engineers at LinkedIn: Jay
> Kreps, Jakob Homan, Sriram Subramanian, and Chris Riccomini. Jakob is an
> ASF Member, Incubator PMC member and PMC member on Apache Hadoop, Kafka and
> Giraph. Jay is a member of the Apache Kafka PMC and contributor to various
> Apache projects. Chris has been an active contributor for several projects
> including Apache Kafka and Apache YARN. Sriram has contributed to Samza, as
> well as Apache Kafka.
>
> === Alignment ===
>
> The ASF is the natural choice to host the Samza project as its goal of
> encouraging community-driven open-source projects fits with our vision for
> Samza. Additionally, many other projects with which we are familiar
> and expect Samza to integrate with, such as Apache ZooKeeper, YARN, HDFS
> and log4j are hosted by the ASF and we will benefit and provide benefit by
> close proximity to them.
>
> == Known Risks ==
>
> === Orphaned Products ===
>
> The core developers plan to work full time on the project. There is very
> little risk of Samza being abandoned as it is part of LinkedIn's internal
> infrastructure.
>
> === Inexperience with Open Source ===
>
> All of the core developers have experience with open source development.
> Jay and Chris have been involved with several open source projects released
> by LinkedIn, and Jay is a committer on Apache Kafka. Jakob has been
> actively involved with the ASF as a full-time Hadoop committer and PMC
> member. Sriram is a contributor to Apache Kafka.
>
> === Homogeneous Developers ===
>
> The current core developers are all from LinkedIn. However, we hope to
> establish a developer community that includes contributors from several
> corporations and we are actively encouraging new contri