Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-30 Thread Ted Dunning

Same reason you hear "can't trademark certain parts of speech".  Half-informed
zealots go off half-cocked.

For instance, here is apparently authoritative advice that a trademark can only 
be an adjective: 

http://www.ramseylawgroup.com/viewarticle.php?id=21

And here is a real world analysis:

http://itre.cis.upenn.edu/myl/languagelog/archives/000943.html

The fact is, I bought a BMW, not a BMW car.  It is still a trademark in spite of
my linguistic faux pas.

> On Jan 28, 2016, at 18:47, Greg Stein  wrote:
> 
> Hrm. Given that, I'm confused why I keep hearing "oh, natural word, can't
> be trademarked."

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Louis Suárez-Potts

> On 28 Jan 16, at 21:47, Greg Stein  wrote:
> 
> On Thu, Jan 28, 2016 at 6:29 PM, Doug Cutting  wrote:
> 
>> On Thu, Jan 28, 2016 at 3:11 PM, Greg Stein  wrote:
>>> As a regular English word, "beam" cannot be trademarked, by others/us.
>> 
>> Like Windows® or Apple®?
> 
> 
> oh, snap. True.
> 
> Hrm. Given that, I'm confused why I keep hearing "oh, natural word, can't
> be trademarked."
> 
> Thx,
> -g

"Snap" is not yet trademarked, so you get off "free."

But do look at this: Very clear, clean, concise, comprehensible.

http://www.inta.org/TrademarkBasics/FactSheets/Pages/TrademarksvsGenericTermsFactSheet.aspx

Relevant quotes:

Trademarks vs. Generic Terms

Updated, June 2015

1. What is meant by “generic term”?

Generic terms are common words or terms, often found in the dictionary, that 
identify products and services and are not specific to any particular source. 
It is not possible to register as a trademark a term that is generic for the 
goods and/or services identified in the application. If a trademark becomes 
generic, often as a result of improper use, rights in the mark may no longer be 
enforceable.

2. Are generic terms considered a category of trademarks?

In assessing their suitability as trademarks, words can be divided into five 
categories. These categories range from fanciful, invented words, which 
typically are strong trademarks, to generic terms, which are not protectable at 
all. The stronger the mark, the more protection it will be given against other 
marks.

The categories, ranked in decreasing order in terms of strength, are:

a. Fanciful Marks—coined (made-up) words that have no relation to the goods 
being described (e.g., EXXON for petroleum products).

***b. Arbitrary Marks—existing words that contribute no meaning to the goods 
being described (e.g., APPLE for computers).***

c. Suggestive Marks—words that suggest meaning or relation but that do not 
describe the goods themselves (e.g., COPPERTONE for suntan lotion).

d. Descriptive Marks—marks that describe either the goods or a characteristic 
of the goods. Often it is very difficult to enforce trademark rights in a 
descriptive mark unless the mark has acquired a secondary meaning (e.g., 
SHOELAND for a shoe store).

e. Generic Terms—words that are the accepted and recognized description of a 
class of goods or services (e.g., computer software, facial tissue).

Interestingly, and perhaps not surprising to some of us, Windows™ is far more 
plausibly a suitable descriptor of a software user interface that steps out of 
the confines of the command line (which I prefer, as it happens) than Apple is 
of an electronic calculating engine made of dirty silicon and shiny metal, 
colourful plastic, and exotic minerals. Unless, that is, one thinks of what 
apple means in relation to tempting knowledge that goes beyond good and evil.

Regarding "Beam." I think the items Inta.org offers give some guidance?

louis

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Greg Stein

On Thu, Jan 28, 2016 at 6:29 PM, Doug Cutting  wrote:

> On Thu, Jan 28, 2016 at 3:11 PM, Greg Stein  wrote:
> > As a regular English word, "beam" cannot be trademarked, by others/us.
>
> Like Windows® or Apple®?

oh, snap. True.

Hrm. Given that, I'm confused why I keep hearing "oh, natural word, can't
be trademarked."

Thx,
-g

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Doug Cutting

On Thu, Jan 28, 2016 at 3:11 PM, Greg Stein  wrote:
> As a regular English word, "beam" cannot be trademarked, by others/us.

Like Windows® or Apple®?

Doug

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Greg Stein

As a regular English word, "beam" cannot be trademarked, by others/us. Yet
the *pair* of words, "Apache Beam", can be implicitly/explicitly trademarked.

On Thu, Jan 28, 2016 at 11:51 AM, Alex Harui  wrote:

>
>
> On 1/28/16, 3:26 AM, "Jean-Baptiste Onofré"  wrote:
>
> >I prefer Beam ;)
>
> I like the name and the logic behind choosing it.  Some concerns are that
> a Google search of "Beam Software" turned up [1] and [2] among others,
> which might mean that Apache Beam won't work as a TLP name.  "Beam" is a
> good stem word so maybe you can add something to the front: "FlowBeam",
> "DataBeam", etc.
>
> -Alex
>
> [1] http://www.beamsoftware.com
> [2] https://earth.esa.int/web/sentinel/-/beam
>
>
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Alex Harui



On 1/28/16, 3:26 AM, "Jean-Baptiste Onofré"  wrote:

>I prefer Beam ;)

I like the name and the logic behind choosing it.  Some concerns are that
a Google search of "Beam Software" turned up [1] and [2] among others,
which might mean that Apache Beam won't work as a TLP name.  "Beam" is a
good stem word so maybe you can add something to the front: "FlowBeam",
"DataBeam", etc.

-Alex

[1] http://www.beamsoftware.com
[2] https://earth.esa.int/web/sentinel/-/beam

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Jean-Baptiste Onofré


LOL ;)

Regards
JB

On 01/28/2016 01:00 PM, Hadrian Zbarcea wrote:

Hi Bertrand,

Your suggested name only has an entropy of 0.19153. You may consider
adding a special character or two :). To standardize (every company
seems to like this lately) we may consider using SHA-256 for our project
names. Being of fixed length it will help Sally in her templates for
press releases.

Sorry, couldn't resist :)
Hadrian


On 01/28/2016 06:25 AM, Bertrand Delacretaz wrote:

On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber  wrote:

...Please ignore my last message, I missed the fact that a project
was already existing
with the name “Arrow”...


Hehe, that's the risk when using common names, with about 200 projects
here. Naming your project sdkjhkjhsdfxyhs is safer in this respect but
has other disadvantages.

-Bertrand

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Hadrian Zbarcea


Hi Bertrand,

Your suggested name only has an entropy of 0.19153. You may consider 
adding a special character or two :). To standardize (every company 
seems to like this lately) we may consider using SHA-256 for our project 
names. Being of fixed length it will help Sally in her templates for 
press releases.


Sorry, couldn't resist :)
Hadrian


On 01/28/2016 06:25 AM, Bertrand Delacretaz wrote:

On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber  wrote:

...Please ignore my last message, I missed the fact that a project was already 
existing
with the name “Arrow”...


Hehe, that's the risk when using common names, with about 200 projects
here. Naming your project sdkjhkjhsdfxyhs is safer in this respect but
has other disadvantages.

-Bertrand

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Jean-Baptiste Onofré


I prefer Beam ;)

Regards
JB

On 01/28/2016 12:25 PM, Bertrand Delacretaz wrote:

On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber  wrote:

...Please ignore my last message, I missed the fact that a project was already 
existing
with the name “Arrow”...


Hehe, that's the risk when using common names, with about 200 projects
here. Naming your project sdkjhkjhsdfxyhs is safer in this respect but
has other disadvantages.

-Bertrand

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Bertrand Delacretaz

On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber  wrote:
> ...Please ignore my last message, I missed the fact that a project was 
> already existing
> with the name “Arrow”...

Hehe, that's the risk when using common names, with about 200 projects
here. Naming your project sdkjhkjhsdfxyhs is safer in this respect but
has other disadvantages.

-Bertrand

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Serge Huber

Please ignore my last message; I missed the fact that a project named
“Arrow” already exists.

cheers,
  Serge… 


> On 28 Jan 2016, at 11:01, Bertrand Delacretaz wrote:
> 
> On Wed, Jan 27, 2016 at 6:22 PM, James Malone
>  wrote:
>> ...To that end, the name we propose to use is:
>> 
>> Apache Beam
> 
> The name sounds good to me and it's indeed a good idea to set it now.
> 
> -Bertrand
> 

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Serge Huber

On a lighter note…

Just to mess with you guys: if you’re looking for an alternative name, I might
suggest:

Apache Arrow

I thought of this because I was thinking of how an arrow flows through the air
and might be deflected; it is also associated with something sharp, dangerous,
and fast :) And of course an Apache would use an arrow more than a beam :)

Maybe Beam is more appropriate, but I just thought I’d put this out there.

cheers,
  Serge… 

> On 28 Jan 2016, at 11:01, Bertrand Delacretaz wrote:
> 
> On Wed, Jan 27, 2016 at 6:22 PM, James Malone
>  wrote:
>> ...To that end, the name we propose to use is:
>> 
>> Apache Beam
> 
> The name sounds good to me and it's indeed a good idea to set it now.
> 
> -Bertrand
> 

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-28 Thread Bertrand Delacretaz

On Wed, Jan 27, 2016 at 6:22 PM, James Malone
 wrote:
> ...To that end, the name we propose to use is:
>
> Apache Beam

The name sounds good to me and it's indeed a good idea to set it now.

-Bertrand

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-27 Thread James Malone

Hi everyone,

Based on the feedback concerning naming, we would like to rename the
proposal and the project. We want to do this early to ensure we don't
disrupt the project based on naming. To that end, the name we propose to
use is:

Apache Beam

The name Beam is based on a joining of Batch and strEAM to showcase the
unified nature of the model and tools. We also wanted to select a name
which is simple, memorable, and also not already in use.

Best,

James

On Fri, Jan 22, 2016 at 2:07 AM, Jean-Baptiste Onofré 
wrote:

> It makes perfect sense, and it's something that we already discussed.
>
> Thanks James and Marvin.
>
> @James, yes, we are going to deal with that together, not a problem at
> all. I agree that renaming should happen now.
> As discussed, we should be back with a new name early next week.
>
> I'm happy to see the discussion now (and thanks again Marvin for details
> and always helpful messages): it's exactly the purpose of sending the
> discussion thread on the incubator mailing list.
>
> Thanks guys !
>
> Regards
> JB
>
>
> On 01/22/2016 02:19 AM, James Malone wrote:
>
>> Thank you for such a detailed response Marvin!
>>
>> Everything you mention makes a lot of sense. Needless to say, we don't want
>> to squander cycles, break any rules, or throw velocity into disarray all
>> due to a name.
>>
>> To that end, I am going to work with JB to amend the proposal with respect
>> to renaming. I'm also going to clarify that a name change would be an
>> immediate-term to-do item so it does not block creation of lists,
>> repositories, and so on.
>>
>> Best,
>>
>> James
>>
>> I am going to work with JB to amend the proposal to indicate
>>
>> On Thu, Jan 21, 2016 at 9:30 AM, Marvin Humphrey wrote:
>>
>>> On Wed, Jan 20, 2016 at 3:30 PM, James Malone wrote:
>>>
>>>> If we need to rename, we would ideally choose a new name, change the
>>>> project name at that time, and start our refactoring with that new name.
>>>> Is it acceptable for us to flag a name change as something we need to do
>>>> as a near-term (1st month) item in incubation (if accepted)? If a rename
>>>> is required I'd like to add it to our to-do roadmap but also not block our
>>>> proposal on a renaming. I ask so we can address this concern in the best
>>>> way possible.
>>>
>>> That's acceptable.  Project naming issues do not block entry into the
>>> Incubator, they block graduation from the Incubator.
>>>
>>> Because "dataflow" is descriptive, it will be hard to defend as
>>> a trademark.  The Wikipedia article on trademark distinctiveness explains
>>> things well:
>>>
>>>  https://en.wikipedia.org/wiki/Trademark_distinctiveness
>>>
>>> A weak mark both increases the amount of volunteer effort that goes
>>> into dealing with infringement cases and makes bad outcomes more likely.
>>> It is not an absolute requirement that Apache projects have defensible
>>> names, but painful past experience has taught us that mishandled branding
>>> can deal surprising amounts of damage to a project community.
>>>
>>> But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache
>>> Dataflow" is a blocker.  One or the other will have to be renamed, and
>>> since the software is being donated but apparently not the brand, it
>>> sounds like renaming the prospective Apache project will be required and
>>> you should add that task to your roadmap.
>>>
>>> Changing names in the middle of incubation is disruptive because it
>>> requires renaming infrastructure resources, impacting both the Apache
>>> Infrastructure team and also the podling's developer and user communities.
>>> My suggestion would be that immediately after the VOTE to enter incubation
>>> concludes, you only create a dev mailing list and deal with the renaming
>>> immediately, delaying the creation of other resources until after the
>>> renaming is resolved.
>>> However, the exact plan is something you can work out with your Mentors.
>>>
>>> Marvin Humphrey
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-26 Thread Jean-Baptiste Onofré


Hi Renaud and Bertrand,

No worries Bertrand ! And thanks Renaud ;)

Regards
JB

On 01/26/2016 04:19 PM, Bertrand Delacretaz wrote:

Bonjour Renaud,

On Tue, Jan 26, 2016 at 4:04 PM, Renaud Richardet  wrote:

...Please add me to “Additional Interested Contributors” section as well


I've done this, happily! (JB I hope you don't mind).

(I know Renaud for quite some time and I think he can make great
contributions to Dataflow).

-Bertrand

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-26 Thread Bertrand Delacretaz

Bonjour Renaud,

On Tue, Jan 26, 2016 at 4:04 PM, Renaud Richardet  wrote:
> ...Please add me to “Additional Interested Contributors” section as well

I've done this, happily! (JB I hope you don't mind).

(I have known Renaud for quite some time and I think he can make great
contributions to Dataflow).

-Bertrand

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-26 Thread Renaud Richardet

Bonjour,

Please add me to “Additional Interested Contributors” section as well.

I am an Apache UIMA committer, and would like to use Dataflow to process
large amounts of text [1]. I just started a POC [2] and really like the API
so far.

Thanks, Renaud


[1] https://github.com/BlueBrain/bluima
[2] https://github.com/renaud/textdataflow

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-24 Thread Jean-Baptiste Onofré


Hey Ajay,

great: I added you on the proposal.

Thanks !
Regards
JB

On 01/25/2016 06:25 AM, Ajay Yadav wrote:

Great proposal. I would also like to contribute to the project especially
the Python SDK, if possible.

Cheers
Ajay Yadava

On Sun, Jan 24, 2016 at 1:25 AM, Jean-Baptiste Onofré wrote:

Hi Seshu,

it does both: streaming and batching data processing.

Regards
JB

On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:

Did not get a chance to play with it yet, Within Google is it used more as
a MR replacement or a Stream processing engine? Or it does both of them
fantastically well?

On 1/22/16, 10:58 AM, "Frances Perry" wrote:

Crunch started as a clone of FlumeJava, which was Google internal. In the
meantime inside Google, FlumeJava evolved into Dataflow. So all three share
a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
adds a number of new things -- the biggest being a unified batch/streaming
semantics using concepts like Windowing and Triggers. Tyler Akidau's
OReilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On Fri, Jan 22, 2016 at 10:42 AM, Ashish wrote:

Crunch has Spark pipelines, but not sure about the runner abstraction.

May be Josh Wills or Tom White can provide more insight on this topic.
They are core devs for both projects :)

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré wrote:

Hi,

I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce pipeline, it
doesn't provide runner abstraction. It's based on FlumeJava.

The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
wrong, but Crunch started after Google Dataflow, especially because Dataflow
was not opensourced at that time.

So, I agree it's very similar/close.

Regards
JB

On 01/22/2016 05:51 PM, Ashish wrote:

Hi JB,

Curious to know about how it compares to Apache Crunch? Constructs
looks very familiar (had used Crunch long ago)

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré wrote:

Hi Seshu,

I blogged about Apache Dataflow proposal:
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

You can see in the "what's next ?" section that new runners, skins and
sources are on our roadmap. Definitely, a storm runner could be part of
this.

Regards
JB

On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:

Awesome to see CloudDataFlow coming to Apache. The Stream Processing area
has been in general fragmented with a variety of solutions, hoping the
community galvanizes around Apache Data Flow.

We are still in the "Apache Storm" world, Any chance for folks building a
"Storm Runner"?

On 1/20/16, 9:39 AM, "James Malone" wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

Thank you and fair point. We have a few additional ideas which we can put
into the Community section.

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

This is a great idea and I think it makes a lot of sense to add an
"Additional Interested Contributors" section to the proposal.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James

= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data proc

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-24 Thread Ajay Yadav

Great proposal. I would also like to contribute to the project especially
the Python SDK, if possible.

Cheers
Ajay Yadava

On Sun, Jan 24, 2016 at 1:25 AM, Jean-Baptiste Onofré 
wrote:

> Hi Seshu,
>
> it does both: streaming and batching data processing.
>
> Regards
> JB
>
> On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:
>
>> Did not get a chance to play with it yet, Within Google is it used more as
>> a MR replacement or a Stream processing engine? Or it does both of them
>> fantastically well?
>>
>>
>> On 1/22/16, 10:58 AM, "Frances Perry" wrote:
>>
>>> Crunch started as a clone of FlumeJava, which was Google internal. In the
>>> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
>>> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
>>> adds a number of new things -- the biggest being a unified batch/streaming
>>> semantics using concepts like Windowing and Triggers. Tyler Akidau's
>>> OReilly post has a really nice explanation:
>>> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>>>
>>> On Fri, Jan 22, 2016 at 10:42 AM, Ashish wrote:
>>>
>>>> Crunch has Spark pipelines, but not sure about the runner abstraction.
>>>>
>>>> May be Josh Wills or Tom White can provide more insight on this topic.
>>>> They are core devs for both projects :)
>>>>
>>>> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce
>>>>> pipeline, it doesn't provide runner abstraction. It's based on FlumeJava.
>>>>>
>>>>> The logic is very similar (with DoFns, pipelines, ...). Correct me if
>>>>> I'm wrong, but Crunch started after Google Dataflow, especially because
>>>>> Dataflow was not opensourced at that time.
>>>>>
>>>>> So, I agree it's very similar/close.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/22/2016 05:51 PM, Ashish wrote:
>>>>>
>>>>>> Hi JB,
>>>>>>
>>>>>> Curious to know about how it compares to Apache Crunch? Constructs
>>>>>> looks very familiar (had used Crunch long ago)
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> - Ashish
>>>>>>
>>>>>> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré wrote:
>>>>>>
>>>>>>> Hi Seshu,
>>>>>>>
>>>>>>> I blogged about Apache Dataflow proposal:
>>>>>>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>>>>>>
>>>>>>> You can see in the "what's next ?" section that new runners, skins
>>>>>>> and sources are on our roadmap. Definitely, a storm runner could be
>>>>>>> part of this.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>>>>>>>
>>>>>>>> Awesome to see CloudDataFlow coming to Apache. The Stream
>>>>>>>> Processing area has been in general fragmented with a variety of
>>>>>>>> solutions, hoping the community galvanizes around Apache Data Flow.
>>>>>>>>
>>>>>>>> We are still in the "Apache Storm" world, Any chance for folks
>>>>>>>> building a "Storm Runner"?
>>>>>>>>
>>>>>>>> On 1/20/16, 9:39 AM, "James Malone" wrote:
>>>>>>>>
>>>>>>>>>> Great proposal. I like that your proposal includes a well
>>>>>>>>>> presented roadmap, but I don't see any goals that directly
>>>>>>>>>> address building a larger community. Y'all have any ideas around
>>>>>>>>>> outreach that will help with adoption?
>>>>>>>>>
>>>>>>>>> Thank you and fair point. We have a few additional ideas which we
>>>>>>>>> can put into the Community section.
>>>>>>>>>
>>>>>>>>>> As a start, I recommend y'all add a section to the proposal on
>>>>>>>>>> the wiki page for "Additional Interested Contributors" so that
>>>>>>>>>> folks who want to sign up to participate in the project can do so
>>>>>>>>>> without requesting additions to the initial committer list.
>>>>>>>>>
>>>>>>>>> This is a great idea and I think it makes a lot of sense to add an
>>>>>>>>> "Additional Interested Contributors" section to the proposal.
>>>>>>>>>
>>>>>>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>>>>>>>>>> jamesmal...@google.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>
>>>>>>>>>>> Attached to this message is a proposed new project - Apache
>>>>>>>>>>> Dataflow, a unified programming model for data processing and
>>>>>>>>>>> integration.
>>>>>>>>>>>
>>>>>>>>>>> The text of the proposal is included below. Additionally, the
>>>>>>>>>>> proposal

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-23 Thread Jean-Baptiste Onofré


Hi Seshu,

It does both: streaming and batch data processing.

Regards
JB

On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:

Did not get a chance to play with it yet, Within Google is it used more as
a MR replacement or a Stream processing engine? Or it does both of them
fantastically well?


On 1/22/16, 10:58 AM, "Frances Perry" wrote:

Crunch started as a clone of FlumeJava, which was Google internal. In the
meantime inside Google, FlumeJava evolved into Dataflow. So all three share
a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
adds a number of new things -- the biggest being a unified batch/streaming
semantics using concepts like Windowing and Triggers. Tyler Akidau's
OReilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On Fri, Jan 22, 2016 at 10:42 AM, Ashish wrote:

Crunch has Spark pipelines, but not sure about the runner abstraction.

May be Josh Wills or Tom White can provide more insight on this topic.
They are core devs for both projects :)

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré wrote:

Hi,

I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce pipeline, it
doesn't provide runner abstraction. It's based on FlumeJava.

The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
wrong, but Crunch started after Google Dataflow, especially because Dataflow
was not opensourced at that time.

So, I agree it's very similar/close.

Regards
JB

On 01/22/2016 05:51 PM, Ashish wrote:

Hi JB,

Curious to know about how it compares to Apache Crunch? Constructs
looks very familiar (had used Crunch long ago)

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré wrote:

Hi Seshu,

I blogged about Apache Dataflow proposal:
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

You can see in the "what's next ?" section that new runners, skins and
sources are on our roadmap. Definitely, a storm runner could be part of
this.

Regards
JB

On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:

Awesome to see CloudDataFlow coming to Apache. The Stream Processing area
has been in general fragmented with a variety of solutions, hoping the
community galvanizes around Apache Data Flow.

We are still in the "Apache Storm" world, Any chance for folks building a
"Storm Runner"?

On 1/20/16, 9:39 AM, "James Malone" wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

Thank you and fair point. We have a few additional ideas which we can put
into the Community section.

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

This is a great idea and I think it makes a lot of sense to add an
"Additional Interested Contributors" section to the proposal.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James

= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-23 Thread Adunuthula, Seshu

I have not had a chance to play with it yet. Within Google, is it used more
as a MapReduce replacement or as a stream processing engine? Or does it do
both of them fantastically well?


On 1/22/16, 10:58 AM, "Frances Perry"  wrote:

>Crunch started as a clone of FlumeJava, which was Google internal. In the
>meantime inside Google, FlumeJava evolved into Dataflow. So all three
>share
>a number of concepts like PCollections, ParDo, DoFn, etc. However,
>Dataflow
>adds a number of new things -- the biggest being a unified batch/streaming
>semantics using concepts like Windowing and Triggers. Tyler Akidau's
>OReilly post has a really nice explanation:
>https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
>On Fri, Jan 22, 2016 at 10:42 AM, Ashish  wrote:
>
>> Crunch has Spark pipelines, but not sure about the runner abstraction.
>>
>> May be Josh Wills or Tom White can provide more insight on this topic.
>> They are core devs for both projects :)
>>
>> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré 
>> wrote:
>> > Hi,
>> >
>> > I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce
>> pipeline, it
>> > doesn't provide runner abstraction. It's based on FlumeJava.
>> >
>> > The logic is very similar (with DoFns, pipelines, ...). Correct me if
>>I'm
>> > wrong, but Crunch started after Google Dataflow, especially because
>> Dataflow
>> > was not opensourced at that time.
>> >
>> > So, I agree it's very similar/close.
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 01/22/2016 05:51 PM, Ashish wrote:
>> >>
>> >> Hi JB,
>> >>
>> >> Curious to know about how it compares to Apache Crunch? Constructs
>> >> looks very familiar (had used Crunch long ago)
>> >>
>> >> Thoughts?
>> >>
>> >> - Ashish
>> >>
>> >> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré
>>
>> >> wrote:
>> >>>
>> >>> Hi Seshu,
>> >>>
>> >>> I blogged about Apache Dataflow proposal:
>> >>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>> >>>
>> >>> You can see in the "what's next ?" section that new runners, skins
>>and
>> >>> sources are on our roadmap. Definitely, a storm runner could be
>>part of
>> >>> this.
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>>
>> >>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>> 
>> 
>>  Awesome to see CloudDataFlow coming to Apache. The Stream
>>Processing
>>  area
>>  has been in general fragmented with a variety of solutions, hoping
>>the
>>  community galvanizes around Apache Data Flow.
>> 
>>  We are still in the "Apache Storm" world, Any chance for folks
>> building
>>  a
>>  "Storm Runner"?
>> 
>> 
>>  On 1/20/16, 9:39 AM, "James Malone"
>>
>>  wrote:
>> 
>> >> Great proposal. I like that your proposal includes a well
>>presented
>> >> roadmap, but I don't see any goals that directly address
>>building a
>> >> larger
>> >> community. Y'all have any ideas around outreach that will help
>>with
>> >> adoption?
>> >>
>> >
>> > Thank you and fair point. We have a few additional ideas which we
>>can
>> > put
>> > into the Community section.
>> >
>> >
>> >>
>> >> As a start, I recommend y'all add a section to the proposal on
>>the
>> >> wiki
>> >> page for "Additional Interested Contributors" so that folks who
>>want
>> >> to
>> >> sign up to participate in the project can do so without
>>requesting
>> >> additions to the initial committer list.
>> >>
>> >>
>> > This is a great idea and I think it makes a lot of sense to add an
>> > "Additional
>> > Interested Contributors" section to the proposal.
>> >
>> >
>> >> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>> >> jamesmal...@google.com.invalid> wrote:
>> >>
>> >>> Hello everyone,
>> >>>
>> >>> Attached to this message is a proposed new project - Apache
>> Dataflow,
>> >>
>> >>
>> >> a
>> >>>
>> >>>
>> >>> unified programming model for data processing and integration.
>> >>>
>> >>> The text of the proposal is included below. Additionally, the
>> >>
>> >>
>> >> proposal is
>> >>>
>> >>>
>> >>> in draft form on the wiki where we will make any required
>>changes:
>> >>>
>> >>> https://wiki.apache.org/incubator/DataflowProposal
>> >>>
>> >>> We look forward to your feedback and input.
>> >>>
>> >>> Best,
>> >>>
>> >>> James
>> >>>
>> >>> 
>> >>>
>> >>> = Apache Dataflow =
>> >>>
>> >>> == Abstract ==
>> >>>
>> >>> Dataflow is an open source, unified model and set of
>> >>> language-specific
>> >>
>> >>
>> >> SDKs
>> >>>
>> >>>
>> >>> for defining and executing data processing workflows, and also
>>data
>> >>> ingestion and integration flows, supporting Enterprise
>>Integration
>> >>
>> >>
>> >> Patterns
>> >>>
>> >>>
>> >>> (EIPs) and Domain Specific Languages (DSLs). Dataflow

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-23 Thread Luke Han

Great proposal!

I agree with Stian: the name Dataflow has been widely used for a long time,
so maybe we should think about another one.

Thanks.



Best Regards!
-

Luke Han

On Sat, Jan 23, 2016 at 2:17 PM, Alex Harui  wrote:

>
>
> On 1/22/16, 10:58 AM, "Frances Perry"  wrote:
>
> >Crunch started as a clone of FlumeJava, which was Google internal. In the
> >meantime inside Google, FlumeJava evolved into Dataflow. So all three
> >share
> >a number of concepts like PCollections, ParDo, DoFn, etc. However,
> >Dataflow
> >adds a number of new things -- the biggest being a unified batch/streaming
> >semantics using concepts like Windowing and Triggers.
>
> And somewhere in there might be your new podling name.  WinTrig or
> Wintrigue or something like that.
>
> -Alex
>
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Alex Harui



On 1/22/16, 10:58 AM, "Frances Perry"  wrote:

>Crunch started as a clone of FlumeJava, which was Google internal. In the
>meantime inside Google, FlumeJava evolved into Dataflow. So all three
>share
>a number of concepts like PCollections, ParDo, DoFn, etc. However,
>Dataflow
>adds a number of new things -- the biggest being a unified batch/streaming
>semantics using concepts like Windowing and Triggers.

And somewhere in there might be your new podling name.  WinTrig or
Wintrigue or something like that.

-Alex

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Ashish

Thanks, Frances! That explains it.

I wrote a couple of posts on basic usage of Crunch; maybe it's time to
rewrite them with Dataflow.

On Fri, Jan 22, 2016 at 10:58 AM, Frances Perry  wrote:
> Crunch started as a clone of FlumeJava, which was Google internal. In the
> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
> adds a number of new things -- the biggest being a unified batch/streaming
> semantics using concepts like Windowing and Triggers. Tyler Akidau's
> OReilly post has a really nice explanation:
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
> On Fri, Jan 22, 2016 at 10:42 AM, Ashish  wrote:
>
>> Crunch has Spark pipelines, but not sure about the runner abstraction.
>>
>> May be Josh Wills or Tom White can provide more insight on this topic.
>> They are core devs for both projects :)
>>
>> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré 
>> wrote:
>> > Hi,
>> >
>> > I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce
>> pipeline, it
>> > doesn't provide runner abstraction. It's based on FlumeJava.
>> >
>> > The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
>> > wrong, but Crunch started after Google Dataflow, especially because
>> Dataflow
>> > was not opensourced at that time.
>> >
>> > So, I agree it's very similar/close.
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 01/22/2016 05:51 PM, Ashish wrote:
>> >>
>> >> Hi JB,
>> >>
>> >> Curious to know about how it compares to Apache Crunch? Constructs
>> >> looks very familiar (had used Crunch long ago)
>> >>
>> >> Thoughts?
>> >>
>> >> - Ashish
>> >>
>> >> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré 
>> >> wrote:
>> >>>
>> >>> Hi Seshu,
>> >>>
>> >>> I blogged about Apache Dataflow proposal:
>> >>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>> >>>
>> >>> You can see in the "what's next ?" section that new runners, skins and
>> >>> sources are on our roadmap. Definitely, a storm runner could be part of
>> >>> this.
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>>
>> >>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>> 
>> 
>>  Awesome to see CloudDataFlow coming to Apache. The Stream Processing
>>  area
>>  has been in general fragmented with a variety of solutions, hoping the
>>  community galvanizes around Apache Data Flow.
>> 
>>  We are still in the "Apache Storm" world, Any chance for folks
>> building
>>  a
>>  "Storm Runner"?
>> 
>> 
>>  On 1/20/16, 9:39 AM, "James Malone" 
>>  wrote:
>> 
>> >> Great proposal. I like that your proposal includes a well presented
>> >> roadmap, but I don't see any goals that directly address building a
>> >> larger
>> >> community. Y'all have any ideas around outreach that will help with
>> >> adoption?
>> >>
>> >
>> > Thank you and fair point. We have a few additional ideas which we can
>> > put
>> > into the Community section.
>> >
>> >
>> >>
>> >> As a start, I recommend y'all add a section to the proposal on the
>> >> wiki
>> >> page for "Additional Interested Contributors" so that folks who want
>> >> to
>> >> sign up to participate in the project can do so without requesting
>> >> additions to the initial committer list.
>> >>
>> >>
>> > This is a great idea and I think it makes a lot of sense to add an
>> > "Additional
>> > Interested Contributors" section to the proposal.
>> >
>> >
>> >> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>> >> jamesmal...@google.com.invalid> wrote:
>> >>
>> >>> Hello everyone,
>> >>>
>> >>> Attached to this message is a proposed new project - Apache
>> Dataflow,
>> >>
>> >>
>> >> a
>> >>>
>> >>>
>> >>> unified programming model for data processing and integration.
>> >>>
>> >>> The text of the proposal is included below. Additionally, the
>> >>
>> >>
>> >> proposal is
>> >>>
>> >>>
>> >>> in draft form on the wiki where we will make any required changes:
>> >>>
>> >>> https://wiki.apache.org/incubator/DataflowProposal
>> >>>
>> >>> We look forward to your feedback and input.
>> >>>
>> >>> Best,
>> >>>
>> >>> James
>> >>>
>> >>> 
>> >>>
>> >>> = Apache Dataflow =
>> >>>
>> >>> == Abstract ==
>> >>>
>> >>> Dataflow is an open source, unified model and set of
>> >>> language-specific
>> >>
>> >>
>> >> SDKs
>> >>>
>> >>>
>> >>> for defining and executing data processing workflows, and also data
>> >>> ingestion and integration flows, supporting Enterprise Integration
>> >>
>> >>
>> >> Patterns
>> >>>
>> >>>
>> >>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>> >>
>> >>
>> >> simplify
>> >

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Frances Perry

Crunch started as a clone of FlumeJava, which was Google internal. In the
meantime inside Google, FlumeJava evolved into Dataflow. So all three share
a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
adds a number of new things -- the biggest being unified batch/streaming
semantics using concepts like Windowing and Triggers. Tyler Akidau's
O'Reilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
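
To make the windowing idea concrete, here is a minimal sketch in plain
Python. It is not the Dataflow SDK: the function names and the sample events
are made up for illustration, and triggers are not modelled at all. The only
point it shows is that grouping by event-time window gives the same answer
whether the input is a finite list (batch) or arrives one element at a time
(stream).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (event time in seconds, key); hypothetical shape


def fixed_window_start(event_time: int, size: int) -> int:
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events: Iterable[Event], size: int) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key), regardless of arrival order."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event_time, key in events:
        counts[(fixed_window_start(event_time, size), key)] += 1
    return dict(counts)


def as_stream(events: Iterable[Event]) -> Iterator[Event]:
    """Stand-in for an unbounded source: yields one event at a time."""
    for event in events:
        yield event


# Illustrative data only; note that the event at t=59 arrives out of order.
sample = [(1, "a"), (3, "b"), (61, "a"), (59, "a"), (62, "a")]

batch_result = count_per_window(sample, size=60)
stream_result = count_per_window(as_stream(sample), size=60)
```

Because events are assigned to windows by their own timestamps, the late
arrival at t=59 still lands in the first window. Deciding when a window's
result may be emitted is the job of triggers, which this sketch leaves out.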

On Fri, Jan 22, 2016 at 10:42 AM, Ashish  wrote:

> Crunch has Spark pipelines, but not sure about the runner abstraction.
>
> May be Josh Wills or Tom White can provide more insight on this topic.
> They are core devs for both projects :)
>
> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré 
> wrote:
> > Hi,
> >
> > I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce
> pipeline, it
> > doesn't provide runner abstraction. It's based on FlumeJava.
> >
> > The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
> > wrong, but Crunch started after Google Dataflow, especially because
> Dataflow
> > was not opensourced at that time.
> >
> > So, I agree it's very similar/close.
> >
> > Regards
> > JB
> >
> >
> > On 01/22/2016 05:51 PM, Ashish wrote:
> >>
> >> Hi JB,
> >>
> >> Curious to know about how it compares to Apache Crunch? Constructs
> >> looks very familiar (had used Crunch long ago)
> >>
> >> Thoughts?
> >>
> >> - Ashish
> >>
> >> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré 
> >> wrote:
> >>>
> >>> Hi Seshu,
> >>>
> >>> I blogged about Apache Dataflow proposal:
> >>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
> >>>
> >>> You can see in the "what's next ?" section that new runners, skins and
> >>> sources are on our roadmap. Definitely, a storm runner could be part of
> >>> this.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
> 
> 
>  Awesome to see CloudDataFlow coming to Apache. The Stream Processing
>  area
>  has been in general fragmented with a variety of solutions, hoping the
>  community galvanizes around Apache Data Flow.
> 
>  We are still in the "Apache Storm" world, Any chance for folks
> building
>  a
>  "Storm Runner"?
> 
> 
>  On 1/20/16, 9:39 AM, "James Malone" 
>  wrote:
> 
> >> Great proposal. I like that your proposal includes a well presented
> >> roadmap, but I don't see any goals that directly address building a
> >> larger
> >> community. Y'all have any ideas around outreach that will help with
> >> adoption?
> >>
> >
> > Thank you and fair point. We have a few additional ideas which we can
> > put
> > into the Community section.
> >
> >
> >>
> >> As a start, I recommend y'all add a section to the proposal on the
> >> wiki
> >> page for "Additional Interested Contributors" so that folks who want
> >> to
> >> sign up to participate in the project can do so without requesting
> >> additions to the initial committer list.
> >>
> >>
> > This is a great idea and I think it makes a lot of sense to add an
> > "Additional
> > Interested Contributors" section to the proposal.
> >
> >
> >> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
> >> jamesmal...@google.com.invalid> wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> Attached to this message is a proposed new project - Apache
> Dataflow,
> >>
> >>
> >> a
> >>>
> >>>
> >>> unified programming model for data processing and integration.
> >>>
> >>> The text of the proposal is included below. Additionally, the
> >>
> >>
> >> proposal is
> >>>
> >>>
> >>> in draft form on the wiki where we will make any required changes:
> >>>
> >>> https://wiki.apache.org/incubator/DataflowProposal
> >>>
> >>> We look forward to your feedback and input.
> >>>
> >>> Best,
> >>>
> >>> James
> >>>
> >>> 
> >>>
> >>> = Apache Dataflow =
> >>>
> >>> == Abstract ==
> >>>
> >>> Dataflow is an open source, unified model and set of
> >>> language-specific
> >>
> >>
> >> SDKs
> >>>
> >>>
> >>> for defining and executing data processing workflows, and also data
> >>> ingestion and integration flows, supporting Enterprise Integration
> >>
> >>
> >> Patterns
> >>>
> >>>
> >>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
> >>
> >>
> >> simplify
> >>>
> >>>
> >>> the mechanics of large-scale batch and streaming data processing
> and
> >>
> >>
> >> can
> >>>
> >>>
> >>> run on a number of runtimes like Apache Flink, Apache Spark, and
> >>
> >>
> >> Google
> >>>
> >>>
> >>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in
> >

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Ashish

Crunch has Spark pipelines, but I'm not sure about the runner abstraction.

Maybe Josh Wills or Tom White can provide more insight on this topic.
They are core devs for both projects :)

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré  wrote:
> Hi,
>
> I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce pipeline, it
> doesn't provide runner abstraction. It's based on FlumeJava.
>
> The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
> wrong, but Crunch started after Google Dataflow, especially because Dataflow
> was not opensourced at that time.
>
> So, I agree it's very similar/close.
>
> Regards
> JB
>
>
> On 01/22/2016 05:51 PM, Ashish wrote:
>>
>> Hi JB,
>>
>> Curious to know about how it compares to Apache Crunch? Constructs
>> looks very familiar (had used Crunch long ago)
>>
>> Thoughts?
>>
>> - Ashish
>>
>> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré 
>> wrote:
>>>
>>> Hi Seshu,
>>>
>>> I blogged about Apache Dataflow proposal:
>>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>>
>>> You can see in the "what's next ?" section that new runners, skins and
>>> sources are on our roadmap. Definitely, a storm runner could be part of
>>> this.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:


 Awesome to see CloudDataFlow coming to Apache. The Stream Processing
 area
 has been in general fragmented with a variety of solutions, hoping the
 community galvanizes around Apache Data Flow.

 We are still in the "Apache Storm" world, Any chance for folks building
 a
 "Storm Runner"?


 On 1/20/16, 9:39 AM, "James Malone" 
 wrote:

>> Great proposal. I like that your proposal includes a well presented
>> roadmap, but I don't see any goals that directly address building a
>> larger
>> community. Y'all have any ideas around outreach that will help with
>> adoption?
>>
>
> Thank you and fair point. We have a few additional ideas which we can
> put
> into the Community section.
>
>
>>
>> As a start, I recommend y'all add a section to the proposal on the
>> wiki
>> page for "Additional Interested Contributors" so that folks who want
>> to
>> sign up to participate in the project can do so without requesting
>> additions to the initial committer list.
>>
>>
> This is a great idea and I think it makes a lot of sense to add an
> "Additional
> Interested Contributors" section to the proposal.
>
>
>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>> jamesmal...@google.com.invalid> wrote:
>>
>>> Hello everyone,
>>>
>>> Attached to this message is a proposed new project - Apache Dataflow,
>>
>>
>> a
>>>
>>>
>>> unified programming model for data processing and integration.
>>>
>>> The text of the proposal is included below. Additionally, the
>>
>>
>> proposal is
>>>
>>>
>>> in draft form on the wiki where we will make any required changes:
>>>
>>> https://wiki.apache.org/incubator/DataflowProposal
>>>
>>> We look forward to your feedback and input.
>>>
>>> Best,
>>>
>>> James
>>>
>>> 
>>>
>>> = Apache Dataflow =
>>>
>>> == Abstract ==
>>>
>>> Dataflow is an open source, unified model and set of
>>> language-specific
>>
>>
>> SDKs
>>>
>>>
>>> for defining and executing data processing workflows, and also data
>>> ingestion and integration flows, supporting Enterprise Integration
>>
>>
>> Patterns
>>>
>>>
>>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>>
>>
>> simplify
>>>
>>>
>>> the mechanics of large-scale batch and streaming data processing and
>>
>>
>> can
>>>
>>>
>>> run on a number of runtimes like Apache Flink, Apache Spark, and
>>
>>
>> Google
>>>
>>>
>>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in
>>
>>
>> different
>>>
>>>
>>> languages, allowing users to easily implement their data integration
>>> processes.
>>>
>>> == Proposal ==
>>>
>>> Dataflow is a simple, flexible, and powerful system for distributed
>>
>>
>> data
>>>
>>>
>>> processing at any scale. Dataflow provides a unified programming
>>
>>
>> model, a
>>>
>>>
>>> software development kit to define and construct data processing
>>
>>
>> pipelines,
>>>
>>>
>>> and runners to execute Dataflow pipelines in several runtime engines,
>>
>>
>> like
>>>
>>>
>>> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be
>>
>>
>> used
>>>
>>>
>>> for a variety of streaming or batch data processing goals including
>>
>>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Jean-Baptiste Onofré


Hi,

I don't know Crunch deeply, but AFAIK Crunch creates MapReduce
pipelines; it doesn't provide a runner abstraction. It's based on FlumeJava.


The logic is very similar (with DoFns, pipelines, ...). Correct me if
I'm wrong, but Crunch started after Google Dataflow, especially because
Dataflow was not open sourced at that time.


So, I agree it's very similar/close.
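
A rough sketch of what "runner abstraction" means here, in plain Python
rather than the real SDK. Every class name below (Pipeline, LocalRunner,
ChunkedRunner) is hypothetical and only for illustration: the pipeline is
described once, and each runner decides how to execute it.

```python
from typing import Callable, Iterable, List


class Pipeline:
    """Hypothetical pipeline: an ordered list of per-element functions."""

    def __init__(self) -> None:
        self.steps: List[Callable[[int], int]] = []

    def apply(self, fn: Callable[[int], int]) -> "Pipeline":
        """Append a step and return self so calls can be chained."""
        self.steps.append(fn)
        return self


class Runner:
    """Common interface: a runner turns a pipeline plus input into output."""

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        raise NotImplementedError


class LocalRunner(Runner):
    """Runs every step on one element before moving to the next element."""

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        out = []
        for item in data:
            for step in pipeline.steps:
                item = step(item)
            out.append(item)
        return out


class ChunkedRunner(Runner):
    """Runs each step over fixed-size chunks, a stand-in for an engine
    that splits work into partitions."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def run(self, pipeline: Pipeline, data: Iterable[int]) -> List[int]:
        items = list(data)
        out: List[int] = []
        for i in range(0, len(items), self.chunk_size):
            chunk = items[i:i + self.chunk_size]
            for step in pipeline.steps:
                chunk = [step(x) for x in chunk]
            out.extend(chunk)
        return out


# The same pipeline definition is handed to two different runners.
pipeline = Pipeline().apply(lambda x: x + 1).apply(lambda x: x * 2)
local_result = LocalRunner().run(pipeline, [1, 2, 3])
chunked_result = ChunkedRunner(chunk_size=2).run(pipeline, [1, 2, 3])
```

The user code only touches Pipeline; swapping the execution engine means
swapping the Runner, which is the part Crunch is said above not to provide.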

Regards
JB

On 01/22/2016 05:51 PM, Ashish wrote:

Hi JB,

Curious to know about how it compares to Apache Crunch? Constructs
looks very familiar (had used Crunch long ago)

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré  wrote:

Hi Seshu,

I blogged about Apache Dataflow proposal:
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

You can see in the "what's next ?" section that new runners, skins and
sources are on our roadmap. Definitely, a storm runner could be part of
this.

Regards
JB


On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:


Awesome to see CloudDataFlow coming to Apache. The Stream Processing area
has been in general fragmented with a variety of solutions, hoping the
community galvanizes around Apache Data Flow.

We are still in the "Apache Storm" world, Any chance for folks building a
"Storm Runner"?


On 1/20/16, 9:39 AM, "James Malone" 
wrote:


Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a
larger
community. Y'all have any ideas around outreach that will help with
adoption?



Thank you and fair point. We have a few additional ideas which we can put
into the Community section.




As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.



This is a great idea and I think it makes a lot of sense to add an
"Additional
Interested Contributors" section to the proposal.



On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow,


a


unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the


proposal is


in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and Millwheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a "direct
runn

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Ashish

Hi JB,

I'm curious how it compares to Apache Crunch. The constructs look very
familiar (I used Crunch long ago).

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré  wrote:
> Hi Seshu,
>
> I blogged about Apache Dataflow proposal:
> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>
> You can see in the "what's next ?" section that new runners, skins and
> sources are on our roadmap. Definitely, a storm runner could be part of
> this.
>
> Regards
> JB
>
>
> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>>
>> Awesome to see CloudDataFlow coming to Apache. The Stream Processing area
>> has been in general fragmented with a variety of solutions, hoping the
>> community galvanizes around Apache Data Flow.
>>
>> We are still in the "Apache Storm" world, Any chance for folks building a
>> "Storm Runner"?
>>
>>
>> On 1/20/16, 9:39 AM, "James Malone" 
>> wrote:
>>
 Great proposal. I like that your proposal includes a well presented
 roadmap, but I don't see any goals that directly address building a
 larger
 community. Y'all have any ideas around outreach that will help with
 adoption?

>>>
>>> Thank you and fair point. We have a few additional ideas which we can put
>>> into the Community section.
>>>
>>>

 As a start, I recommend y'all add a section to the proposal on the wiki
 page for "Additional Interested Contributors" so that folks who want to
 sign up to participate in the project can do so without requesting
 additions to the initial committer list.


>>> This is a great idea and I think it makes a lot of sense to add an
>>> "Additional
>>> Interested Contributors" section to the proposal.
>>>
>>>
 On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
 jamesmal...@google.com.invalid> wrote:

> Hello everyone,
>
> Attached to this message is a proposed new project - Apache Dataflow,

 a
>
> unified programming model for data processing and integration.
>
> The text of the proposal is included below. Additionally, the

 proposal is
>
> in draft form on the wiki where we will make any required changes:
>
> https://wiki.apache.org/incubator/DataflowProposal
>
> We look forward to your feedback and input.
>
> Best,
>
> James
>
> 
>
> = Apache Dataflow =
>
> == Abstract ==
>
> Dataflow is an open source, unified model and set of language-specific

 SDKs
>
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration

 Patterns
>
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines

 simplify
>
> the mechanics of large-scale batch and streaming data processing and

 can
>
> run on a number of runtimes like Apache Flink, Apache Spark, and

 Google
>
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in

 different
>
> languages, allowing users to easily implement their data integration
> processes.
>
> == Proposal ==
>
> Dataflow is a simple, flexible, and powerful system for distributed

 data
>
> processing at any scale. Dataflow provides a unified programming

 model, a
>
> software development kit to define and construct data processing

 pipelines,
>
> and runners to execute Dataflow pipelines in several runtime engines,

 like
>
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be

 used
>
> for a variety of streaming or batch data processing goals including

 ETL,
>
> stream analysis, and aggregate computation. The underlying programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness

 control.
>
>
> == Background ==
>
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and Millwheel inside Google and is
> focused on providing a unified solution for batch and stream

 processing.
>
> These projects on which Dataflow is based have been published in

 several
>
> papers made available to the public:
>
> * MapReduce - http://research.google.com/archive/mapreduce.html
>
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>
> * MillWheel - http://research.google.com/pubs/pub41378.html
>
> Dataflow was designed from the start to provide a portable programming
> layer. When you define a data processing pipeline with the Dataflow

 model,

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Stian Soiland-Reyes

As a committer on Taverna, another "dataflow" project in the incubator, I
think this looks like an exciting proposal.

Agree on the confusion of the name, and it's probably better to get
that sorted early.

In Taverna we have used the term "dataflow" since 2004, and as a
concept the paradigm was created in the 1960s. So Dataflow is a bit
too broad and likely not trademarkable. Your model seems more of an
Event-driven workflow, as you explain in the paper.

You can do a renaming during the very first month of incubation (which
several incubator projects have done) - it's a simple way to engage
everyone in the newly formed/refreshed incubator community, who should
then feel ownership of the name decision, rather than letting a selected
few decide beforehand.

In your case you do not already have a single community mailing list
(?), so perhaps it would be harder to do this kind of community
decision as a GitHub issue?

Remember the later you rename, the more you have to rename, like
mailing list address, code repositories, package names, documentation,
website.. :)

On 20 January 2016 at 17:12, Marvin Humphrey  wrote:
> On Wed, Jan 20, 2016 at 8:32 AM, James Malone
>  wrote:
>
>> == Abstract ==
>>
>> Dataflow is an open source, unified model and set of language-specific SDKs
>> for defining and executing data processing workflows, and also data
>> ingestion and integration flows, supporting Enterprise Integration Patterns
>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
>> the mechanics of large-scale batch and streaming data processing and can
>> run on a number of runtimes like Apache Flink, Apache Spark, and Google
>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
>> languages, allowing users to easily implement their data integration
>> processes.
>
> In general this seems like an excellent project and a well-thought-through and
> viable proposal -- I certainly anticipate that it will be accepted for
> incubation in one form or another.
>
> However, how does this "Dataflow" project relate to the programming paradigm
> of "dataflow programming"?
>
> https://en.wikipedia.org/wiki/Dataflow_programming
>
> Besides the potential for confusion, it seems like the proposed project name
> would be tough to defend as a trademark.
>
>> With respect to trademark rights, Google does not hold a trademark on the
>> phrase “Dataflow.” Based on feedback and guidance we receive during the
>> incubation process, we are open to renaming the project if necessary for
>> trademark or other concerns.
>
> If a renaming is going to happen, there are advantages to renaming sooner
> rather than later and sparing the community additional disruption.
>
> Marvin Humphrey
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

-- 
Stian Soiland-Reyes
Apache Taverna (incubating), Apache Commons RDF (incubating)
http://orcid.org/-0001-9842-9718

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Adunuthula, Seshu

Awesome to see CloudDataFlow coming to Apache. The stream processing area
has in general been fragmented across a variety of solutions; I'm hoping the
community galvanizes around Apache Data Flow.

We are still in the "Apache Storm" world. Any chance of folks building a
"Storm Runner"?
 

On 1/20/16, 9:39 AM, "James Malone"  wrote:

>> Great proposal. I like that your proposal includes a well presented
>> roadmap, but I don't see any goals that directly address building a
>> larger community. Y'all have any ideas around outreach that will help
>> with adoption?
>>
>
>Thank you and fair point. We have a few additional ideas which we can put
>into the Community section.
>
>
>>
>> As a start, I recommend y'all add a section to the proposal on the wiki
>> page for "Additional Interested Contributors" so that folks who want to
>> sign up to participate in the project can do so without requesting
>> additions to the initial committer list.
>>
>>
>This is a great idea and I think it makes a lot of sense to add an
>"Additional Interested Contributors" section to the proposal.
>
>
>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>> jamesmal...@google.com.invalid> wrote:
>>
>> > Hello everyone,
>> >
>> > Attached to this message is a proposed new project - Apache Dataflow, a
>> > unified programming model for data processing and integration.
>> >
>> > The text of the proposal is included below. Additionally, the proposal
>> > is in draft form on the wiki where we will make any required changes:
>> >
>> > https://wiki.apache.org/incubator/DataflowProposal
>> >
>> > We look forward to your feedback and input.
>> >
>> > Best,
>> >
>> > James
>> >
>> > = Apache Dataflow =
>> >
>> > == Abstract ==
>> >
>> > Dataflow is an open source, unified model and set of language-specific
>> > SDKs for defining and executing data processing workflows, and also data
>> > ingestion and integration flows, supporting Enterprise Integration
>> > Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>> > simplify the mechanics of large-scale batch and streaming data
>> > processing and can run on a number of runtimes like Apache Flink, Apache
>> > Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings
>> > DSL in different languages, allowing users to easily implement their
>> > data integration processes.
>> >
>> > == Proposal ==
>> >
>> > Dataflow is a simple, flexible, and powerful system for distributed data
>> > processing at any scale. Dataflow provides a unified programming model,
>> > a software development kit to define and construct data processing
>> > pipelines, and runners to execute Dataflow pipelines in several runtime
>> > engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow.
>> > Dataflow can be used for a variety of streaming or batch data processing
>> > goals including ETL, stream analysis, and aggregate computation. The
>> > underlying programming model for Dataflow provides MapReduce-like
>> > parallelism, combined with support for powerful data windowing, and
>> > fine-grained correctness control.
>> >
>> > == Background ==
>> >
>> > Dataflow started as a set of Google projects focused on making data
>> > processing easier, faster, and less costly. The Dataflow model is a
>> > successor to MapReduce, FlumeJava, and Millwheel inside Google and is
>> > focused on providing a unified solution for batch and stream processing.
>> > These projects on which Dataflow is based have been published in several
>> > papers made available to the public:
>> >
>> > * MapReduce - http://research.google.com/archive/mapreduce.html
>> >
>> > * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>> >
>> > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>> >
>> > * MillWheel - http://research.google.com/pubs/pub41378.html
>> >
>> > Dataflow was designed from the start to provide a portable programming
>> > layer. When you define a data processing pipeline with the Dataflow
>> > model, you are creating a job which is capable of being processed by any
>> > number of Dataflow processing engines. Several engines have been
>> > developed to run Dataflow pipelines in other open source runtimes,
>> > including a Dataflow runner for Apache Flink and Apache Spark. There is
>> > also a "direct runner", for execution on the developer machine (mainly
>> > for dev/debug purposes). Another runner allows a Dataflow program to run
>> > on a managed service, Google Cloud Dataflow, in Google Cloud Platform.
>> > The Dataflow Java SDK is already available on GitHub, and independent
>> > from the Google Cloud Dataflow service. Another Python SDK is currently
>> > in active development.
>> >
>> > In this proposal, the Dataflow SDKs, model, and a set of runners will be
>> > submitted as an OSS project under the ASF. The runners which are a part
>> > of this proposal include those for Spark (from Cloudera), Flink

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Jean-Baptiste Onofré


Hi Seshu,

I blogged about the Apache Dataflow proposal: 
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/


You can see in the "what's next ?" section that new runners, skins and 
sources are on our roadmap. A Storm runner could definitely be part of 
this.


Regards
JB

On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:

Awesome to see CloudDataFlow coming to Apache. The Stream Processing area
has been in general fragmented with a variety of solutions, hoping the
community galvanizes around Apache Data Flow.

We are still in the "Apache Storm" world, Any chance for folks building a
"Storm Runner"?


On 1/20/16, 9:39 AM, "James Malone"  wrote:


Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a
larger community. Y'all have any ideas around outreach that will help
with adoption?



Thank you and fair point. We have a few additional ideas which we can put
into the Community section.




As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.



This is a great idea and I think it makes a lot of sense to add an
"Additional Interested Contributors" section to the proposal.



On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James

= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and Millwheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a "direct runner",
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Jean-Baptiste Onofré


Hi Supun,

I added you on the proposal.

Thanks !
Regards
JB

On 01/22/2016 12:37 AM, Supun Kamburugamuve wrote:

We are developing parallel machine learning algorithms for a research
project and are very interested in DataFlow. I would like to contribute to
this project as well. It will be great if you can add me.

Thanks,
Supun...


On Thu, Jan 21, 2016 at 6:29 PM, Mayank Bansal  wrote:


Hi Jean,

Nice Proposal.

I wanted to contribute to this project. Can you please add me too?

Thanks a lot for the help

Thanks,
Mayank

On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré 
wrote:


Hey Alex,

awesome: I added you on the proposal.

Thanks,
Regards
JB


On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:


Hi,

it's great to see DataFlow becoming part of the Apache ecosystem, thank you
for bringing it in.
I would be happy to get involved and help.

--
Alex

On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré 
wrote:

Perfect: done, you are on the proposal.


Thanks !
Regards
JB


On 01/21/2016 11:55 AM, chatz wrote:

Charitha Elvitigala


On 21 January 2016 at 16:17, Jean-Baptiste Onofré 
wrote:

Hi Chatz,



sure, what name should I use on the proposal, Charitha ?

Regards
JB


On 01/21/2016 11:32 AM, chatz wrote:

Hi Jean,



I’d be interested in contributing as well.

Thanks,

Chatz


On 21 January 2016 at 14:22, Jean-Baptiste Onofré 
wrote:

Sweet: you are on the proposal ;)



Thanks !
Regards
JB


On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:

This looks very interesting. I'm interested in contributing.



Thanks.
-Gon

---
Byung-Gon Chun



Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Jean-Baptiste Onofré


Hi Tsuyoshi

Awesome: I added you on the proposal.

Thanks !
Regards
JB

On 01/22/2016 04:29 AM, Tsuyoshi Ozawa wrote:

Hi, I'm a core developer of Apache Hadoop and a contributor to Apache Tez.
I'd also be interested in working on Apache Dataflow as an individual.

Regards,
- Tsuyoshi

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Thursday, January 21, 2016 2:38 PM
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal

Hi,

great: I added you in the proposal.

Thanks !
Regards
JB

On 01/21/2016 12:24 AM, Prasanth Jayachandran wrote:

Hi Jean

I’d be interested in contributing as well.

Thanks
Prasanth Jayachandran


On Jan 20, 2016, at 5:20 PM, ksobkowiak  wrote:

It's great news that the project is going to move to Apache. I'd be
interested in contributing too

Regards
Krzysztof



--
View this message in context:
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-D
ataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org





-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org




-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Jean-Baptiste Onofré


Hi Mayank,

sure: you are in.

Thanks !
Regards
JB

On 01/22/2016 12:29 AM, Mayank Bansal wrote:

Hi Jean,

Nice Proposal.

I wanted to contribute to this project. Can you please add me too?

Thanks a lot for the help

Thanks,
Mayank

On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hey Alex,

awesome: I added you on the proposal.

Thanks,
Regards
JB


On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:

Hi,

it's great to see DataFlow becoming part of the Apache ecosystem,
thank you for bringing it in.
I would be happy to get involved and help.

--
Alex

On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré
<j...@nanthrax.net> wrote:

Perfect: done, you are on the proposal.

Thanks !
Regards
JB


On 01/21/2016 11:55 AM, chatz wrote:

Charitha Elvitigala

On 21 January 2016 at 16:17, Jean-Baptiste Onofré
<j...@nanthrax.net> wrote:

Hi Chatz,


sure, what name should I use on the proposal, Charitha ?

Regards
JB


On 01/21/2016 11:32 AM, chatz wrote:

Hi Jean,


I’d be interested in contributing as well.

Thanks,

Chatz


On 21 January 2016 at 14:22, Jean-Baptiste Onofré
<j...@nanthrax.net> wrote:

Sweet: you are on the proposal ;)


Thanks !
Regards
JB


On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:

This looks very interesting. I'm interested
in contributing.


Thanks.
-Gon

---
Byung-Gon Chun



Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-22 Thread Jean-Baptiste Onofré


It makes perfect sense, and it's something that we already discussed.

Thanks James and Marvin.

@James, yes, we are going to deal with that together, not a problem at 
all. I agree that renaming should happen now.

As discussed, we should be back with a new name early next week.

I'm happy to see the discussion now (and thanks again Marvin for details 
and always helpful messages): it's exactly the purpose of sending the 
discussion thread on the incubator mailing list.


Thanks guys !

Regards
JB

On 01/22/2016 02:19 AM, James Malone wrote:

Thank you for such a detailed response Marvin!

Everything you mention makes a lot of sense. Needless to say, we don't want
to squander cycles, break any rules, or throw velocity into disarray all
due to a name.

To that end, I am going to work with JB to amend the proposal with respect
to renaming. I'm also going to clarify a name change would be an
immediate-term to-do item so it does not block creation of lists,
repositories, and so on.

Best,

James

I am going to work with JB to amend the proposal to indicate

On Thu, Jan 21, 2016 at 9:30 AM, Marvin Humphrey 
wrote:


On Wed, Jan 20, 2016 at 3:30 PM, James Malone
 wrote:

If we need to rename, we would ideally choose a new name, change the
project name at that time, and start our refactoring with that new name.
Is it acceptable for us to flag a name change as something we need to do as
a near-term (1st month) item in incubation (if accepted)? If a rename is
required I'd like to add it to our to-do roadmap but also not block our
proposal on a renaming. I ask so we can address this concern in the best
way possible.


That's acceptable.  Project naming issues do not block entry into the
Incubator, they block graduation from the Incubator.

Because "dataflow" is descriptive, it will be hard to defend as
a trademark.  The Wikipedia article on trademark distinctiveness explains
things well:

 https://en.wikipedia.org/wiki/Trademark_distinctiveness

A weak mark both increases the amount of volunteer effort that goes
into dealing with infringement cases and makes bad outcomes more likely.
It is not an absolute requirement that Apache projects have defensible names,
but painful past experience has taught us that mishandled branding can deal
surprising amounts of damage to a project community.

But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache Dataflow" is
a blocker.  One or the other will have to be renamed, and since the software
is being donated but apparently not the brand, it sounds like renaming the
prospective Apache project will be required and you should add that task to
your roadmap.

Changing names in the middle of incubation is disruptive because it requires
renaming infrastructure resources, impacting both the Apache Infrastructure
team and also the podling's developer and user communities.  My suggestion
would be that immediately after the VOTE to enter incubation concludes, you
only create a dev mailing list and deal with the renaming immediately,
delaying the creation of other resources until after the renaming is resolved.
However, the exact plan is something you can work out with your Mentors.

Marvin Humphrey

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org






--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

RE: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Tsuyoshi Ozawa

Hi, I'm a core developer of Apache Hadoop and a contributor to Apache Tez.
I'd also be interested in working on Apache Dataflow as an individual.

Regards,
- Tsuyoshi

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] 
Sent: Thursday, January 21, 2016 2:38 PM
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal

Hi,

great: I added you in the proposal.

Thanks !
Regards
JB

On 01/21/2016 12:24 AM, Prasanth Jayachandran wrote:
> Hi Jean
>
> I’d be interested in contributing as well.
>
> Thanks
> Prasanth Jayachandran
>
>> On Jan 20, 2016, at 5:20 PM, ksobkowiak  wrote:
>>
>> It's great news that the project is going to move to Apache. I'd be
>> interested in contributing too
>>
>> Regards
>> Krzysztof
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-D
>> ataflow-Incubator-Proposal-tp47985p48025.html
>> Sent from the Apache Incubator - General mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>>
>
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org




-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread James Malone

Thank you for such a detailed response Marvin!

Everything you mention makes a lot of sense. Needless to say, we don't want
to squander cycles, break any rules, or throw velocity into disarray all
due to a name.

To that end, I am going to work with JB to amend the proposal with respect
to renaming. I'm also going to clarify a name change would be an
immediate-term to-do item so it does not block creation of lists,
repositories, and so on.

Best,

James

I am going to work with JB to amend the proposal to indicate

On Thu, Jan 21, 2016 at 9:30 AM, Marvin Humphrey 
wrote:

> On Wed, Jan 20, 2016 at 3:30 PM, James Malone
>  wrote:
> > If we need to rename, we would ideally choose a new name, change the
> > project name at that time, and start our refactoring with that new name.
> > Is it acceptable for us to flag a name change as something we need to do
> > as a near-term (1st month) item in incubation (if accepted)? If a rename is
> > required I'd like to add it to our to-do roadmap but also not block our
> > proposal on a renaming. I ask so we can address this concern in the best
> > way possible.
>
> That's acceptable.  Project naming issues do not block entry into the
> Incubator, they block graduation from the Incubator.
>
> Because "dataflow" is descriptive, it will be hard to defend as
> a trademark.  The Wikipedia article on trademark distinctiveness explains
> things well:
>
> https://en.wikipedia.org/wiki/Trademark_distinctiveness
>
> A weak mark both increases the amount of volunteer effort that goes
> into dealing with infringement cases and makes bad outcomes more likely.
> It is not an absolute requirement that Apache projects have defensible
> names, but painful past experience has taught us that mishandled branding
> can deal surprising amounts of damage to a project community.
>
> But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache
> Dataflow" is a blocker.  One or the other will have to be renamed, and
> since the software is being donated but apparently not the brand, it sounds
> like renaming the prospective Apache project will be required and you
> should add that task to your roadmap.
>
> Changing names in the middle of incubation is disruptive because it
> requires renaming infrastructure resources, impacting both the Apache
> Infrastructure team and also the podling's developer and user communities.
> My suggestion would be that immediately after the VOTE to enter incubation
> concludes, you only create a dev mailing list and deal with the renaming
> immediately, delaying the creation of other resources until after the
> renaming is resolved.
> However, the exact plan is something you can work out with your Mentors.
>
> Marvin Humphrey
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Supun Kamburugamuve

We are developing parallel machine learning algorithms for a research
project and are very interested in DataFlow. I would like to contribute to
this project as well. It will be great if you can add me.

Thanks,
Supun...


On Thu, Jan 21, 2016 at 6:29 PM, Mayank Bansal  wrote:

> Hi Jean,
>
> Nice Proposal.
>
> I wanted to contribute to this project. Can you please add me too?
>
> Thanks a lot for the help
>
> Thanks,
> Mayank
>
> On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hey Alex,
> >
> > awesome: I added you on the proposal.
> >
> > Thanks,
> > Regards
> > JB
> >
> >
> > On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:
> >
> >> Hi,
> >>
> >> it's great to see DataFlow becoming part of the Apache ecosystem, thank
> >> you for bringing it in.
> >> I would be happy to get involved and help.
> >>
> >> --
> >> Alex
> >>
> >> On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >> Perfect: done, you are on the proposal.
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 01/21/2016 11:55 AM, chatz wrote:
> >>>
> >>> Charitha Elvitigala
> 
>  On 21 January 2016 at 16:17, Jean-Baptiste Onofré 
>  wrote:
> 
>  Hi Chatz,
> 
> >
> > sure, what name should I use on the proposal, Charitha ?
> >
> > Regards
> > JB
> >
> >
> > On 01/21/2016 11:32 AM, chatz wrote:
> >
> > Hi Jean,
> >
> >>
> >> I’d be interested in contributing as well.
> >>
> >> Thanks,
> >>
> >> Chatz
> >>
> >>
> >> On 21 January 2016 at 14:22, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >> Sweet: you are on the proposal ;)
> >>
> >>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:
> >>>
> >>> This looks very interesting. I'm interested in contributing.
> >>>
> >>>
>  Thanks.
>  -Gon
> 
>  ---
>  Byung-Gon Chun
> 
> 
>  On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
>  jamesmal...@google.com.invalid> wrote:
> 
>  Hello everyone,
> 
> 
>  Attached to this message is a proposed new project - Apache
> > Dataflow, a
> > unified programming model for data processing and integration.
> >
> > The text of the proposal is included below. Additionally, the
> > proposal
> > is
> > in draft form on the wiki where we will make any required
> changes:
> >
> > https://wiki.apache.org/incubator/DataflowProposal
> >
> > We look forward to your feedback and input.
> >
> > Best,
> >
> > James
> >
> > 
> >
> > = Apache Dataflow =
> >
> > == Abstract ==
> >
> > Dataflow is an open source, unified model and set of
> > language-specific
> > SDKs
> > for defining and executing data processing workflows, and also
> data
> > ingestion and integration flows, supporting Enterprise
> Integration
> > Patterns
> > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
> > simplify
> > the mechanics of large-scale batch and streaming data processing
> > and
> > can
> > run on a number of runtimes like Apache Flink, Apache Spark, and
> > Google
> > Cloud Dataflow (a cloud service). Dataflow also brings DSL in
> > different
> > languages, allowing users to easily implement their data
> > integration
> > processes.
> >
> > == Proposal ==
> >
> > Dataflow is a simple, flexible, and powerful system for
> distributed
> > data
> > processing at any scale. Dataflow provides a unified programming
> > model, a
> > software development kit to define and construct data processing
> > pipelines,
> > and runners to execute Dataflow pipelines in several runtime
> > engines,
> > like
> > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow
> can
> > be
> > used
> > for a variety of streaming or batch data processing goals
> including
> > ETL,
> > stream analysis, and aggregate computation. The underlying
> > programming
> > model for Dataflow provides MapReduce-like parallelism, combined
> > with
> > support for powerful data windowing, and fine-grained correctness
> > control.
> >
> > == Background ==
> >
> > Dataflow started as a set of Google projects focused on making
> data
> > processing easier, faster, and less costly. The Dataflow model
> is a
> > successor to MapReduce, FlumeJava, and Millwheel inside Google
> and
> > is
> > focused on

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Mayank Bansal

Hi Jean,

Nice Proposal.

I wanted to contribute to this project. Can you please add me too?

Thanks a lot for the help

Thanks,
Mayank

On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré 
wrote:

> Hey Alex,
>
> awesome: I added you on the proposal.
>
> Thanks,
> Regards
> JB

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Marvin Humphrey

On Wed, Jan 20, 2016 at 3:30 PM, James Malone
 wrote:
> If we need to rename, we would ideally choose a new name, change the
> project name at that time, and start our refactoring with that new name. Is
> it acceptable for us to flag a name change as something we need to do as a
> near-term (1st month) item in incubation (if accepted)? If a rename is
> required I'd like to add it to our to-do roadmap but also not block our
> proposal on a renaming. I ask so we can address this concern in the best
> way possible.

That's acceptable.  Project naming issues do not block entry into the
Incubator, they block graduation from the Incubator.

Because "dataflow" is descriptive, it will be hard to defend as
a trademark.  The Wikipedia article on trademark distinctiveness explains
things well:

https://en.wikipedia.org/wiki/Trademark_distinctiveness

A weak mark both increases the amount of volunteer effort that goes
into dealing with infringement cases and makes bad outcomes more likely.
It is not an absolute requirement that Apache projects have defensible names,
but painful past experience has taught us that mishandled branding can deal
surprising amounts of damage to a project community.

But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache Dataflow" is
a blocker.  One or the other will have to be renamed, and since the software
is being donated but apparently not the brand, it sounds like renaming the
prospective Apache project will be required and you should add that task to
your roadmap.

Changing names in the middle of incubation is disruptive because it requires
renaming infrastructure resources, impacting both the Apache Infrastructure
team and also the podling's developer and user communities.  My suggestion
would be that immediately after the VOTE to enter incubation concludes, you
only create a dev mailing list and deal with the renaming immediately,
delaying the creation of other resources until after the renaming is resolved.
However, the exact plan is something you can work out with your Mentors.

Marvin Humphrey


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Alexander Bezzubov

Hi,

It's great to see DataFlow becoming part of the Apache ecosystem; thank you
for bringing it in.
I would be happy to get involved and help.

--
Alex

On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré 
wrote:

> Perfect: done, you are on the proposal.
>
> Thanks !
> Regards
> JB

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Jean-Baptiste Onofré


Hey Alex,

awesome: I added you on the proposal.

Thanks,
Regards
JB

On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:

Hi,

It's great to see DataFlow becoming part of the Apache ecosystem; thank you
for bringing it in.
I would be happy to get involved and help.

--
Alex


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Jean-Baptiste Onofré


Perfect: done, you are on the proposal.

Thanks !
Regards
JB

On 01/21/2016 11:55 AM, chatz wrote:

Charitha Elvitigala


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread chatz

Charitha Elvitigala

On 21 January 2016 at 16:17, Jean-Baptiste Onofré  wrote:

> Hi Chatz,
>
> sure, what name should I use on the proposal, Charitha ?
>
> Regards
> JB

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Jean-Baptiste Onofré


Hi Chatz,

sure, what name should I use on the proposal, Charitha ?

Regards
JB

On 01/21/2016 11:32 AM, chatz wrote:

Hi Jean,

I’d be interested in contributing as well.

Thanks,

Chatz


On 21 January 2016 at 14:22, Jean-Baptiste Onofré  wrote:


Sweet: you are on the proposal ;)

Thanks !
Regards
JB


On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:


This looks very interesting. I'm interested in contributing.

Thanks.
-Gon

---
Byung-Gon Chun


On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

Hello everyone,


Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, as well as data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also provides DSLs in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and MillWheel inside Google and is
focused on providing a unified solution for batch and stream processing.
The projects on which Dataflow is based have been described in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including Dataflow
runners for Apache Flink and Apache Spark. There is also a “direct runner”
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and is independent of the Google Cloud
Dataflow service. A Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenets. In the Dataflow model, you only need
to think about four top-level concepts when constructing your data
processing job:

* Pipelines - The data processing job made of a series of c

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread chatz

Hi Jean,

I’d be interested in contributing as well.

Thanks,

Chatz


On 21 January 2016 at 14:22, Jean-Baptiste Onofré  wrote:

> Sweet: you are on the proposal ;)
>
> Thanks !
> Regards
> JB
>
>
> On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:
>
>> This looks very interesting. I'm interested in contributing.
>>
>> Thanks.
>> -Gon
>>
>> ---
>> Byung-Gon Chun
>>
>>
>> On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
>> jamesmal...@google.com.invalid> wrote:
>>
>>> Hello everyone,
>>>
>>> Attached to this message is a proposed new project - Apache Dataflow, a
>>> unified programming model for data processing and integration.
>>>
>>> The text of the proposal is included below. Additionally, the proposal is
>>> in draft form on the wiki where we will make any required changes:
>>>
>>> https://wiki.apache.org/incubator/DataflowProposal
>>>
>>> We look forward to your feedback and input.
>>>
>>> Best,
>>>
>>> James
>>>
>>> 
>>>
>>> = Apache Dataflow =
>>>
>>> == Abstract ==
>>>
>>> Dataflow is an open source, unified model and set of language-specific SDKs
>>> for defining and executing data processing workflows, and also data
>>> ingestion and integration flows, supporting Enterprise Integration Patterns
>>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
>>> the mechanics of large-scale batch and streaming data processing and can
>>> run on a number of runtimes like Apache Flink, Apache Spark, and Google
>>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
>>> languages, allowing users to easily implement their data integration
>>> processes.
>>>
>>> == Proposal ==
>>>
>>> Dataflow is a simple, flexible, and powerful system for distributed data
>>> processing at any scale. Dataflow provides a unified programming model, a
>>> software development kit to define and construct data processing pipelines,
>>> and runners to execute Dataflow pipelines in several runtime engines, like
>>> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
>>> for a variety of streaming or batch data processing goals including ETL,
>>> stream analysis, and aggregate computation. The underlying programming
>>> model for Dataflow provides MapReduce-like parallelism, combined with
>>> support for powerful data windowing, and fine-grained correctness control.
>>>
>>> == Background ==
>>>
>>> Dataflow started as a set of Google projects focused on making data
>>> processing easier, faster, and less costly. The Dataflow model is a
>>> successor to MapReduce, FlumeJava, and Millwheel inside Google and is
>>> focused on providing a unified solution for batch and stream processing.
>>> These projects on which Dataflow is based have been published in several
>>> papers made available to the public:
>>>
>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>
>>> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>
>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>
>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>
>>> Dataflow was designed from the start to provide a portable programming
>>> layer. When you define a data processing pipeline with the Dataflow model,
>>> you are creating a job which is capable of being processed by any number of
>>> Dataflow processing engines. Several engines have been developed to run
>>> Dataflow pipelines in other open source runtimes, including a Dataflow
>>> runner for Apache Flink and Apache Spark. There is also a “direct runner”,
>>> for execution on the developer machine (mainly for dev/debug purposes).
>>> Another runner allows a Dataflow program to run on a managed service,
>>> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
>>> already available on GitHub, and independent from the Google Cloud Dataflow
>>> service. Another Python SDK is currently in active development.
>>>
>>> In this proposal, the Dataflow SDKs, model, and a set of runners will be
>>> submitted as an OSS project under the ASF. The runners which are a part of
>>> this proposal include those for Spark (from Cloudera), Flink (from data
>>> Artisans), and local development (from Google); the Google Cloud Dataflow
>>> service runner is not included in this proposal. Further references to
>>> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
>>> part of this proposal (Apache Dataflow) only. The initial submission will
>>> contain the already-released Java SDK; Google intends to submit the Python
>>> SDK later in the incubation process. The Google Cloud Dataflow service will
>>> continue to be one of many runners for Dataflow, built on Google Cloud
>>> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
>>> develop against the Apache project additions, updates, and changes. Google
>>> Cloud Dataflow will become one user of Apache Dataflow and will participate
>>> in the projec

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-21 Thread Jean-Baptiste Onofré


Sweet: you are on the proposal ;)

Thanks !
Regards
JB

On 01/21/2016 08:55 AM, Byung-Gon Chun wrote:

This looks very interesting. I'm interested in contributing.

Thanks.
-Gon

---
Byung-Gon Chun


On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.
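
As a rough illustration of the windowing idea mentioned in the paragraph
above, here is a minimal sketch in plain Python. It does not use the
Dataflow SDK. The function name fixed_window_sums, the (timestamp, key,
value) event layout, and the 60-second window size are all assumptions
made up for this example.

```python
from collections import defaultdict


def fixed_window_sums(events, window_seconds):
    """Sum values per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key, value) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    sums = defaultdict(int)
    for timestamp, key, value in events:
        # Assign the event to the window that contains its timestamp.
        window_start = (timestamp // window_seconds) * window_seconds
        sums[(window_start, key)] += value
    return dict(sums)


# Hypothetical sample data: two events fall in [0, 60), two in [60, 120).
events = [(0, "a", 1), (30, "a", 2), (61, "a", 5), (75, "b", 7)]
per_window = fixed_window_sums(events, 60)
```

The grouping step plays the role of a MapReduce-style shuffle, with the
window start acting as part of the key. A real engine would also have to
deal with late and out-of-order data, which this sketch ignores.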

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and Millwheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.
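
The portability idea in the paragraph above (define a pipeline once, then
hand it to different engines) could be sketched roughly as follows. This
is plain Python, not the Dataflow SDK: the Pipeline, SequentialRunner,
and ThreadedRunner classes are hypothetical names invented for this
example, and the steps are limited to simple element-wise functions.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Hypothetical pipeline: an ordered list of element-wise steps."""

    def __init__(self):
        self.steps = []

    def apply(self, fn):
        self.steps.append(fn)
        return self  # allow chaining


class SequentialRunner:
    """Runs every step in the calling thread (loosely like a local runner)."""

    def run(self, pipeline, data):
        for step in pipeline.steps:
            data = [step(item) for item in data]
        return data


class ThreadedRunner:
    """Runs each step across a thread pool; stands in for another engine."""

    def run(self, pipeline, data):
        with ThreadPoolExecutor(max_workers=4) as pool:
            for step in pipeline.steps:
                # Executor.map returns results in input order.
                data = list(pool.map(step, data))
        return data


# The same pipeline definition is executed by both runners.
pipeline = Pipeline().apply(str.strip).apply(str.upper)
records = [" spark ", "flink ", " direct"]
sequential = SequentialRunner().run(pipeline, records)
threaded = ThreadedRunner().run(pipeline, records)
assert sequential == threaded
```

The point of the sketch is only that the pipeline object carries no
knowledge of how it will be executed; the choice of runner is made
separately, at run time.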

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenets. In the Dataflow model, you only need
to think about four top-level concepts when constructing your data
processing job:

* Pipelines - The data processing job made of a series of computations
including input, processing, and output

* PCollections - Bounded (or unbounded) datasets which represent the input,
intermediate and output data in pipelines

* PTransforms - A data processing step in a pipeline in which one or more
PColle

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Byung-Gon Chun

This looks very interesting. I'm interested in contributing.

Thanks.
-Gon

---
Byung-Gon Chun


On Thu, Jan 21, 2016 at 1:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

> Hello everyone,
>
> Attached to this message is a proposed new project - Apache Dataflow, a
> unified programming model for data processing and integration.
>
> The text of the proposal is included below. Additionally, the proposal is
> in draft form on the wiki where we will make any required changes:
>
> https://wiki.apache.org/incubator/DataflowProposal
>
> We look forward to your feedback and input.
>
> Best,
>
> James
>
> 
>
> = Apache Dataflow =
>
> == Abstract ==
>
> Dataflow is an open source, unified model and set of language-specific SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> the mechanics of large-scale batch and streaming data processing and can
> run on a number of runtimes like Apache Flink, Apache Spark, and Google
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
> languages, allowing users to easily implement their data integration
> processes.
>
> == Proposal ==
>
> Dataflow is a simple, flexible, and powerful system for distributed data
> processing at any scale. Dataflow provides a unified programming model, a
> software development kit to define and construct data processing pipelines,
> and runners to execute Dataflow pipelines in several runtime engines, like
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
> for a variety of streaming or batch data processing goals including ETL,
> stream analysis, and aggregate computation. The underlying programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness control.
>
> == Background ==
>
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and Millwheel inside Google and is
> focused on providing a unified solution for batch and stream processing.
> These projects on which Dataflow is based have been published in several
> papers made available to the public:
>
> * MapReduce - http://research.google.com/archive/mapreduce.html
>
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>
> * MillWheel - http://research.google.com/pubs/pub41378.html
>
> Dataflow was designed from the start to provide a portable programming
> layer. When you define a data processing pipeline with the Dataflow model,
> you are creating a job which is capable of being processed by any number of
> Dataflow processing engines. Several engines have been developed to run
> Dataflow pipelines in other open source runtimes, including a Dataflow
> runner for Apache Flink and Apache Spark. There is also a “direct runner”,
> for execution on the developer machine (mainly for dev/debug purposes).
> Another runner allows a Dataflow program to run on a managed service,
> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
> already available on GitHub, and independent from the Google Cloud Dataflow
> service. Another Python SDK is currently in active development.
>
> In this proposal, the Dataflow SDKs, model, and a set of runners will be
> submitted as an OSS project under the ASF. The runners which are a part of
> this proposal include those for Spark (from Cloudera), Flink (from data
> Artisans), and local development (from Google); the Google Cloud Dataflow
> service runner is not included in this proposal. Further references to
> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
> part of this proposal (Apache Dataflow) only. The initial submission will
> contain the already-released Java SDK; Google intends to submit the Python
> SDK later in the incubation process. The Google Cloud Dataflow service will
> continue to be one of many runners for Dataflow, built on Google Cloud
> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
> develop against the Apache project additions, updates, and changes. Google
> Cloud Dataflow will become one user of Apache Dataflow and will participate
> in the project openly and publicly.
>
> The Dataflow programming model has been designed with simplicity,
> scalability, and speed as key tenets. In the Dataflow model, you only need
> to think about four top-level concepts when constructing your data
> processing job:
>
> * Pipelines - The data processing job made of a series of computations
> including input, processing, and output
>
> * PCollections - Bounded (or unbounded) datasets which represent the input,
> intermediate and output data in pipelines
>
> * PTransf

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


I added you on the proposal.

Thanks !
Regards
JB

On 01/21/2016 07:36 AM, Hao Chen wrote:

Nice proposal, exactly matches with what we wanna do in some projects,
interested to contribute.

Regards,
Hao

On Thu, Jan 21, 2016 at 12:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and Millwheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenets. In the Dataflow model, you only need
to think about four top-level concepts when constructing your data
processing job:

* Pipelines - The data processing job made of a series of computations
including input, processing, and output

* PCollections - Bounded (or unbounded) datasets which represent the input,
intermediate and output data in pipelines

* PTransforms - A data processing step in a pipeline in which one or more
P

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Great: done ;)

Regards
JB

On 01/21/2016 07:15 AM, Edward J. Yoon wrote:

Pls add me to "Additional Interested Contributors" section too. :-)

--
Best Regards, Edward J. Yoon


-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Thursday, January 21, 2016 2:39 PM
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal

Cool ! I added you on the proposal.

Regards
JB

On 01/21/2016 12:20 AM, ksobkowiak wrote:

It's a great news the project is going to move to Apache. I'd be interested
in contributing too

Regards
Krzysztof



--
View this message in context:
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com








--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Hao Chen

Nice proposal, exactly matches with what we wanna do in some projects,
interested to contribute.

Regards,
Hao

On Thu, Jan 21, 2016 at 12:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

> Hello everyone,
>
> Attached to this message is a proposed new project - Apache Dataflow, a
> unified programming model for data processing and integration.
>
> The text of the proposal is included below. Additionally, the proposal is
> in draft form on the wiki where we will make any required changes:
>
> https://wiki.apache.org/incubator/DataflowProposal
>
> We look forward to your feedback and input.
>
> Best,
>
> James
>
> 
>
> = Apache Dataflow =
>
> == Abstract ==
>
> Dataflow is an open source, unified model and set of language-specific SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> the mechanics of large-scale batch and streaming data processing and can
> run on a number of runtimes like Apache Flink, Apache Spark, and Google
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
> languages, allowing users to easily implement their data integration
> processes.
>
> == Proposal ==
>
> Dataflow is a simple, flexible, and powerful system for distributed data
> processing at any scale. Dataflow provides a unified programming model, a
> software development kit to define and construct data processing pipelines,
> and runners to execute Dataflow pipelines in several runtime engines, like
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
> for a variety of streaming or batch data processing goals including ETL,
> stream analysis, and aggregate computation. The underlying programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness control.
>
> == Background ==
>
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and Millwheel inside Google and is
> focused on providing a unified solution for batch and stream processing.
> These projects on which Dataflow is based have been published in several
> papers made available to the public:
>
> * MapReduce - http://research.google.com/archive/mapreduce.html
>
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>
> * MillWheel - http://research.google.com/pubs/pub41378.html
>
> Dataflow was designed from the start to provide a portable programming
> layer. When you define a data processing pipeline with the Dataflow model,
> you are creating a job which is capable of being processed by any number of
> Dataflow processing engines. Several engines have been developed to run
> Dataflow pipelines in other open source runtimes, including a Dataflow
> runner for Apache Flink and Apache Spark. There is also a “direct runner”,
> for execution on the developer machine (mainly for dev/debug purposes).
> Another runner allows a Dataflow program to run on a managed service,
> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
> already available on GitHub, and independent from the Google Cloud Dataflow
> service. Another Python SDK is currently in active development.
>
> In this proposal, the Dataflow SDKs, model, and a set of runners will be
> submitted as an OSS project under the ASF. The runners which are a part of
> this proposal include those for Spark (from Cloudera), Flink (from data
> Artisans), and local development (from Google); the Google Cloud Dataflow
> service runner is not included in this proposal. Further references to
> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
> part of this proposal (Apache Dataflow) only. The initial submission will
> contain the already-released Java SDK; Google intends to submit the Python
> SDK later in the incubation process. The Google Cloud Dataflow service will
> continue to be one of many runners for Dataflow, built on Google Cloud
> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
> develop against the Apache project additions, updates, and changes. Google
> Cloud Dataflow will become one user of Apache Dataflow and will participate
> in the project openly and publicly.
>
> The Dataflow programming model has been designed with simplicity,
> scalability, and speed as key tenets. In the Dataflow model, you only need
> to think about four top-level concepts when constructing your data
> processing job:
>
> * Pipelines - The data processing job made of a series of computations
> including input, processing, and output
>
> * PCollections - Bounded (or unbounded) datasets which represent the input,
> intermediate and output data in pipelin

RE: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Edward J. Yoon

Pls add me to "Additional Interested Contributors" section too. :-)

--
Best Regards, Edward J. Yoon


-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Thursday, January 21, 2016 2:39 PM
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal

Cool ! I added you on the proposal.

Regards
JB

On 01/21/2016 12:20 AM, ksobkowiak wrote:
> It's a great news the project is going to move to Apache. I'd be interested
> in contributing too
>
> Regards
> Krzysztof
>
>
>
> --
> View this message in context:
> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
> Sent from the Apache Incubator - General mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Cool ! I added you on the proposal.

Regards
JB

On 01/21/2016 12:20 AM, ksobkowiak wrote:

It's a great news the project is going to move to Apache. I'd be interested
in contributing too

Regards
Krzysztof



--
View this message in context: 
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi Hugo,

I added you on the proposal

Thanks
Regards
JB

On 01/21/2016 12:54 AM, Hugo Louro wrote:

Hello everyone,

Very compelling proposal; congrats! I would be interested in contributing
to this project from the beginning.

Looking forward to it.

Best,
Hugo

On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:


Hi Jean

I’d be interested in contributing as well.

Thanks
Prasanth Jayachandran


On Jan 20, 2016, at 5:20 PM, ksobkowiak  wrote:

It's a great news the project is going to move to Apache. I'd be interested
in contributing too

Regards
Krzysztof



--
View this message in context:
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.











--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Awesome: you are in the proposal ;)

Regards
JB

On 01/21/2016 12:55 AM, Johan Edstrom wrote:

Looking forward, also interested in contributing.



On Jan 20, 2016, at 4:54 PM, Hugo Louro  wrote:

Hello everyone,

Very compelling proposal; congrats! I would be interested in contributing
to this project from the beginning.

Looking forward to it.

Best,
Hugo

On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:


Hi Jean

I’d be interested in contributing as well.

Thanks
Prasanth Jayachandran


On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote:

It's great news that the project is going to move to Apache. I'd be interested
in contributing too

Regards
Krzysztof

--
View this message in context:
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.













--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi,

Great: I added you to the proposal.

Thanks !
Regards
JB

On 01/21/2016 12:24 AM, Prasanth Jayachandran wrote:

Hi Jean

I’d be interested in contributing as well.

Thanks
Prasanth Jayachandran


On Jan 20, 2016, at 5:20 PM, ksobkowiak  wrote:

It's great news that the project is going to move to Apache. I'd be interested
in contributing too

Regards
Krzysztof



--
View this message in context: 
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org








--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


RE: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Edward J. Yoon

Wow ..  great news!

--
Best Regards, Edward J. Yoon

-Original Message-
From: Johan Edstrom [mailto:seij...@gmail.com]
Sent: Thursday, January 21, 2016 8:56 AM
To: general@incubator.apache.org
Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal

Looking forward, also interested in contributing.


> On Jan 20, 2016, at 4:54 PM, Hugo Louro  wrote:
>
> Hello everyone,
>
> Very compelling proposal; congrats! I would be interested in contributing
> to this project from the beginning.
>
> Looking forward to it.
>
> Best,
> Hugo
>
> On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com> wrote:
>
>> Hi Jean
>>
>> I’d be interested in contributing as well.
>>
>> Thanks
>> Prasanth Jayachandran
>>
>>> On Jan 20, 2016, at 5:20 PM, ksobkowiak 
>> wrote:
>>>
>>> It's a great news the project is going to move to Apache. I'd be
>> interested
>>> in contributing too
>>>
>>> Regards
>>> Krzysztof
>>>
>>>
>>>
>>> --
>>> View this message in context:
>> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
>>> Sent from the Apache Incubator - General mailing list archive at
>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>>
>>>
>>
>>







Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Johan Edstrom

Looking forward, also interested in contributing.


> On Jan 20, 2016, at 4:54 PM, Hugo Louro  wrote:
> 
> Hello everyone,
> 
> Very compelling proposal; congrats! I would be interested in contributing
> to this project from the beginning.
> 
> Looking forward to it.
> 
> Best,
> Hugo
> 
> On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com> wrote:
> 
>> Hi Jean
>> 
>> I’d be interested in contributing as well.
>> 
>> Thanks
>> Prasanth Jayachandran
>> 
>>> On Jan 20, 2016, at 5:20 PM, ksobkowiak 
>> wrote:
>>> 
>>> It's a great news the project is going to move to Apache. I'd be
>> interested
>>> in contributing too
>>> 
>>> Regards
>>> Krzysztof
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
>>> Sent from the Apache Incubator - General mailing list archive at
>> Nabble.com.
>>> 
>>> -
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>> 
>>> 
>> 
>> 



Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Hugo Louro

Hello everyone,

Very compelling proposal; congrats! I would be interested in contributing
to this project from the beginning.

Looking forward to it.

Best,
Hugo

On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi Jean
>
> I’d be interested in contributing as well.
>
> Thanks
> Prasanth Jayachandran
>
> > On Jan 20, 2016, at 5:20 PM, ksobkowiak 
> wrote:
> >
> > It's a great news the project is going to move to Apache. I'd be
> interested
> > in contributing too
> >
> > Regards
> > Krzysztof
> >
> >
> >
> > --
> > View this message in context:
> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
> > Sent from the Apache Incubator - General mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread James Malone

>
>
> I don't see anything in the proposal about Google ceasing the use of the
> brand "Google Cloud Dataflow".  Yet the co-existence of "Google Cloud
> Dataflow" and "Apache Dataflow" would conflict with Apache requirements for
> vendor neutrality and project independence.
>
> The issue seems similar to the recent proposal to incubate "Apache
> OpenMiracl" while allowing the "Miracl" company to continue distribution of
> the "Miracl" project.  That situation was resolved by renaming the Apache
> project to "Milagro", allowing the Miracl company to continue benefitting
> from the brand they had invested in so heavily.
>

Apologies for my delay in responding to the feedback about naming!

We anticipated there may be some concerns about the naming. The project
members also want to confront those concerns head-on so any issues related
to naming don't take away from the technical merit of the proposal. We're
open to coming up with a new name and renaming the proposed project if it's
prudent or required. To that end, I have a question about the order of
operations.

If we need to rename, we would ideally choose a new name, change the
project name at that time, and start our refactoring with that new name. Is
it acceptable for us to flag a name change as something we need to do as a
near-term (1st month) item in incubation (if accepted)? If a rename is
required I'd like to add it to our to-do roadmap but also not block our
proposal on a renaming. I ask so we can address this concern in the best
way possible.



> http://markmail.org/message/tpiphl55rcyezcvd
>
> Marvin Humphrey
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Prasanth Jayachandran

Hi Jean

I’d be interested in contributing as well.

Thanks
Prasanth Jayachandran

> On Jan 20, 2016, at 5:20 PM, ksobkowiak  wrote:
> 
> It's great news that the project is going to move to Apache. I'd be interested
> in contributing too
> 
> Regards
> Krzysztof
> 
> 
> 
> --
> View this message in context: 
> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
> Sent from the Apache Incubator - General mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
> 
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread ksobkowiak

It's great news that the project is going to move to Apache. I'd be interested
in contributing too

Regards
Krzysztof



--
View this message in context: 
http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Cool: you are on the proposal

Regards
JB

On 01/20/2016 09:52 PM, Vaibhav Gumashta wrote:

Hi Jean,

I'd like to contribute as well.

Thanks,
-Vaibhav

On 1/20/16, 11:19 AM, "Jean-Baptiste Onofré"  wrote:


Hey James,

you are on the proposal ;)

Thanks !
Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy
to contribute.


On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
wrote:


Hi Sean,

It's a fair point, but not present in most of the proposals. It's
something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.


== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and Millwheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Great: you are on the proposal.

Regards
JB

On 01/20/2016 09:00 PM, Joe Witt wrote:

Hello

This is a very interesting proposal and concept.  I'd like to contribute.

Thanks
Joe

On Wed, Jan 20, 2016 at 2:50 PM, James Carman
 wrote:

Of course! I'd be happy to help

On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré 
wrote:


Hi James,

Can I add you to the proposal?

Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy
to contribute.


On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
wrote:


Hi Sean,

It's a fair point, but not present in most of the proposals. It's
something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.


== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and Millwheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SD

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Vaibhav Gumashta

Hi Jean,

I'd like to contribute as well.

Thanks,
-Vaibhav

On 1/20/16, 11:19 AM, "Jean-Baptiste Onofré"  wrote:

>Hey James,
>
>you are on the proposal ;)
>
>Thanks !
>Regards
>JB
>
>On 01/20/2016 07:20 PM, James Carman wrote:
>> Well, I for one would be very interested in this project and would be
>>happy
>> to contribute.
>>
>>
>> On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Sean,
>>>
>>> It's a fair point, but not present in most of the proposals. It's
>>> something that we can address in the "Community" section.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/20/2016 05:55 PM, Sean Busbey wrote:
>>>> Great proposal. I like that your proposal includes a well presented
>>>> roadmap, but I don't see any goals that directly address building a
>>>> larger community. Y'all have any ideas around outreach that will help with
>>>> adoption?
>>>>
>>>> As a start, I recommend y'all add a section to the proposal on the wiki
>>>> page for "Additional Interested Contributors" so that folks who want to
>>>> sign up to participate in the project can do so without requesting
>>>> additions to the initial committer list.
>>>>
>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>>>> jamesmal...@google.com.invalid> wrote:

> Hello everyone,
>
> Attached to this message is a proposed new project - Apache
>Dataflow, a
> unified programming model for data processing and integration.
>
> The text of the proposal is included below. Additionally, the
>proposal
>>> is
> in draft form on the wiki where we will make any required changes:
>
> https://wiki.apache.org/incubator/DataflowProposal
>
> We look forward to your feedback and input.
>
> Best,
>
> James
>
> 
>
> = Apache Dataflow =
>
> == Abstract ==
>
> Dataflow is an open source, unified model and set of
>language-specific
>>> SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration
>>> Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>simplify
> the mechanics of large-scale batch and streaming data processing and
>can
> run on a number of runtimes like Apache Flink, Apache Spark, and
>Google
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in
>different
> languages, allowing users to easily implement their data integration
> processes.
>
> == Proposal ==
>
> Dataflow is a simple, flexible, and powerful system for distributed
>data
> processing at any scale. Dataflow provides a unified programming
>model,
>>> a
> software development kit to define and construct data processing
>>> pipelines,
> and runners to execute Dataflow pipelines in several runtime engines,
>>> like
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be
>>> used
> for a variety of streaming or batch data processing goals including
>ETL,
> stream analysis, and aggregate computation. The underlying
>programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness
>>> control.
>
> == Background ==
>
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and Millwheel inside Google and is
> focused on providing a unified solution for batch and stream
>processing.
> These projects on which Dataflow is based have been published in
>several
> papers made available to the public:
>
> * MapReduce - http://research.google.com/archive/mapreduce.html
>
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>
> * MillWheel - http://research.google.com/pubs/pub41378.html
>
> Dataflow was designed from the start to provide a portable
>programming
> layer. When you define a data processing pipeline with the Dataflow
>>> model,
> you are creating a job which is capable of being processed by any
>>> number of
> Dataflow processing engines. Several engines have been developed to
>run
> Dataflow pipelines in other open source runtimes, including a
>Dataflow
> runner for Apache Flink and Apache Spark. There is also a “direct
>>> runner”,
> for execution on the developer machine (mainly for dev/debug
>purposes).
> Another runner allows a Dataflow program to run on a managed service,
> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java
>SDK
>>> is
> already available on GitHub, and independent from the Google Cloud
>>> Dataflow
> service. Another Python SDK is currently in active development.

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread David Jencks


> On Jan 20, 2016, at 8:32 AM, James Malone  
> wrote:
> 
> The Dataflow programming model has been designed with simplicity,
> scalability, and speed as key tenants.

s/tenants/tenets?

david jencks

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Joe Witt

Hello

This is a very interesting proposal and concept.  I'd like to contribute.

Thanks
Joe

On Wed, Jan 20, 2016 at 2:50 PM, James Carman
 wrote:
> Of course! I'd be happy to help
>
> On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi James,
>>
>> Can I add you to the proposal?
>>
>> Regards
>> JB
>>
>> On 01/20/2016 07:20 PM, James Carman wrote:
>> > Well, I for one would be very interested in this project and would be
>> happy
>> > to contribute.
>> >
>> >
>> > On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
>> > wrote:
>> >
>> >> Hi Sean,
>> >>
>> >> It's a fair point, but not present in most of the proposals. It's
>> >> something that we can address in the "Community" section.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On 01/20/2016 05:55 PM, Sean Busbey wrote:
>> >>> Great proposal. I like that your proposal includes a well presented
>> >>> roadmap, but I don't see any goals that directly address building a
>> >> larger
>> >>> community. Y'all have any ideas around outreach that will help with
>> >>> adoption?
>> >>>
>> >>> As a start, I recommend y'all add a section to the proposal on the wiki
>> >>> page for "Additional Interested Contributors" so that folks who want to
>> >>> sign up to participate in the project can do so without requesting
>> >>> additions to the initial committer list.
>> >>>
>> >>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>> >>> jamesmal...@google.com.invalid> wrote:
>> >>>
>>  Hello everyone,
>> 
>>  Attached to this message is a proposed new project - Apache Dataflow,
>> a
>>  unified programming model for data processing and integration.
>> 
>>  The text of the proposal is included below. Additionally, the proposal
>> >> is
>>  in draft form on the wiki where we will make any required changes:
>> 
>>  https://wiki.apache.org/incubator/DataflowProposal
>> 
>>  We look forward to your feedback and input.
>> 
>>  Best,
>> 
>>  James
>> 
>>  
>> 
>>  = Apache Dataflow =
>> 
>>  == Abstract ==
>> 
>>  Dataflow is an open source, unified model and set of language-specific
>> >> SDKs
>>  for defining and executing data processing workflows, and also data
>>  ingestion and integration flows, supporting Enterprise Integration
>> >> Patterns
>>  (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
>> simplify
>>  the mechanics of large-scale batch and streaming data processing and
>> can
>>  run on a number of runtimes like Apache Flink, Apache Spark, and
>> Google
>>  Cloud Dataflow (a cloud service). Dataflow also brings DSL in
>> different
>>  languages, allowing users to easily implement their data integration
>>  processes.
>> 
>>  == Proposal ==
>> 
>>  Dataflow is a simple, flexible, and powerful system for distributed
>> data
>>  processing at any scale. Dataflow provides a unified programming
>> model,
>> >> a
>>  software development kit to define and construct data processing
>> >> pipelines,
>>  and runners to execute Dataflow pipelines in several runtime engines,
>> >> like
>>  Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be
>> >> used
>>  for a variety of streaming or batch data processing goals including
>> ETL,
>>  stream analysis, and aggregate computation. The underlying programming
>>  model for Dataflow provides MapReduce-like parallelism, combined with
>>  support for powerful data windowing, and fine-grained correctness
>> >> control.
>> 
>>  == Background ==
>> 
>>  Dataflow started as a set of Google projects focused on making data
>>  processing easier, faster, and less costly. The Dataflow model is a
>>  successor to MapReduce, FlumeJava, and Millwheel inside Google and is
>>  focused on providing a unified solution for batch and stream
>> processing.
>>  These projects on which Dataflow is based have been published in
>> several
>>  papers made available to the public:
>> 
>>  * MapReduce - http://research.google.com/archive/mapreduce.html
>> 
>>  * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>> 
>>  * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>> 
>>  * MillWheel - http://research.google.com/pubs/pub41378.html
>> 
>>  Dataflow was designed from the start to provide a portable programming
>>  layer. When you define a data processing pipeline with the Dataflow
>> >> model,
>>  you are creating a job which is capable of being processed by any
>> >> number of
>>  Dataflow processing engines. Several engines have been developed to
>> run
>>  Dataflow pipelines in other open source runtimes, including a Dataflow
>>  runner for Apache Flink and Apache Spark. There is also a “direct
>> >> runner”,
>>  for execution on the developer machine (mainly for dev/debug
>> purposes).
>>  Anothe

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread James Carman

Of course! I'd be happy to help

On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré 
wrote:

> Hi James,
>
> Can I add you to the proposal?
>
> Regards
> JB
>
> On 01/20/2016 07:20 PM, James Carman wrote:
> > Well, I for one would be very interested in this project and would be
> happy
> > to contribute.
> >
> >
> > On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
> > wrote:
> >
> >> Hi Sean,
> >>
> >> It's a fair point, but not present in most of the proposals. It's
> >> something that we can address in the "Community" section.
> >>
> >> Regards
> >> JB
> >>
> >> On 01/20/2016 05:55 PM, Sean Busbey wrote:
> >>> Great proposal. I like that your proposal includes a well presented
> >>> roadmap, but I don't see any goals that directly address building a
> >> larger
> >>> community. Y'all have any ideas around outreach that will help with
> >>> adoption?
> >>>
> >>> As a start, I recommend y'all add a section to the proposal on the wiki
> >>> page for "Additional Interested Contributors" so that folks who want to
> >>> sign up to participate in the project can do so without requesting
> >>> additions to the initial committer list.

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Liang Chen

Good proposal. I look forward to participating in and contributing to this project.





Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


You are on the proposal ;)

Thanks !
Regards
JB

On 01/20/2016 08:04 PM, P. Taylor Goetz wrote:

Nice proposal.

I’d be interested in contributing as well. I’m about at my mentor limit with 
projects, but I’d be willing to contribute in other/similar ways.

-Taylor


On Jan 20, 2016, at 12:46 PM, Jean-Baptiste Onofré  wrote:

Great, I'll add you to the initial committer list then ;)

I quickly discussed this with James; we're going to create a section for
additional people, as proposed by Sean.

Thanks !
Regards
JB

On 01/20/2016 06:33 PM, Debo Dutta (dedutta) wrote:

Hi JB

Would love to join now.

regards
debo

On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré"  wrote:


Hi Debo,

Awesome: do you want to join now (on the initial committer list) or
once we are in incubation?

Let me know, I can update the proposal.

Regards
JB

On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote:

+1

Proposal looks good. Also a small section on relationships with Apache
Storm and Apache Samza would be great.

I would like to sign up, to help/contribute.

debo

On 1/20/16, 8:55 AM, "Sean Busbey"  wrote:


Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hey James,

you are on the proposal ;)

Thanks !
Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy
to contribute.


On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
wrote:


Hi Sean,

It's a fair point, but not present in most of the proposals. It's
something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread P. Taylor Goetz

Nice proposal.

I’d be interested in contributing as well. I’m about at my mentor limit with 
projects, but I’d be willing to contribute in other/similar ways.

-Taylor

> On Jan 20, 2016, at 12:46 PM, Jean-Baptiste Onofré  wrote:
> 
> Great, I'll add you to the initial committer list then ;)
> 
> I quickly discussed this with James; we're going to create a section for
> additional people, as proposed by Sean.
> 
> Thanks !
> Regards
> JB
> 
> On 01/20/2016 06:33 PM, Debo Dutta (dedutta) wrote:
>> Hi JB
>> 
>> Would love to join now.
>> 
>> regards
>> debo
>> 
>> On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré"  wrote:
>> 
>>> Hi Debo,
>>> 
>>> Awesome: do you want to join now (on the initial committer list) or
>>> once we are in incubation?
>>> 
>>> Let me know, I can update the proposal.
>>> 
>>> Regards
>>> JB
>>> 
>>> On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote:
>>>> +1
>>>> 
>>>> Proposal looks good. Also a small section on relationships with Apache
>>>> Storm and Apache Samza would be great.
>>>> 
>>>> I would like to sign up, to help/contribute.
>>>> 
>>>> debo
>>>> 
>>>> On 1/20/16, 8:55 AM, "Sean Busbey"  wrote:
>>>> 
>>>>> Great proposal. I like that your proposal includes a well presented
>>>>> roadmap, but I don't see any goals that directly address building a larger
>>>>> community. Y'all have any ideas around outreach that will help with
>>>>> adoption?
>>>>> 
>>>>> As a start, I recommend y'all add a section to the proposal on the wiki
>>>>> page for "Additional Interested Contributors" so that folks who want to
>>>>> sign up to participate in the project can do so without requesting
>>>>> additions to the initial committer list.

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi James,

Can I add you to the proposal?

Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy
to contribute.


On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
wrote:


Hi Sean,

It's a fair point, but not present in most of the proposals. It's
something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread James Carman

Well, I for one would be very interested in this project and would be happy
to contribute.


On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré 
wrote:

> Hi Sean,
>
> It's a fair point, but not present in most of the proposals. It's
> something that we can address in the "Community" section.
>
> Regards
> JB
>
> On 01/20/2016 05:55 PM, Sean Busbey wrote:
> > Great proposal. I like that your proposal includes a well presented
> > roadmap, but I don't see any goals that directly address building a larger
> > community. Y'all have any ideas around outreach that will help with
> > adoption?
> >
> > As a start, I recommend y'all add a section to the proposal on the wiki
> > page for "Additional Interested Contributors" so that folks who want to
> > sign up to participate in the project can do so without requesting
> > additions to the initial committer list.

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Henry Saputra

This is a great proposal. I've been working with Apache Flink for a while, so
I'd love to help with this project.

- Henry



On Wed, Jan 20, 2016 at 8:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

> Hello everyone,
>
> Attached to this message is a proposed new project - Apache Dataflow, a
> unified programming model for data processing and integration.
>
> The text of the proposal is included below. Additionally, the proposal is
> in draft form on the wiki where we will make any required changes:
>
> https://wiki.apache.org/incubator/DataflowProposal
>
> We look forward to your feedback and input.
>
> Best,
>
> James
>
> 
>
> = Apache Dataflow =
>
> == Abstract ==
>
> Dataflow is an open source, unified model and set of language-specific SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> the mechanics of large-scale batch and streaming data processing and can
> run on a number of runtimes like Apache Flink, Apache Spark, and Google
> Cloud Dataflow (a cloud service). Dataflow also provides DSLs in different
> languages, allowing users to easily implement their data integration
> processes.
>
> == Proposal ==
>
> Dataflow is a simple, flexible, and powerful system for distributed data
> processing at any scale. Dataflow provides a unified programming model, a
> software development kit to define and construct data processing pipelines,
> and runners to execute Dataflow pipelines in several runtime engines, like
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
> for a variety of streaming or batch data processing goals including ETL,
> stream analysis, and aggregate computation. The underlying programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness control.
>
> == Background ==
>
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
> focused on providing a unified solution for batch and stream processing.
> These projects on which Dataflow is based have been published in several
> papers made available to the public:
>
> * MapReduce - http://research.google.com/archive/mapreduce.html
>
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>
> * MillWheel - http://research.google.com/pubs/pub41378.html
>
> Dataflow was designed from the start to provide a portable programming
> layer. When you define a data processing pipeline with the Dataflow model,
> you are creating a job which is capable of being processed by any number of
> Dataflow processing engines. Several engines have been developed to run
> Dataflow pipelines in other open source runtimes, including Dataflow
> runners for Apache Flink and Apache Spark. There is also a “direct runner”,
> for execution on the developer machine (mainly for dev/debug purposes).
> Another runner allows a Dataflow program to run on a managed service,
> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
> already available on GitHub, and independent from the Google Cloud Dataflow
> service. A Python SDK is also currently in active development.
>
> In this proposal, the Dataflow SDKs, model, and a set of runners will be
> submitted as an OSS project under the ASF. The runners which are a part of
> this proposal include those for Spark (from Cloudera), Flink (from data
> Artisans), and local development (from Google); the Google Cloud Dataflow
> service runner is not included in this proposal. Further references to
> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
> part of this proposal (Apache Dataflow) only. The initial submission will
> contain the already-released Java SDK; Google intends to submit the Python
> SDK later in the incubation process. The Google Cloud Dataflow service will
> continue to be one of many runners for Dataflow, built on Google Cloud
> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
> develop against the Apache project additions, updates, and changes. Google
> Cloud Dataflow will become one user of Apache Dataflow and will participate
> in the project openly and publicly.
>
> The Dataflow programming model has been designed with simplicity,
> scalability, and speed as key tenets. In the Dataflow model, you only need
> to think about four top-level concepts when constructing your data
> processing job:
>
> * Pipelines - The data processing job made of a series of computations
> including input, processing, and output
>
> * PCollections - Bounded (or unbounded) datasets which represent the input,
> intermediate and output data in pipeli

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré

As suggested, I added an "Additional Interested Contributors" section and
have already added Debo.


Thanks !
Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and MillWheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenant

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré

Yes, you are right. We also know that other companies use the "dataflow"
wording (that's the case at Hortonworks, for instance).


We're going to start a thread to propose alternative names.

Regards
JB

On 01/20/2016 06:41 PM, Gregory Chase wrote:

This is also a similarly named Spring project:
http://cloud.spring.io/spring-cloud-dataflow/

On Wed, Jan 20, 2016 at 9:40 AM, Marvin Humphrey 
wrote:


On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré 
wrote:


We're proposing Apache Dataflow naming because Google Cloud Dataflow is an
already known name and "brand".


I don't see anything in the proposal about Google ceasing the use of the brand
"Google Cloud Dataflow".  Yet the co-existence of "Google Cloud Dataflow" and
"Apache Dataflow" would conflict with Apache requirements for vendor
neutrality and project independence.

The issue seems similar to the recent proposal to incubate "Apache OpenMiracl"
while allowing the "Miracl" company to continue distribution of the "Miracl"
project.  That situation was resolved by renaming the Apache project to
"Milagro", allowing the Miracl company to continue benefitting from the brand
they had invested in so heavily.

 http://markmail.org/message/tpiphl55rcyezcvd

Marvin Humphrey








--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Great, I'll add you to the initial committer list then ;)

I quickly discussed this with James; we're going to create a section for
additional people, as proposed by Sean.


Thanks !
Regards
JB

On 01/20/2016 06:33 PM, Debo Dutta (dedutta) wrote:

Hi JB

Would love to join now.

regards
debo

On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré"  wrote:


Hi Debo,

Awesome: do you want to join now (in the initial committer list) or
once we are in incubation?

Let me know, I can update the proposal.

Regards
JB

On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote:

+1

Proposal looks good. Also a small section on relationships with Apache
Storm and Apache Samza would be great.

I would like to sign up, to help/contribute.

debo

On 1/20/16, 8:55 AM, "Sean Busbey"  wrote:


Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a
larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré

Good point, Marvin. The Google Dataflow SDK will "disappear" in favor of
Apache Dataflow, but you are right, the complete Dataflow "brand" will stay
at Google. Let me double-check with James and the team.


Regards
JB

On 01/20/2016 06:40 PM, Marvin Humphrey wrote:

On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré  wrote:


We're proposing Apache Dataflow naming because Google Cloud Dataflow is an
already known name and "brand".


I don't see anything in the proposal about Google ceasing the use of the brand
"Google Cloud Dataflow".  Yet the co-existence of "Google Cloud Dataflow" and
"Apache Dataflow" would conflict with Apache requirements for vendor
neutrality and project independence.

The issue seems similar to the recent proposal to incubate "Apache OpenMiracl"
while allowing the "Miracl" company to continue distribution of the "Miracl"
project.  That situation was resolved by renaming the Apache project to
"Milagro", allowing the Miracl company to continue benefitting from the brand
they had invested in so heavily.

 http://markmail.org/message/tpiphl55rcyezcvd

Marvin Humphrey




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Gregory Chase

This is also a similarly named Spring project:
http://cloud.spring.io/spring-cloud-dataflow/

On Wed, Jan 20, 2016 at 9:40 AM, Marvin Humphrey 
wrote:

> On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré 
> wrote:
>
> > We're proposing Apache Dataflow naming because Google Cloud Dataflow is an
> > already known name and "brand".
>
> I don't see anything in the proposal about Google ceasing the use of the brand
> "Google Cloud Dataflow".  Yet the co-existence of "Google Cloud Dataflow" and
> "Apache Dataflow" would conflict with Apache requirements for vendor
> neutrality and project independence.
>
> The issue seems similar to the recent proposal to incubate "Apache OpenMiracl"
> while allowing the "Miracl" company to continue distribution of the "Miracl"
> project.  That situation was resolved by renaming the Apache project to
> "Milagro", allowing the Miracl company to continue benefitting from the brand
> they had invested in so heavily.
>
> http://markmail.org/message/tpiphl55rcyezcvd
>
> Marvin Humphrey
>
>
>


-- 
Greg Chase

Director of Big Data Communities
http://www.pivotal.io/big-data

Pivotal Software
http://www.pivotal.io/

650-215-0477
@GregChase
Blog: http://geekmarketing.biz/

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Marvin Humphrey

On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré  wrote:

> We're proposing Apache Dataflow naming because Google Cloud Dataflow is an
> already known name and "brand".

I don't see anything in the proposal about Google ceasing the use of the brand
"Google Cloud Dataflow".  Yet the co-existence of "Google Cloud Dataflow" and
"Apache Dataflow" would conflict with Apache requirements for vendor
neutrality and project independence.

The issue seems similar to the recent proposal to incubate "Apache OpenMiracl"
while allowing the "Miracl" company to continue distribution of the "Miracl"
project.  That situation was resolved by renaming the Apache project to
"Milagro", allowing the Miracl company to continue benefitting from the brand
they had invested in so heavily.

http://markmail.org/message/tpiphl55rcyezcvd

Marvin Humphrey


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread James Malone

> Great proposal. I like that your proposal includes a well presented
> roadmap, but I don't see any goals that directly address building a larger
> community. Y'all have any ideas around outreach that will help with
> adoption?
>

Thank you and fair point. We have a few additional ideas which we can put
into the Community section.


>
> As a start, I recommend y'all add a section to the proposal on the wiki
> page for "Additional Interested Contributors" so that folks who want to
> sign up to participate in the project can do so without requesting
> additions to the initial committer list.
>
>
This is a great idea and I think it makes a lot of sense to add an "Additional
Interested Contributors" section to the proposal.


> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
> jamesmal...@google.com.invalid> wrote:
>
> > Hello everyone,
> >
> > Attached to this message is a proposed new project - Apache Dataflow, a
> > unified programming model for data processing and integration.
> >
> > The text of the proposal is included below. Additionally, the proposal is
> > in draft form on the wiki where we will make any required changes:
> >
> > https://wiki.apache.org/incubator/DataflowProposal
> >
> > We look forward to your feedback and input.
> >
> > Best,
> >
> > James
> >
> > 
> >
> > = Apache Dataflow =
> >
> > == Abstract ==
> >
> > Dataflow is an open source, unified model and set of language-specific
> SDKs
> > for defining and executing data processing workflows, and also data
> > ingestion and integration flows, supporting Enterprise Integration
> Patterns
> > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> > the mechanics of large-scale batch and streaming data processing and can
> > run on a number of runtimes like Apache Flink, Apache Spark, and Google
> > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
> > languages, allowing users to easily implement their data integration
> > processes.
> >
> > == Proposal ==
> >
> > Dataflow is a simple, flexible, and powerful system for distributed data
> > processing at any scale. Dataflow provides a unified programming model, a
> > software development kit to define and construct data processing
> pipelines,
> > and runners to execute Dataflow pipelines in several runtime engines,
> like
> > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be
> used
> > for a variety of streaming or batch data processing goals including ETL,
> > stream analysis, and aggregate computation. The underlying programming
> > model for Dataflow provides MapReduce-like parallelism, combined with
> > support for powerful data windowing, and fine-grained correctness
> control.
> >
> > == Background ==
> >
> > Dataflow started as a set of Google projects focused on making data
> > processing easier, faster, and less costly. The Dataflow model is a
> > successor to MapReduce, FlumeJava, and MillWheel inside Google and is
> > focused on providing a unified solution for batch and stream processing.
> > These projects on which Dataflow is based have been published in several
> > papers made available to the public:
> >
> > * MapReduce - http://research.google.com/archive/mapreduce.html
> >
> > * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> >
> > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
> >
> > * MillWheel - http://research.google.com/pubs/pub41378.html
> >
> > Dataflow was designed from the start to provide a portable programming
> > layer. When you define a data processing pipeline with the Dataflow
> model,
> > you are creating a job which is capable of being processed by any number
> of
> > Dataflow processing engines. Several engines have been developed to run
> > Dataflow pipelines in other open source runtimes, including a Dataflow
> > runner for Apache Flink and Apache Spark. There is also a “direct
> runner”,
> > for execution on the developer machine (mainly for dev/debug purposes).
> > Another runner allows a Dataflow program to run on a managed service,
> > Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
> > already available on GitHub, and independent from the Google Cloud
> Dataflow
> > service. Another Python SDK is currently in active development.
> >
> > In this proposal, the Dataflow SDKs, model, and a set of runners will be
> > submitted as an OSS project under the ASF. The runners which are a part
> of
> > this proposal include those for Spark (from Cloudera), Flink (from data
> > Artisans), and local development (from Google); the Google Cloud Dataflow
> > service runner is not included in this proposal. Further references to
> > Dataflow will refer to the Dataflow model, SDKs, and runners which are a
> > part of this proposal (Apache Dataflow) only. The initial submission will
> > contain the already-released Java SDK; Google intends to submit the
> Python
> > SDK later in the incubation process. The Google Cloud Dataflow service
>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Debo Dutta (dedutta)

Hi JB

Would love to join now.

regards
debo

On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré"  wrote:

>Hi Debo,
>
>Awesome: do you want to join now (in the initial committer list) or
>once we are in incubation?
>
>Let me know, I can update the proposal.
>
>Regards
>JB
>
>On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote:
>> +1
>>
>> Proposal looks good. Also a small section on relationships with Apache
>> Storm and Apache Samza would be great.
>>
>> I would like to sign up, to help/contribute.
>>
>> debo
>>
>> On 1/20/16, 8:55 AM, "Sean Busbey"  wrote:
>>
>>> Great proposal. I like that your proposal includes a well presented
>>> roadmap, but I don't see any goals that directly address building a
>>>larger
>>> community. Y'all have any ideas around outreach that will help with
>>> adoption?
>>>
>>> As a start, I recommend y'all add a section to the proposal on the wiki
>>> page for "Additional Interested Contributors" so that folks who want to
>>> sign up to participate in the project can do so without requesting
>>> additions to the initial committer list.
>>>

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi Debo,

Awesome: do you want to join now (in the initial committer list) or
once we are in incubation?


Let me know, I can update the proposal.

Regards
JB

On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote:

+1

Proposal looks good. Also a small section on relationships with Apache
Storm and Apache Samza would be great.

I would like to sign up, to help/contribute.

debo

On 1/20/16, 8:55 AM, "Sean Busbey"  wrote:


Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Debo Dutta (dedutta)

+1

Proposal looks good. Also a small section on relationships with Apache
Storm and Apache Samza would be great.

I would like to sign up, to help/contribute.

debo

On 1/20/16, 8:55 AM, "Sean Busbey"  wrote:

>Great proposal. I like that your proposal includes a well presented
>roadmap, but I don't see any goals that directly address building a larger
>community. Y'all have any ideas around outreach that will help with
>adoption?
>
>As a start, I recommend y'all add a section to the proposal on the wiki
>page for "Additional Interested Contributors" so that folks who want to
>sign up to participate in the project can do so without requesting
>additions to the initial committer list.
>
>On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>jamesmal...@google.com.invalid> wrote:
>
>> Hello everyone,
>>
>> Attached to this message is a proposed new project - Apache Dataflow, a
>> unified programming model for data processing and integration.
>>
>> The text of the proposal is included below. Additionally, the proposal is
>> in draft form on the wiki where we will make any required changes:
>>
>> https://wiki.apache.org/incubator/DataflowProposal
>>
>> We look forward to your feedback and input.
>>
>> Best,
>>
>> James
>>
>> 
>>
>> = Apache Dataflow =
>>
>> == Abstract ==
>>
>> Dataflow is an open source, unified model and set of language-specific SDKs
>> for defining and executing data processing workflows, and also data
>> ingestion and integration flows, supporting Enterprise Integration Patterns
>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
>> the mechanics of large-scale batch and streaming data processing and can
>> run on a number of runtimes like Apache Flink, Apache Spark, and Google
>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
>> languages, allowing users to easily implement their data integration
>> processes.
>>
>> == Proposal ==
>>
>> Dataflow is a simple, flexible, and powerful system for distributed data
>> processing at any scale. Dataflow provides a unified programming model, a
>> software development kit to define and construct data processing pipelines,
>> and runners to execute Dataflow pipelines in several runtime engines, like
>> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
>> for a variety of streaming or batch data processing goals including ETL,
>> stream analysis, and aggregate computation. The underlying programming
>> model for Dataflow provides MapReduce-like parallelism, combined with
>> support for powerful data windowing, and fine-grained correctness control.
>>
>> == Background ==
>>
>> Dataflow started as a set of Google projects focused on making data
>> processing easier, faster, and less costly. The Dataflow model is a
>> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
>> focused on providing a unified solution for batch and stream processing.
>> These projects on which Dataflow is based have been published in several
>> papers made available to the public:
>>
>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>
>> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>
>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>
>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>
>> Dataflow was designed from the start to provide a portable programming
>> layer. When you define a data processing pipeline with the Dataflow model,
>> you are creating a job which is capable of being processed by any number of
>> Dataflow processing engines. Several engines have been developed to run
>> Dataflow pipelines in other open source runtimes, including a Dataflow
>> runner for Apache Flink and Apache Spark. There is also a “direct runner”,
>> for execution on the developer machine (mainly for dev/debug purposes).
>> Another runner allows a Dataflow program to run on a managed service,
>> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
>> already available on GitHub, and independent from the Google Cloud Dataflow
>> service. Another Python SDK is currently in active development.
>>
>> In this proposal, the Dataflow SDKs, model, and a set of runners will be
>> submitted as an OSS project under the ASF. The runners which are a part of
>> this proposal include those for Spark (from Cloudera), Flink (from data
>> Artisans), and local development (from Google); the Google Cloud Dataflow
>> service runner is not included in this proposal. Further references to
>> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
>> part of this proposal (Apache Dataflow) only. The initial submission will
>> contain the already-released Java SDK; Google intends to submit the Python
>> SDK later in the incubation process. The Google Cloud Dataflow service will
>> continue to be one of many runners for Dataflow, built on Google Cloud
>> Platform, to run Dataflow pip

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi Marvin,

You raise a point that we anticipated somewhat ;)

We're proposing the name Apache Dataflow because Google Cloud Dataflow is
an already known name and "brand".

The name is not directly related to dataflow programming: it's more
representative of the data flowing inside a pipeline.


Regards
JB

On 01/20/2016 06:12 PM, Marvin Humphrey wrote:
> On Wed, Jan 20, 2016 at 8:32 AM, James Malone
> wrote:
>
>> == Abstract ==
>>
>> Dataflow is an open source, unified model and set of language-specific SDKs
>> for defining and executing data processing workflows, and also data
>> ingestion and integration flows, supporting Enterprise Integration Patterns
>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
>> the mechanics of large-scale batch and streaming data processing and can
>> run on a number of runtimes like Apache Flink, Apache Spark, and Google
>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
>> languages, allowing users to easily implement their data integration
>> processes.
>
> In general this seems like an excellent project and a well-thought-through and
> viable proposal -- I certainly anticipate that it will be accepted for
> incubation in one form or another.
>
> However, how does this "Dataflow" project relate to the programming paradigm
> of "dataflow programming"?
>
> https://en.wikipedia.org/wiki/Dataflow_programming
>
> Besides the potential for confusion, it seems like the proposed project name
> would be tough to defend as a trademark.
>
>> With respect to trademark rights, Google does not hold a trademark on the
>> phrase “Dataflow.” Based on feedback and guidance we receive during the
>> incubation process, we are open to renaming the project if necessary for
>> trademark or other concerns.
>
> If a renaming is going to happen, there are advantages to renaming sooner
> rather than later and sparing the community additional disruption.
>
> Marvin Humphrey




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Marvin Humphrey

On Wed, Jan 20, 2016 at 8:32 AM, James Malone
 wrote:

> == Abstract ==
>
> Dataflow is an open source, unified model and set of language-specific SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> the mechanics of large-scale batch and streaming data processing and can
> run on a number of runtimes like Apache Flink, Apache Spark, and Google
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
> languages, allowing users to easily implement their data integration
> processes.

In general this seems like an excellent project and a well-thought-through and
viable proposal -- I certainly anticipate that it will be accepted for
incubation in one form or another.

However, how does this "Dataflow" project relate to the programming paradigm
of "dataflow programming"?

https://en.wikipedia.org/wiki/Dataflow_programming

Besides the potential for confusion, it seems like the proposed project name
would be tough to defend as a trademark.

> With respect to trademark rights, Google does not hold a trademark on the
> phrase “Dataflow.” Based on feedback and guidance we receive during the
> incubation process, we are open to renaming the project if necessary for
> trademark or other concerns.

If a renaming is going to happen, there are advantages to renaming sooner
rather than later and sparing the community additional disruption.

Marvin Humphrey


Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi Sean,

It's a fair point, though one that most proposals don't cover. It's
something that we can address in the "Community" section.


Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and MillWheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
sc

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Sean Busbey

Great proposal. I like that your proposal includes a well presented
roadmap, but I don't see any goals that directly address building a larger
community. Y'all have any ideas around outreach that will help with
adoption?

As a start, I recommend y'all add a section to the proposal on the wiki
page for "Additional Interested Contributors" so that folks who want to
sign up to participate in the project can do so without requesting
additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
jamesmal...@google.com.invalid> wrote:

> Hello everyone,
>
> Attached to this message is a proposed new project - Apache Dataflow, a
> unified programming model for data processing and integration.
>
> The text of the proposal is included below. Additionally, the proposal is
> in draft form on the wiki where we will make any required changes:
>
> https://wiki.apache.org/incubator/DataflowProposal
>
> We look forward to your feedback and input.
>
> Best,
>
> James
>
> 
>
> = Apache Dataflow =
>
> == Abstract ==
>
> Dataflow is an open source, unified model and set of language-specific SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> the mechanics of large-scale batch and streaming data processing and can
> run on a number of runtimes like Apache Flink, Apache Spark, and Google
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
> languages, allowing users to easily implement their data integration
> processes.
>
> == Proposal ==
>
> Dataflow is a simple, flexible, and powerful system for distributed data
> processing at any scale. Dataflow provides a unified programming model, a
> software development kit to define and construct data processing pipelines,
> and runners to execute Dataflow pipelines in several runtime engines, like
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
> for a variety of streaming or batch data processing goals including ETL,
> stream analysis, and aggregate computation. The underlying programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness control.
>
> == Background ==
>
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
> focused on providing a unified solution for batch and stream processing.
> These projects on which Dataflow is based have been published in several
> papers made available to the public:
>
> * MapReduce - http://research.google.com/archive/mapreduce.html
>
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>
> * MillWheel - http://research.google.com/pubs/pub41378.html
>
> Dataflow was designed from the start to provide a portable programming
> layer. When you define a data processing pipeline with the Dataflow model,
> you are creating a job which is capable of being processed by any number of
> Dataflow processing engines. Several engines have been developed to run
> Dataflow pipelines in other open source runtimes, including a Dataflow
> runner for Apache Flink and Apache Spark. There is also a “direct runner”,
> for execution on the developer machine (mainly for dev/debug purposes).
> Another runner allows a Dataflow program to run on a managed service,
> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
> already available on GitHub, and independent from the Google Cloud Dataflow
> service. Another Python SDK is currently in active development.
>
> In this proposal, the Dataflow SDKs, model, and a set of runners will be
> submitted as an OSS project under the ASF. The runners which are a part of
> this proposal include those for Spark (from Cloudera), Flink (from data
> Artisans), and local development (from Google); the Google Cloud Dataflow
> service runner is not included in this proposal. Further references to
> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
> part of this proposal (Apache Dataflow) only. The initial submission will
> contain the already-released Java SDK; Google intends to submit the Python
> SDK later in the incubation process. The Google Cloud Dataflow service will
> continue to be one of many runners for Dataflow, built on Google Cloud
> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
> develop against the Apache project additions, updates, and changes. Google
> Cloud Dataflow will become one user of Apache Dataflow and will participate
> in the project openly and publicly.
>
> The Dataflow programming model has been designed with simplicity,
> scalability, and speed as key tenets.

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Marko Rodriguez

Hi,

This is a cool idea. It's like Apache TinkerPop
(http://tinkerpop.incubator.apache.org/) but for data flow/stream systems as
opposed to graph systems. Our tag line is "the JDBC for graphs." You would be
"the JDBC for streams." :)

You might be interested in looking at TinkerPop's Gremlin language, as it's a
data flow language.
http://tinkerpop.apache.org/docs/3.1.0-incubating/#traversal
Moreover, it's a virtual machine that allows other languages to compile to it.

http://www.datastax.com/dev/blog/the-benefits-of-the-gremlin-graph-traversal-machine
https://github.com/dkuppitz/sparql-gremlin
https://github.com/twilmes/sql-gremlin

Do you have a URL to any documentation on Apache Dataflow's DSL? Perhaps there 
are ideas we can steal!

Take care,
Marko.  

http://markorodriguez.com

On Jan 20, 2016, at 9:32 AM, James Malone  
wrote:

> Hello everyone,
> 
> Attached to this message is a proposed new project - Apache Dataflow, a
> unified programming model for data processing and integration.
> 
> The text of the proposal is included below. Additionally, the proposal is
> in draft form on the wiki where we will make any required changes:
> 
> https://wiki.apache.org/incubator/DataflowProposal
> 
> We look forward to your feedback and input.
> 
> Best,
> 
> James
> 
> 
> 
> = Apache Dataflow =
> 
> == Abstract ==
> 
> Dataflow is an open source, unified model and set of language-specific SDKs
> for defining and executing data processing workflows, and also data
> ingestion and integration flows, supporting Enterprise Integration Patterns
> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
> the mechanics of large-scale batch and streaming data processing and can
> run on a number of runtimes like Apache Flink, Apache Spark, and Google
> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
> languages, allowing users to easily implement their data integration
> processes.
> 
> == Proposal ==
> 
> Dataflow is a simple, flexible, and powerful system for distributed data
> processing at any scale. Dataflow provides a unified programming model, a
> software development kit to define and construct data processing pipelines,
> and runners to execute Dataflow pipelines in several runtime engines, like
> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
> for a variety of streaming or batch data processing goals including ETL,
> stream analysis, and aggregate computation. The underlying programming
> model for Dataflow provides MapReduce-like parallelism, combined with
> support for powerful data windowing, and fine-grained correctness control.
> 
> == Background ==
> 
> Dataflow started as a set of Google projects focused on making data
> processing easier, faster, and less costly. The Dataflow model is a
> successor to MapReduce, FlumeJava, and MillWheel inside Google and is
> focused on providing a unified solution for batch and stream processing.
> These projects on which Dataflow is based have been published in several
> papers made available to the public:
> 
> * MapReduce - http://research.google.com/archive/mapreduce.html
> 
> * Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> 
> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
> 
> * MillWheel - http://research.google.com/pubs/pub41378.html
> 
> Dataflow was designed from the start to provide a portable programming
> layer. When you define a data processing pipeline with the Dataflow model,
> you are creating a job which is capable of being processed by any number of
> Dataflow processing engines. Several engines have been developed to run
> Dataflow pipelines in other open source runtimes, including a Dataflow
> runner for Apache Flink and Apache Spark. There is also a “direct runner”,
> for execution on the developer machine (mainly for dev/debug purposes).
> Another runner allows a Dataflow program to run on a managed service,
> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
> already available on GitHub, and independent from the Google Cloud Dataflow
> service. Another Python SDK is currently in active development.
> 
> In this proposal, the Dataflow SDKs, model, and a set of runners will be
> submitted as an OSS project under the ASF. The runners which are a part of
> this proposal include those for Spark (from Cloudera), Flink (from data
> Artisans), and local development (from Google); the Google Cloud Dataflow
> service runner is not included in this proposal. Further references to
> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
> part of this proposal (Apache Dataflow) only. The initial submission will
> contain the already-released Java SDK; Google intends to submit the Python
> SDK later in the incubation process. The Google Cloud Dataflow service will
> continue to be one of many runners for Dataflow, built on Google Cloud
> Platform, to run Dataflo

Re: [DISCUSS] Apache Dataflow Incubator Proposal

2016-01-20 Thread Jean-Baptiste Onofré


Hi all,

I second James here, and I'm really excited to be champion of the project
(and to work on the codebase as well).

I blogged a quick technical introduction to Dataflow:

http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

Thanks, James!

We look forward to your feedback.

Regards
JB

On 01/20/2016 05:32 PM, James Malone wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a
unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is
in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James



= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs
for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
the mechanics of large-scale batch and streaming data processing and can
run on a number of runtimes like Apache Flink, Apache Spark, and Google
Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
languages, allowing users to easily implement their data integration
processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data
processing at any scale. Dataflow provides a unified programming model, a
software development kit to define and construct data processing pipelines,
and runners to execute Dataflow pipelines in several runtime engines, like
Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used
for a variety of streaming or batch data processing goals including ETL,
stream analysis, and aggregate computation. The underlying programming
model for Dataflow provides MapReduce-like parallelism, combined with
support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data
processing easier, faster, and less costly. The Dataflow model is a
successor to MapReduce, FlumeJava, and MillWheel inside Google and is
focused on providing a unified solution for batch and stream processing.
These projects on which Dataflow is based have been published in several
papers made available to the public:

* MapReduce - http://research.google.com/archive/mapreduce.html

* Dataflow model  - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

* FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf

* MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming
layer. When you define a data processing pipeline with the Dataflow model,
you are creating a job which is capable of being processed by any number of
Dataflow processing engines. Several engines have been developed to run
Dataflow pipelines in other open source runtimes, including a Dataflow
runner for Apache Flink and Apache Spark. There is also a “direct runner”,
for execution on the developer machine (mainly for dev/debug purposes).
Another runner allows a Dataflow program to run on a managed service,
Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is
already available on GitHub, and independent from the Google Cloud Dataflow
service. Another Python SDK is currently in active development.
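
A minimal sketch of the portability idea in the paragraph above, written in
Python. Every name here (Pipeline, DirectRunner, LazyRunner, split_words,
lowercase) is hypothetical and is NOT the Dataflow SDK API; the only point is
that the pipeline definition never refers to the engine that executes it, so
the same definition can be handed to different runners.

```python
from typing import Callable, Iterable, List

# Hypothetical toy model, not the Dataflow SDK: a pipeline is an ordered
# list of steps, defined without reference to any execution engine.
Step = Callable[[Iterable], Iterable]


class Pipeline:
    def __init__(self) -> None:
        self.steps: List[Step] = []

    def apply(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self  # returning self allows chained apply() calls


class DirectRunner:
    """Runs each step eagerly in-process, like a local dev/debug runner."""

    def run(self, pipeline: Pipeline, source: Iterable) -> list:
        data = list(source)
        for step in pipeline.steps:
            data = list(step(data))  # materialize after every step
        return data


class LazyRunner:
    """Stand-in for a different engine: chains the steps as lazy iterators."""

    def run(self, pipeline: Pipeline, source: Iterable) -> list:
        data = iter(source)
        for step in pipeline.steps:
            data = step(data)  # nothing is computed yet
        return list(data)  # the whole chain is evaluated here


def split_words(lines: Iterable[str]) -> Iterable[str]:
    for line in lines:
        yield from line.split()


def lowercase(words: Iterable[str]) -> Iterable[str]:
    return (word.lower() for word in words)


# One pipeline definition, two engines.
pipeline = Pipeline().apply(split_words).apply(lowercase)
lines = ["Hello World", "hello again"]

direct = DirectRunner().run(pipeline, lines)
lazy = LazyRunner().run(pipeline, lines)
print(direct == lazy)
```

The two runners differ only in how they schedule the work (eager versus
lazy); a real runner would instead translate the pipeline into jobs for its
own engine.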

In this proposal, the Dataflow SDKs, model, and a set of runners will be
submitted as an OSS project under the ASF. The runners which are a part of
this proposal include those for Spark (from Cloudera), Flink (from data
Artisans), and local development (from Google); the Google Cloud Dataflow
service runner is not included in this proposal. Further references to
Dataflow will refer to the Dataflow model, SDKs, and runners which are a
part of this proposal (Apache Dataflow) only. The initial submission will
contain the already-released Java SDK; Google intends to submit the Python
SDK later in the incubation process. The Google Cloud Dataflow service will
continue to be one of many runners for Dataflow, built on Google Cloud
Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will
develop against the Apache project additions, updates, and changes. Google
Cloud Dataflow will become one user of Apache Dataflow and will participate
in the project openly and publicly.

The Dataflow programming model has been designed with simplicity,
scalability, and speed as key tenets. In the Dataflow model, you only need
to think about four top-level concepts when constructing your data
processing job:

* Pipelines - The data processing job made of a series of computations
including input, processing, and output

* PCollections - Bounded (or unbounded) datasets which represent the input,
intermediate and output data in pipelines

* PTransforms -
