Re: [DISCUSS] Apache Dataflow Incubator Proposal
Same reason you hear "can't trademark certain parts of speech". Half-informed zealots go off half-cocked. For instance, here is apparently authoritative advice that a trademark can only be an adjective:

http://www.ramseylawgroup.com/viewarticle.php?id=21

And here is a real-world analysis:

http://itre.cis.upenn.edu/myl/languagelog/archives/000943.html

The fact is, I bought a BMW, not a BMW car. It is still a trademark in spite of my linguistic faux pas.

> On Jan 28, 2016, at 18:47, Greg Stein wrote:
>
> Hrm. Given that, I'm confused why I keep hearing "oh, natural word, can't
> be trademarked."

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
> On 28 Jan 16, at 21:47, Greg Stein wrote:
>
> On Thu, Jan 28, 2016 at 6:29 PM, Doug Cutting wrote:
>
>> On Thu, Jan 28, 2016 at 3:11 PM, Greg Stein wrote:
>>> As a regular english word, "beam" cannot be trademarked, by others/us.
>>
>> Like Windows® or Apple®?
>
> oh, snap. True.
>
> Hrm. Given that, I'm confused why I keep hearing "oh, natural word, can't
> be trademarked."
>
> Thx,
> -g

"Snap" is not yet trademarked, so you get off "free." But do look at this: very clear, clean, concise, comprehensible.

http://www.inta.org/TrademarkBasics/FactSheets/Pages/TrademarksvsGenericTermsFactSheet.aspx

Relevant quotes:

Trademarks vs. Generic Terms (Updated, June 2015)

1. What is meant by “generic term”?

Generic terms are common words or terms, often found in the dictionary, that identify products and services and are not specific to any particular source. It is not possible to register as a trademark a term that is generic for the goods and/or services identified in the application. If a trademark becomes generic, often as a result of improper use, rights in the mark may no longer be enforceable.

2. Are generic terms considered a category of trademarks?

In assessing their suitability as trademarks, words can be divided into five categories. These categories range from fanciful, invented words, which typically are strong trademarks, to generic terms, which are not protectable at all. The stronger the mark, the more protection it will be given against other marks. The categories, ranked in decreasing order in terms of strength, are:

a. Fanciful Marks: coined (made-up) words that have no relation to the goods being described (e.g., EXXON for petroleum products).

***b. Arbitrary Marks: existing words that contribute no meaning to the goods being described (e.g., APPLE for computers).***

c. Suggestive Marks: words that suggest meaning or relation but that do not describe the goods themselves (e.g., COPPERTONE for suntan lotion).

d. Descriptive Marks: marks that describe either the goods or a characteristic of the goods. Often it is very difficult to enforce trademark rights in a descriptive mark unless the mark has acquired a secondary meaning (e.g., SHOELAND for a shoe store).

e. Generic Terms: words that are the accepted and recognized description of a class of goods or services (e.g., computer software, facial tissue).

Interestingly, and perhaps not surprising to some of us, Windows™ is far more plausibly a suitable descriptor of a software user interface that steps out of the confines of the command line (which I prefer, as it happens) than Apple is of an electronic calculating engine made of dirty silicon and shiny metal, colourful plastic, and exotic minerals. Unless, that is, one thinks of what apple means in relation to tempting knowledge that goes beyond good and evil.

Regarding "Beam." I think the items Inta.org offers give some guidance?

louis
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Thu, Jan 28, 2016 at 6:29 PM, Doug Cutting wrote:

> On Thu, Jan 28, 2016 at 3:11 PM, Greg Stein wrote:
> > As a regular english word, "beam" cannot be trademarked, by others/us.
>
> Like Windows® or Apple®?

oh, snap. True.

Hrm. Given that, I'm confused why I keep hearing "oh, natural word, can't be trademarked."

Thx,
-g
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Thu, Jan 28, 2016 at 3:11 PM, Greg Stein wrote:

> As a regular english word, "beam" cannot be trademarked, by others/us.

Like Windows® or Apple®?

Doug
Re: [DISCUSS] Apache Dataflow Incubator Proposal
As a regular english word, "beam" cannot be trademarked, by others/us. Yet the *pair* of words, "Apache Beam" can be implicitly/explicitly trademarked.

On Thu, Jan 28, 2016 at 11:51 AM, Alex Harui wrote:

> On 1/28/16, 3:26 AM, "Jean-Baptiste Onofré" wrote:
>
> >I prefer Beam ;)
>
> I like the name and the logic behind choosing it. Some concerns are that
> a Google search of "Beam Software" turned up [1] and [2] among others,
> which might mean that Apache Beam won't work as a TLP name. "Beam" is a
> good stem word so maybe you can add something to the front: "FlowBeam",
> "DataBeam", etc.
>
> -Alex
>
> [1] http://www.beamsoftware.com
> [2] https://earth.esa.int/web/sentinel/-/beam
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On 1/28/16, 3:26 AM, "Jean-Baptiste Onofré" wrote:

>I prefer Beam ;)

I like the name and the logic behind choosing it. Some concerns are that a Google search of "Beam Software" turned up [1] and [2] among others, which might mean that Apache Beam won't work as a TLP name. "Beam" is a good stem word so maybe you can add something to the front: "FlowBeam", "DataBeam", etc.

-Alex

[1] http://www.beamsoftware.com
[2] https://earth.esa.int/web/sentinel/-/beam
Re: [DISCUSS] Apache Dataflow Incubator Proposal
LOL ;)

Regards
JB

On 01/28/2016 01:00 PM, Hadrian Zbarcea wrote:
> Hi Bertrand,
>
> Your suggested name only has an entropy of 0.19153. You may consider adding a special character or two :).
>
> To standardize (every company seems to like this lately) we may consider using SHA-256 for our project names. Being of fixed length it will help Sally in her templates for press releases.
>
> Sorry, couldn't resist :)
> Hadrian
>
> On 01/28/2016 06:25 AM, Bertrand Delacretaz wrote:
>> On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber wrote:
>>> ...Please ignore my last message, I missed the fact that a project was already existing with the name “Arrow”...
>>
>> Hehe, that's the risk when using common names, with about 200 projects here.
>>
>> Naming your project sdkjhkjhsdfxyhs is safer in this respect but has other disadvantages.
>>
>> -Bertrand

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Bertrand,

Your suggested name only has an entropy of 0.19153. You may consider adding a special character or two :).

To standardize (every company seems to like this lately) we may consider using SHA-256 for our project names. Being of fixed length it will help Sally in her templates for press releases.

Sorry, couldn't resist :)
Hadrian

On 01/28/2016 06:25 AM, Bertrand Delacretaz wrote:
> On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber wrote:
>> ...Please ignore my last message, I missed the fact that a project was already existing with the name “Arrow”...
>
> Hehe, that's the risk when using common names, with about 200 projects here.
>
> Naming your project sdkjhkjhsdfxyhs is safer in this respect but has other disadvantages.
>
> -Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
I prefer Beam ;)

Regards
JB

On 01/28/2016 12:25 PM, Bertrand Delacretaz wrote:
> On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber wrote:
>> ...Please ignore my last message, I missed the fact that a project was already existing with the name “Arrow”...
>
> Hehe, that's the risk when using common names, with about 200 projects here.
>
> Naming your project sdkjhkjhsdfxyhs is safer in this respect but has other disadvantages.
>
> -Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Thu, Jan 28, 2016 at 11:41 AM, Serge Huber wrote:
> ...Please ignore my last message, I missed the fact that a project was
> already existing with the name “Arrow”...

Hehe, that's the risk when using common names, with about 200 projects here.

Naming your project sdkjhkjhsdfxyhs is safer in this respect but has other disadvantages.

-Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Please ignore my last message, I missed the fact that a project was already existing with the name “Arrow”.

cheers,
Serge…

> On 28 janv. 2016, at 11:01, Bertrand Delacretaz wrote:
>
> On Wed, Jan 27, 2016 at 6:22 PM, James Malone wrote:
>> ...To that end, the name we propose to use is:
>>
>> Apache Beam
>
> The name sounds good to me and it's indeed a good idea to set it now.
>
> -Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On a lighter side…

Just to mess with you guys, if you’re looking at an alternative name I might suggest:

Apache Arrow

I thought of this because I was thinking of how an arrow flows through the air, might be deviated, and it also associates with something sharp, dangerous and fast :) And of course an Apache would use an arrow more than a beam :)

But maybe Beam is more appropriate, but I just thought I’d put this out there.

cheers,
Serge…

> On 28 janv. 2016, at 11:01, Bertrand Delacretaz wrote:
>
> On Wed, Jan 27, 2016 at 6:22 PM, James Malone wrote:
>> ...To that end, the name we propose to use is:
>>
>> Apache Beam
>
> The name sounds good to me and it's indeed a good idea to set it now.
>
> -Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Wed, Jan 27, 2016 at 6:22 PM, James Malone wrote:
> ...To that end, the name we propose to use is:
>
> Apache Beam

The name sounds good to me and it's indeed a good idea to set it now.

-Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi everyone,

Based on the feedback concerning naming, we would like to rename the proposal and the project. We want to do this early to ensure we don't disrupt the project based on naming.

To that end, the name we propose to use is:

Apache Beam

The name Beam is based on a joining of Batch and strEAM to showcase the unified nature of the model and tools. We also wanted to select a name which is simple, memorable, and also not already in use.

Best,

James

On Fri, Jan 22, 2016 at 2:07 AM, Jean-Baptiste Onofré wrote:

> It makes perfect sense, and it's something that we already discussed.
>
> Thanks James and Marvin.
>
> @James, yes, we are going to deal with that together, not a problem at all. I agree that renaming should happen now.
> As discussed, we should be back with a new name early next week.
>
> I'm happy to see the discussion now (and thanks again Marvin for details and always helpful messages): it's exactly the purpose of sending the discussion thread on the incubator mailing list.
>
> Thanks guys !
>
> Regards
> JB
>
> On 01/22/2016 02:19 AM, James Malone wrote:
>
>> Thank you for such a detailed response Marvin!
>>
>> Everything you mention makes a lot of sense. Needless to say, we don't want to squander cycles, break any rules, or throw velocity into disarray all due to a name.
>>
>> To that end, I am going to work with JB to amend the proposal with respect to renaming. I'm also going to clarify a name change would be an immediate-term to-do item so it does not block creation of lists, repositories, and so on.
>>
>> Best,
>>
>> James
>>
>> On Thu, Jan 21, 2016 at 9:30 AM, Marvin Humphrey wrote:
>>
>>> On Wed, Jan 20, 2016 at 3:30 PM, James Malone wrote:
>>>
>>>> If we need to rename, we would ideally choose a new name, change the project name at that time, and start our refactoring with that new name. Is it acceptable for us to flag a name change as something we need to do as a near-term (1st month) item in incubation (if accepted)?
>>>>
>>>> If a rename is required I'd like to add it to our to-do roadmap but also not block our proposal on a renaming. I ask so we can address this concern in the best way possible.
>>>
>>> That's acceptable. Project naming issues do not block entry into the Incubator, they block graduation from the Incubator.
>>>
>>> Because "dataflow" is descriptive, it will be hard to defend as a trademark. The Wikipedia article on trademark distinctiveness explains things well:
>>>
>>> https://en.wikipedia.org/wiki/Trademark_distinctiveness
>>>
>>> A weak mark both increases the amount of volunteer effort that goes into dealing with infringement cases and makes bad outcomes more likely. It is not an absolute requirement that Apache projects have defensible names, but painful past experience has taught us that mishandled branding can deal surprising amounts of damage to a project community.
>>>
>>> But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache Dataflow" is a blocker. One or the other will have to be renamed, and since the software is being donated but apparently not the brand, it sounds like renaming the prospective Apache project will be required and you should add that task to your roadmap.
>>>
>>> Changing names in the middle of incubation is disruptive because it requires renaming infrastructure resources, impacting both the Apache Infrastructure team and also the podling's developer and user communities. My suggestion would be that immediately after the VOTE to enter incubation concludes, you only create a dev mailing list and deal with the renaming immediately, delaying the creation of other resources until after the renaming is resolved.
>>>
>>> However, the exact plan is something you can work out with your Mentors.
>>>
>>> Marvin Humphrey
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Renaud and Bertrand,

No worries Bertrand ! And thanks Renaud ;)

Regards
JB

On 01/26/2016 04:19 PM, Bertrand Delacretaz wrote:
> Bonjour Renaud,
>
> On Tue, Jan 26, 2016 at 4:04 PM, Renaud Richardet wrote:
>> ...Please add me to “Additional Interested Contributors” section as well
>
> I've done this, happily! (JB I hope you don't mind).
>
> (I know Renaud for quite some time and I think he can make great contributions to Dataflow).
>
> -Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Bonjour Renaud,

On Tue, Jan 26, 2016 at 4:04 PM, Renaud Richardet wrote:
> ...Please add me to “Additional Interested Contributors” section as well

I've done this, happily! (JB, I hope you don't mind.)

(I have known Renaud for quite some time and I think he can make great contributions to Dataflow.)

-Bertrand
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Bonjour,

Please add me to “Additional Interested Contributors” section as well. I am an Apache UIMA committer, and would like to use Dataflow to process large amounts of text [1]. I just started a POC [2] and really like the API so far.

Thanks,
Renaud

[1] https://github.com/BlueBrain/bluima
[2] https://github.com/renaud/textdataflow
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hey Ajay,

great: I added you on the proposal.

Thanks !

Regards
JB

On 01/25/2016 06:25 AM, Ajay Yadav wrote:
> Great proposal. I would also like to contribute to the project especially the Python SDK, if possible.
>
> Cheers
> Ajay Yadava
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Great proposal. I would also like to contribute to the project especially the Python SDK, if possible.

Cheers
Ajay Yadava

On Sun, Jan 24, 2016 at 1:25 AM, Jean-Baptiste Onofré wrote:

> Hi Seshu,
>
> it does both: streaming and batching data processing.
>
> Regards
> JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Seshu,

it does both: streaming and batching data processing.

Regards
JB

On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:

Did not get a chance to play with it yet, Within Google is it used more as a MR replacement or a Stream processing engine? Or it does both of them fantastically well?

On 1/22/16, 10:58 AM, "Frances Perry" wrote:

Crunch started as a clone of FlumeJava, which was Google internal. In the meantime inside Google, FlumeJava evolved into Dataflow. So all three share a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow adds a number of new things -- the biggest being a unified batch/streaming semantics using concepts like Windowing and Triggers. Tyler Akidau's OReilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On Fri, Jan 22, 2016 at 10:42 AM, Ashish wrote:

Crunch has Spark pipelines, but not sure about the runner abstraction.

May be Josh Wills or Tom White can provide more insight on this topic. They are core devs for both projects :)

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré wrote:

Hi,

I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce pipeline, it doesn't provide runner abstraction. It's based on FlumeJava.

The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm wrong, but Crunch started after Google Dataflow, especially because Dataflow was not opensourced at that time.

So, I agree it's very similar/close.

Regards
JB

On 01/22/2016 05:51 PM, Ashish wrote:

Hi JB,

Curious to know about how it compares to Apache Crunch? Constructs looks very familiar (had used Crunch long ago)

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré wrote:

Hi Seshu,

I blogged about Apache Dataflow proposal:
http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/

You can see in the "what's next ?" section that new runners, skins and sources are on our roadmap. Definitely, a storm runner could be part of this.

Regards
JB

On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:

Awesome to see CloudDataFlow coming to Apache. The Stream Processing area has been in general fragmented with a variety of solutions, hoping the community galvanizes around Apache Data Flow.

We are still in the "Apache Storm" world, Any chance for folks building a "Storm Runner"?

On 1/20/16, 9:39 AM, "James Malone" wrote:

> Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?

Thank you and fair point. We have a few additional ideas which we can put into the Community section.

> As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.

This is a great idea and I think it makes a lot of sense to add an "Additional Interested Contributors" section to the proposal.

> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
>
>> Hello everyone,
>>
>> Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.
>>
>> The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:
>>
>> https://wiki.apache.org/incubator/DataflowProposal
>>
>> We look forward to your feedback and input.
>>
>> Best,
>>
>> James
>>
>> = Apache Dataflow =
>>
>> == Abstract ==
>>
>> Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes.
>>
>> == Proposal ==
>>
>> Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with
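
To make the concepts mentioned in this thread a little more concrete (DoFn, ParDo, and a pipeline that is defined once and then handed to different runners), here is a minimal sketch in plain Python. It does not use the Dataflow SDK: every name in it (DoFn, Pipeline, par_do, run_batch, run_streaming) is a hypothetical stand-in chosen for illustration, and the two "runners" are toy functions, not Spark, Flink, or Google Cloud Dataflow.

```python
# Toy illustration only; none of these names come from the real SDK.

class DoFn:
    """User code applied to one element at a time; may emit 0..n outputs."""

    def process(self, element):
        raise NotImplementedError


class SplitWords(DoFn):
    def process(self, element):
        yield from element.split()


class Upper(DoFn):
    def process(self, element):
        yield element.upper()


class Pipeline:
    """An ordered list of ParDo-style steps, independent of how it is run."""

    def __init__(self):
        self.steps = []

    def par_do(self, fn):
        self.steps.append(fn)
        return self


def run_batch(pipeline, source):
    """Toy 'batch runner': materialise the whole collection after each step."""
    data = list(source)
    for fn in pipeline.steps:
        data = [out for element in data for out in fn.process(element)]
    return data


def run_streaming(pipeline, source):
    """Toy 'streaming runner': push each element through all steps on arrival."""

    def apply(steps, element):
        if not steps:
            yield element
            return
        for out in steps[0].process(element):
            yield from apply(steps[1:], out)

    for element in source:
        yield from apply(pipeline.steps, element)


# The pipeline is described once and executed two different ways.
pipeline = Pipeline().par_do(SplitWords()).par_do(Upper())
lines = ["batch and stream", "one model"]

batch_result = run_batch(pipeline, lines)
stream_result = list(run_streaming(pipeline, iter(lines)))

print(batch_result)                   # ['BATCH', 'AND', 'STREAM', 'ONE', 'MODEL']
print(batch_result == stream_result)  # True
```

The point of the sketch is only the separation of concerns: the pipeline object knows nothing about execution, so swapping the runner does not change the user's code. A real runner would additionally handle distribution, fault tolerance, and unbounded inputs.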
Re: [DISCUSS] Apache Dataflow Incubator Proposal
I have not had a chance to play with it yet. Within Google, is it used more as a MapReduce replacement or as a stream processing engine? Or does it do both of them fantastically well?

On 1/22/16, 10:58 AM, "Frances Perry" wrote:

> Crunch started as a clone of FlumeJava, which was Google internal. In the
> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
> adds a number of new things -- the biggest being a unified batch/streaming
> semantics using concepts like Windowing and Triggers. Tyler Akidau's
> O'Reilly post has a really nice explanation:
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Great proposal!

I agree with Stian: the name Dataflow has been widely used for a long time, so maybe we should think about another one.

Thanks.

Best Regards!
- Luke Han

On Sat, Jan 23, 2016 at 2:17 PM, Alex Harui wrote:

> On 1/22/16, 10:58 AM, "Frances Perry" wrote:
>
> > Crunch started as a clone of FlumeJava, which was Google internal. In the
> > meantime inside Google, FlumeJava evolved into Dataflow. So all three share
> > a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
> > adds a number of new things -- the biggest being a unified batch/streaming
> > semantics using concepts like Windowing and Triggers.
>
> And somewhere in there might be your new podling name. WinTrig or
> Wintrigue or something like that.
>
> -Alex
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On 1/22/16, 10:58 AM, "Frances Perry" wrote:

> Crunch started as a clone of FlumeJava, which was Google internal. In the
> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
> adds a number of new things -- the biggest being a unified batch/streaming
> semantics using concepts like Windowing and Triggers.

And somewhere in there might be your new podling name. WinTrig or Wintrigue or something like that.

-Alex
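To make the windowing idea in Frances's description a bit more concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK; `par_do`, `fixed_window_start`, `sum_per_window`, and `parse` are hypothetical names chosen for illustration, and the 60-second window size is an assumption. It shows an element-wise transform (in the spirit of ParDo/DoFn) followed by grouping on event time. Triggers, which decide when a window's result is emitted, are not modeled here.

```python
from collections import defaultdict


def par_do(elements, do_fn):
    """Apply do_fn to every element; do_fn may yield zero or more outputs."""
    for element in elements:
        yield from do_fn(element)


def fixed_window_start(timestamp, size):
    """Return the start of the fixed window that contains this timestamp."""
    return timestamp - (timestamp % size)


def sum_per_window(events, size):
    """Sum values per fixed window; events are (event_time_seconds, value)."""
    totals = defaultdict(int)
    for timestamp, value in events:
        totals[fixed_window_start(timestamp, size)] += value
    return dict(sorted(totals.items()))


def parse(line):
    """Hypothetical DoFn-like function: turn 'timestamp,value' into a pair."""
    timestamp, value = line.split(",")
    yield int(timestamp), int(value)


# Input arrives out of event-time order (assumed sample data).
lines = ["5,1", "61,2", "12,3", "119,4"]
events = par_do(lines, parse)
result = sum_per_window(events, 60)
print(result)
```

Because each element is assigned to a window by its own timestamp, the out-of-order input above gives the same per-window totals as sorted input would. On an unbounded stream, deciding when to emit each window's total is the part that triggers would handle.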
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Thanks, Frances! That explains it. I wrote a couple of posts on basic usage of Crunch; maybe it's time to rewrite them with Dataflow.

On Fri, Jan 22, 2016 at 10:58 AM, Frances Perry wrote:

> Crunch started as a clone of FlumeJava, which was Google internal. In the
> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
> adds a number of new things -- the biggest being a unified batch/streaming
> semantics using concepts like Windowing and Triggers. Tyler Akidau's
> O'Reilly post has a really nice explanation:
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Crunch started as a clone of FlumeJava, which was Google internal. In the meantime inside Google, FlumeJava evolved into Dataflow. So all three share a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow adds a number of new things -- the biggest being a unified batch/streaming semantics using concepts like Windowing and Triggers. Tyler Akidau's O'Reilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On Fri, Jan 22, 2016 at 10:42 AM, Ashish wrote:

> Crunch has Spark pipelines, but not sure about the runner abstraction.
>
> Maybe Josh Wills or Tom White can provide more insight on this topic.
> They are core devs for both projects :)
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Crunch has Spark pipelines, but I am not sure about the runner abstraction.

Maybe Josh Wills or Tom White can provide more insight on this topic. They are core devs for both projects :)

On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré wrote:

> Hi,
>
> I don't know Crunch deeply, but AFAIK Crunch creates MapReduce pipelines;
> it doesn't provide a runner abstraction. It's based on FlumeJava.
>
> The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
> wrong, but Crunch started after Google Dataflow, especially because Dataflow
> was not open sourced at that time.
>
> So, I agree it's very similar/close.
>
> Regards
> JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi,

I don't know Crunch deeply, but AFAIK Crunch creates MapReduce pipelines; it doesn't provide a runner abstraction. It's based on FlumeJava.

The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm wrong, but Crunch started after Google Dataflow, especially because Dataflow was not open sourced at that time.

So, I agree it's very similar/close.

Regards
JB

On 01/22/2016 05:51 PM, Ashish wrote:

> Hi JB,
>
> Curious to know how it compares to Apache Crunch? The constructs
> look very familiar (I had used Crunch long ago).
>
> Thoughts?
>
> - Ashish
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi JB,

Curious to know how it compares to Apache Crunch? The constructs look very familiar (I had used Crunch long ago).

Thoughts?

- Ashish

On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré wrote:

> Hi Seshu,
>
> I blogged about the Apache Dataflow proposal:
> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>
> You can see in the "what's next ?" section that new runners, skins and
> sources are on our roadmap. Definitely, a Storm runner could be part of
> this.
>
> Regards
> JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
As a committer on another "dataflow" incubator project, Taverna, I think this looks like an exciting proposal.

I agree on the confusion over the name, and it's probably better to get that sorted early. In Taverna we have used the term "dataflow" since 2004, and as a concept the paradigm was created in the 1960s. So Dataflow is a bit too broad and likely not trademarkable. Your model seems more of an event-driven workflow, as you explain in the paper.

You can do a renaming during the very first month of incubation (which several incubator projects have done) - it's a simple way to engage everyone in the newly formed/refreshed incubator community, who should then feel ownership of the name decision, rather than letting a selected few decide beforehand.

In your case you do not already have a single community mailing list (?), so perhaps it would be harder to do this kind of community decision as a GitHub issue?

Remember, the later you rename, the more you have to rename, like mailing list addresses, code repositories, package names, documentation, website... :)

On 20 January 2016 at 17:12, Marvin Humphrey wrote:

> On Wed, Jan 20, 2016 at 8:32 AM, James Malone wrote:
>
>> == Abstract ==
>>
>> Dataflow is an open source, unified model and set of language-specific SDKs
>> for defining and executing data processing workflows, and also data
>> ingestion and integration flows, supporting Enterprise Integration Patterns
>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify
>> the mechanics of large-scale batch and streaming data processing and can
>> run on a number of runtimes like Apache Flink, Apache Spark, and Google
>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different
>> languages, allowing users to easily implement their data integration
>> processes.
>
> In general this seems like an excellent project and a well-thought-through and
> viable proposal -- I certainly anticipate that it will be accepted for
> incubation in one form or another.
>
> However, how does this "Dataflow" project relate to the programming paradigm
> of "dataflow programming"?
>
> https://en.wikipedia.org/wiki/Dataflow_programming
>
> Besides the potential for confusion, it seems like the proposed project name
> would be tough to defend as a trademark.
>
>> With respect to trademark rights, Google does not hold a trademark on the
>> phrase “Dataflow.” Based on feedback and guidance we receive during the
>> incubation process, we are open to renaming the project if necessary for
>> trademark or other concerns.
>
> If a renaming is going to happen, there are advantages to renaming sooner
> rather than later and sparing the community additional disruption.
>
> Marvin Humphrey

--
Stian Soiland-Reyes
Apache Taverna (incubating), Apache Commons RDF (incubating)
http://orcid.org/-0001-9842-9718
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Awesome to see CloudDataFlow coming to Apache. The Stream Processing area has been in general fragmented with a variety of solutions, hoping the community galvanizes around Apache Data Flow. We are still in the "Apache Storm" world, Any chance for folks building a "Storm Runner²? On 1/20/16, 9:39 AM, "James Malone" wrote: >> Great proposal. I like that your proposal includes a well presented >> roadmap, but I don't see any goals that directly address building a >>larger >> community. Y'all have any ideas around outreach that will help with >> adoption? >> > >Thank you and fair point. We have a few additional ideas which we can put >into the Community section. > > >> >> As a start, I recommend y'all add a section to the proposal on the wiki >> page for "Additional Interested Contributors" so that folks who want to >> sign up to participate in the project can do so without requesting >> additions to the initial committer list. >> >> >This is a great idea and I think it makes a lot of sense to add an >"Additional >Interested Contributors" section to the proposal. > > >> On Wed, Jan 20, 2016 at 10:32 AM, James Malone < >> jamesmal...@google.com.invalid> wrote: >> >> > Hello everyone, >> > >> > Attached to this message is a proposed new project - Apache Dataflow, >>a >> > unified programming model for data processing and integration. >> > >> > The text of the proposal is included below. Additionally, the >>proposal is >> > in draft form on the wiki where we will make any required changes: >> > >> > https://wiki.apache.org/incubator/DataflowProposal >> > >> > We look forward to your feedback and input. 
>> > >> > Best, >> > >> > James >> > >> > >> > >> > = Apache Dataflow = >> > >> > == Abstract == >> > >> > Dataflow is an open source, unified model and set of language-specific >> SDKs >> > for defining and executing data processing workflows, and also data >> > ingestion and integration flows, supporting Enterprise Integration >> Patterns >> > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines >>simplify >> > the mechanics of large-scale batch and streaming data processing and >>can >> > run on a number of runtimes like Apache Flink, Apache Spark, and >>Google >> > Cloud Dataflow (a cloud service). Dataflow also brings DSL in >>different >> > languages, allowing users to easily implement their data integration >> > processes. >> > >> > == Proposal == >> > >> > Dataflow is a simple, flexible, and powerful system for distributed >>data >> > processing at any scale. Dataflow provides a unified programming >>model, a >> > software development kit to define and construct data processing >> pipelines, >> > and runners to execute Dataflow pipelines in several runtime engines, >> like >> > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be >> used >> > for a variety of streaming or batch data processing goals including >>ETL, >> > stream analysis, and aggregate computation. The underlying programming >> > model for Dataflow provides MapReduce-like parallelism, combined with >> > support for powerful data windowing, and fine-grained correctness >> control. >> > >> > == Background == >> > >> > Dataflow started as a set of Google projects focused on making data >> > processing easier, faster, and less costly. The Dataflow model is a >> > successor to MapReduce, FlumeJava, and Millwheel inside Google and is >> > focused on providing a unified solution for batch and stream >>processing. 
>> > These projects on which Dataflow is based have been published in >>several >> > papers made available to the public: >> > >> > * MapReduce - http://research.google.com/archive/mapreduce.html >> > >> > * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >> > >> > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf >> > >> > * MillWheel - http://research.google.com/pubs/pub41378.html >> > >> > Dataflow was designed from the start to provide a portable programming >> > layer. When you define a data processing pipeline with the Dataflow >> model, >> > you are creating a job which is capable of being processed by any >>number >> of >> > Dataflow processing engines. Several engines have been developed to >>run >> > Dataflow pipelines in other open source runtimes, including a Dataflow >> > runner for Apache Flink and Apache Spark. There is also a “direct >> runner”, >> > for execution on the developer machine (mainly for dev/debug >>purposes). >> > Another runner allows a Dataflow program to run on a managed service, >> > Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java >>SDK is >> > already available on GitHub, and independent from the Google Cloud >> Dataflow >> > service. Another Python SDK is currently in active development. >> > >> > In this proposal, the Dataflow SDKs, model, and a set of runners will >>be >> > submitted as an OSS project under the ASF. The runners which are a >>part >> of >> > this proposal include those for Spark (from Cloudera), Flink
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Seshu, I blogged about the Apache Dataflow proposal: http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/ You can see in the "what's next ?" section that new runners, skins and sources are on our roadmap. Definitely, a Storm runner could be part of this. Regards JB On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote: Awesome to see CloudDataFlow coming to Apache. The Stream Processing area has been in general fragmented with a variety of solutions, hoping the community galvanizes around Apache Data Flow. We are still in the "Apache Storm" world, Any chance for folks building a "Storm Runner"? On 1/20/16, 9:39 AM, "James Malone" wrote: Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? Thank you and fair point. We have a few additional ideas which we can put into the Community section. As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. This is a great idea and I think it makes a lot of sense to add an "Additional Interested Contributors" section to the proposal. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input.
Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. 
These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Supun, I added you on the proposal. Thanks ! Regards JB On 01/22/2016 12:37 AM, Supun Kamburugamuve wrote: We are developing parallel machine learning algorithms for a research project and are very interested in DataFlow. I would like to contribute to this project as well. It will be great if you can add me. Thanks, Supun... On Thu, Jan 21, 2016 at 6:29 PM, Mayank Bansal wrote: Hi Jean, Nice Proposal. I wanted to contribute to this project. Can you please add me too? Thanks a lot for the help Thanks, Mayank On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré wrote: Hey Alex, awesome: I added you on the proposal. Thanks, Regards JB On 01/21/2016 05:03 PM, Alexander Bezzubov wrote: Hi, it's great to see DataFlow becoming part of the Apache ecosystem, thank you for bringing it in. I would be happy to get involved and help. -- Alex On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré wrote: Perfect: done, you are on the proposal. Thanks ! Regards JB On 01/21/2016 11:55 AM, chatz wrote: Charitha Elvitigala On 21 January 2016 at 16:17, Jean-Baptiste Onofré wrote: Hi Chatz, sure, what name should I use on the proposal, Charitha ? Regards JB On 01/21/2016 11:32 AM, chatz wrote: Hi Jean, I’d be interested in contributing as well. Thanks, Chatz On 21 January 2016 at 14:22, Jean-Baptiste Onofré wrote: Sweet: you are on the proposal ;) Thanks ! Regards JB On 01/21/2016 08:55 AM, Byung-Gon Chun wrote: This looks very interesting. I'm interested in contributing. Thanks. -Gon --- Byung-Gon Chun On Thu, Jan 21, 2016 at 1:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input.
Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. 
These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Tsuyoshi Awesome: I added you on the proposal. Thanks ! Regards JB On 01/22/2016 04:29 AM, Tsuyoshi Ozawa wrote: Hi, I'm a core developer of Apache Hadoop and a contributor of Apache Tez. I'd be also interested in working on Apache Dataflow as an individual. Regards, - Tsuyoshi -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Thursday, January 21, 2016 2:38 PM To: general@incubator.apache.org Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal Hi, great: I added you in the proposal. Thanks ! Regards JB On 01/21/2016 12:24 AM, Prasanth Jayachandran wrote: Hi Jean I’d be interested in contributing as well. Thanks Prasanth Jayachandran On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote: It's a great news the project is going to move to Apache. I'd be interested in contributing too Regards Krzysztof -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Mayank, sure: you are in. Thanks ! Regards JB On 01/22/2016 12:29 AM, Mayank Bansal wrote: Hi Jean, Nice Proposal. I wanted to contribute to this project. Can you please add me too? Thanks a lot for the help Thanks, Mayank On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré wrote: Hey Alex, awesome: I added you on the proposal. Thanks, Regards JB On 01/21/2016 05:03 PM, Alexander Bezzubov wrote: Hi, it's great to see DataFlow becoming part of the Apache ecosystem, thank you for bringing it in. I would be happy to get involved and help. -- Alex On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré wrote: Perfect: done, you are on the proposal. Thanks ! Regards JB On 01/21/2016 11:55 AM, chatz wrote: Charitha Elvitigala On 21 January 2016 at 16:17, Jean-Baptiste Onofré wrote: Hi Chatz, sure, what name should I use on the proposal, Charitha ? Regards JB On 01/21/2016 11:32 AM, chatz wrote: Hi Jean, I’d be interested in contributing as well. Thanks, Chatz On 21 January 2016 at 14:22, Jean-Baptiste Onofré wrote: Sweet: you are on the proposal ;) Thanks ! Regards JB On 01/21/2016 08:55 AM, Byung-Gon Chun wrote: This looks very interesting. I'm interested in contributing. Thanks. -Gon --- Byung-Gon Chun On Thu, Jan 21, 2016 at 1:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input.
Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
It makes perfect sense, and it's something that we already discussed. Thanks James and Marvin. @James, yes, we are going to deal with that together, not a problem at all. I agree that renaming should happen now. As discussed, we should be back with a new name early next week. I'm happy to see the discussion now (and thanks again Marvin for details and always helpful messages): it's exactly the purpose of sending the discussion thread on the incubator mailing list. Thanks guys ! Regards JB On 01/22/2016 02:19 AM, James Malone wrote: Thank you for such a detailed response Marvin! Everything you mention makes a lot of sense. Needless to say, we don't want to squander cycles, break any rules, or throw velocity into disarray all due to a name. To that end, I am going to work with JB to amend the proposal with respect to renaming. I'm also going to clarify that a name change would be an immediate-term to-do item so it does not block creation of lists, repositories, and so on. Best, James I am going to work with JB to amend the proposal to indicate On Thu, Jan 21, 2016 at 9:30 AM, Marvin Humphrey wrote: On Wed, Jan 20, 2016 at 3:30 PM, James Malone wrote: If we need to rename, we would ideally choose a new name, change the project name at that time, and start our refactoring with that new name. Is it acceptable for us to flag a name change as something we need to do as a near-term (1st month) item in incubation (if accepted)? If a rename is required I'd like to add it to our to-do roadmap but also not block our proposal on a renaming. I ask so we can address this concern in the best way possible. That's acceptable. Project naming issues do not block entry into the Incubator, they block graduation from the Incubator. Because "dataflow" is descriptive, it will be hard to defend as a trademark.
The Wikipedia article on trademark distinctiveness explains things well: https://en.wikipedia.org/wiki/Trademark_distinctiveness A weak mark both increases the amount of volunteer effort that goes into dealing with infringement cases and makes bad outcomes more likely. It is not an absolute requirement that Apache projects have defensible names, but painful past experience has taught us that mishandled branding can deal surprising amounts of damage to a project community. But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache Dataflow" is a blocker. One or the other will have to be renamed, and since the software is being donated but apparently not the brand, it sounds like renaming the prospective Apache project will be required and you should add that task to your roadmap. Changing names in the middle of incubation is disruptive because it requires renaming infrastructure resources, impacting both the Apache Infrastructure team and also the podling's developer and user communities. My suggestion would be that immediately after the VOTE to enter incubation concludes, you only create a dev mailing list and deal with the renaming immediately, delaying the creation of other resources until after the renaming is resolved. However, the exact plan is something you can work out with your Mentors. Marvin Humphrey -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
RE: [DISCUSS] Apache Dataflow Incubator Proposal
Hi, I'm a core developer of Apache Hadoop and a contributor of Apache Tez. I'd be also interested in working on Apache Dataflow as an individual. Regards, - Tsuyoshi -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Thursday, January 21, 2016 2:38 PM To: general@incubator.apache.org Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal Hi, great: I added you in the proposal. Thanks ! Regards JB On 01/21/2016 12:24 AM, Prasanth Jayachandran wrote: > Hi Jean > > I’d be interested in contributing as well. > > Thanks > Prasanth Jayachandran > >> On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote: >> >> It's a great news the project is going to move to Apache. I'd be >> interested in contributing too >> >> Regards >> Krzysztof -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Thank you for such a detailed response Marvin! Everything you mention makes a lot of sense. Needless to say, we don't want to squander cycles, break any rules, or throw velocity into disarray all due to a name. To that end, I am going to work with JB to amend the proposal with respect to renaming. I'm also going to clarify that a name change would be an immediate-term to-do item so it does not block creation of lists, repositories, and so on. Best, James I am going to work with JB to amend the proposal to indicate On Thu, Jan 21, 2016 at 9:30 AM, Marvin Humphrey wrote: > On Wed, Jan 20, 2016 at 3:30 PM, James Malone > wrote: > > If we need to rename, we would ideally choose a new name, change the > > project name at that time, and start our refactoring with that new name. > Is > > it acceptable for us to flag a name change as something we need to do as > a > > near-term (1st month) item in incubation (if accepted)? If a rename is > > required I'd like to add it to our to-do roadmap but also not block our > > proposal on a renaming. I ask so we can address this concern in the best > > way possible. > > That's acceptable. Project naming issues do not block entry into the > Incubator, they block graduation from the Incubator. > > Because "dataflow" is descriptive, it will be hard to defend as > a trademark. The Wikipedia article on trademark distinctiveness explains > things well: > > https://en.wikipedia.org/wiki/Trademark_distinctiveness > > A weak mark both increases the amount of volunteer effort that goes > into dealing with infringement cases and makes bad outcomes more likely. > It is not an absolute requirement that Apache projects have defensible > names, > but painful past experience has taught us that mishandled branding can deal > surprising amounts of damage to a project community. > > But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache > Dataflow" is > a blocker.
One or the other will have to be renamed, and since the > software > is being donated but apparently not the brand, it sounds like renaming the > prospective Apache project will be required and you should add that task to > your roadmap. > > Changing names in the middle of incubation is disruptive because it > requires > renaming infrastructure resources, impacting both the Apache Infrastructure > team and also the podling's developer and user communities. My suggestion > would be that immediately after the VOTE to enter incubation concludes, you > only create a dev mailing list and deal with the renaming immediately, > delaying the creation of other resources until after the renaming is > resolved. > However, the exact plan is something you can work out with your Mentors. > > Marvin Humphrey
Re: [DISCUSS] Apache Dataflow Incubator Proposal
We are developing parallel machine learning algorithms for a research project and are very interested in DataFlow. I would like to contribute to this project as well. It will be great if you can add me. Thanks, Supun... On Thu, Jan 21, 2016 at 6:29 PM, Mayank Bansal wrote: > Hi Jean, > > Nice Proposal. > > I wanted to contribute to this project. Can you please add me too? > > Thanks a lot for the help > > Thanks, > Mayank
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Jean, Nice Proposal. I wanted to contribute to this project. Can you please add me too? Thanks a lot for the help Thanks, Mayank On Thu, Jan 21, 2016 at 8:07 AM, Jean-Baptiste Onofré wrote: > Hey Alex, > > awesome: I added you on the proposal. > > Thanks, > Regards > JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Wed, Jan 20, 2016 at 3:30 PM, James Malone wrote:

> If we need to rename, we would ideally choose a new name, change the
> project name at that time, and start our refactoring with that new name. Is
> it acceptable for us to flag a name change as something we need to do as a
> near-term (1st month) item in incubation (if accepted)? If a rename is
> required I'd like to add it to our to-do roadmap but also not block our
> proposal on a renaming. I ask so we can address this concern in the best
> way possible.

That's acceptable. Project naming issues do not block entry into the Incubator; they block graduation from the Incubator.

Because "dataflow" is descriptive, it will be hard to defend as a trademark. The Wikipedia article on trademark distinctiveness explains things well:

https://en.wikipedia.org/wiki/Trademark_distinctiveness

A weak mark both increases the amount of volunteer effort that goes into dealing with infringement cases and makes bad outcomes more likely. It is not an absolute requirement that Apache projects have defensible names, but painful past experience has taught us that mishandled branding can deal surprising amounts of damage to a project community.

But beyond that, the issue of "Google Cloud Dataflow" vs. "Apache Dataflow" is a blocker. One or the other will have to be renamed, and since the software is being donated but apparently not the brand, it sounds like renaming the prospective Apache project will be required and you should add that task to your roadmap.

Changing names in the middle of incubation is disruptive because it requires renaming infrastructure resources, impacting both the Apache Infrastructure team and also the podling's developer and user communities. My suggestion would be that immediately after the VOTE to enter incubation concludes, you only create a dev mailing list and deal with the renaming immediately, delaying the creation of other resources until after the renaming is resolved.
However, the exact plan is something you can work out with your Mentors.

Marvin Humphrey
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi,

it's great to see DataFlow becoming part of the Apache ecosystem; thank you for bringing it in. I would be happy to get involved and help.

--
Alex

On Thu, Jan 21, 2016 at 8:42 PM, Jean-Baptiste Onofré wrote:

> Perfect: done, you are on the proposal.
>
> Thanks !
> Regards
> JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hey Alex,

awesome: I added you on the proposal.

Thanks,
Regards
JB

On 01/21/2016 05:03 PM, Alexander Bezzubov wrote:

Hi,

it's great to see DataFlow becoming part of the Apache ecosystem; thank you for bringing it in. I would be happy to get involved and help.

--
Alex
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Perfect: done, you are on the proposal.

Thanks !
Regards
JB

On 01/21/2016 11:55 AM, chatz wrote:

Charitha Elvitigala
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Charitha Elvitigala

On 21 January 2016 at 16:17, Jean-Baptiste Onofré wrote:

> Hi Chatz,
>
> sure, what name should I use on the proposal, Charitha ?
>
> Regards
> JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Chatz,

sure, what name should I use on the proposal, Charitha ?

Regards
JB

On 01/21/2016 11:32 AM, chatz wrote:

Hi Jean,

I’d be interested in contributing as well.

Thanks,

Chatz
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Jean,

I’d be interested in contributing as well.

Thanks,

Chatz

On 21 January 2016 at 14:22, Jean-Baptiste Onofré wrote:

> Sweet: you are on the proposal ;)
>
> Thanks !
> Regards
> JB
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Sweet: you are on the proposal ;) Thanks ! Regards JB On 01/21/2016 08:55 AM, Byung-Gon Chun wrote: This looks very interesting. I'm interested in contributing. Thanks. -Gon --- Byung-Gon Chun On Thu, Jan 21, 2016 at 1:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. 
== Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. 
The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly. The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenets. In the Dataflow model, you only need to think about four top-level concepts when constructing your data processing job: * Pipelines - The data processing job made of a series of computations including input, processing, and output * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate and output data in pipelines * PTransforms - A data processing step in a pipeline in which one or more PColle
Re: [DISCUSS] Apache Dataflow Incubator Proposal
This looks very interesting. I'm interested in contributing. Thanks. -Gon --- Byung-Gon Chun On Thu, Jan 21, 2016 at 1:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: > Hello everyone, > > Attached to this message is a proposed new project - Apache Dataflow, a > unified programming model for data processing and integration. > > The text of the proposal is included below. Additionally, the proposal is > in draft form on the wiki where we will make any required changes: > > https://wiki.apache.org/incubator/DataflowProposal > > We look forward to your feedback and input. > > Best, > > James > > > > = Apache Dataflow = > > == Abstract == > > Dataflow is an open source, unified model and set of language-specific SDKs > for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > the mechanics of large-scale batch and streaming data processing and can > run on a number of runtimes like Apache Flink, Apache Spark, and Google > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > languages, allowing users to easily implement their data integration > processes. > > == Proposal == > > Dataflow is a simple, flexible, and powerful system for distributed data > processing at any scale. Dataflow provides a unified programming model, a > software development kit to define and construct data processing pipelines, > and runners to execute Dataflow pipelines in several runtime engines, like > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used > for a variety of streaming or batch data processing goals including ETL, > stream analysis, and aggregate computation. The underlying programming > model for Dataflow provides MapReduce-like parallelism, combined with > support for powerful data windowing, and fine-grained correctness control. 
> > == Background == > > Dataflow started as a set of Google projects focused on making data > processing easier, faster, and less costly. The Dataflow model is a > successor to MapReduce, FlumeJava, and Millwheel inside Google and is > focused on providing a unified solution for batch and stream processing. > These projects on which Dataflow is based have been published in several > papers made available to the public: > > * MapReduce - http://research.google.com/archive/mapreduce.html > > * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf > > * MillWheel - http://research.google.com/pubs/pub41378.html > > Dataflow was designed from the start to provide a portable programming > layer. When you define a data processing pipeline with the Dataflow model, > you are creating a job which is capable of being processed by any number of > Dataflow processing engines. Several engines have been developed to run > Dataflow pipelines in other open source runtimes, including a Dataflow > runner for Apache Flink and Apache Spark. There is also a “direct runner”, > for execution on the developer machine (mainly for dev/debug purposes). > Another runner allows a Dataflow program to run on a managed service, > Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is > already available on GitHub, and independent from the Google Cloud Dataflow > service. Another Python SDK is currently in active development. > > In this proposal, the Dataflow SDKs, model, and a set of runners will be > submitted as an OSS project under the ASF. The runners which are a part of > this proposal include those for Spark (from Cloudera), Flink (from data > Artisans), and local development (from Google); the Google Cloud Dataflow > service runner is not included in this proposal. 
Further references to > Dataflow will refer to the Dataflow model, SDKs, and runners which are a > part of this proposal (Apache Dataflow) only. The initial submission will > contain the already-released Java SDK; Google intends to submit the Python > SDK later in the incubation process. The Google Cloud Dataflow service will > continue to be one of many runners for Dataflow, built on Google Cloud > Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will > develop against the Apache project additions, updates, and changes. Google > Cloud Dataflow will become one user of Apache Dataflow and will participate > in the project openly and publicly. > > The Dataflow programming model has been designed with simplicity, > scalability, and speed as key tenets. In the Dataflow model, you only need > to think about four top-level concepts when constructing your data > processing job: > > * Pipelines - The data processing job made of a series of computations > including input, processing, and output > > * PCollections - Bounded (or unbounded) datasets which represent the input, > intermediate and output data in pipelines > > * PTransf
Re: [DISCUSS] Apache Dataflow Incubator Proposal
I added you on the proposal. Thanks ! Regards JB On 01/21/2016 07:36 AM, Hao Chen wrote: Nice proposal, exactly matches with what we wanna do in some projects, interested to contribute. Regards, Hao On Thu, Jan 21, 2016 at 12:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. 
== Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. 
The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly. The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenets. In the Dataflow model, you only need to think about four top-level concepts when constructing your data processing job: * Pipelines - The data processing job made of a series of computations including input, processing, and output * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate and output data in pipelines * PTransforms - A data processing step in a pipeline in which one or more P
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Great: done ;) Regards JB On 01/21/2016 07:15 AM, Edward J. Yoon wrote: Pls add me to "Additional Interested Contributors" section too. :-) -- Best Regards, Edward J. Yoon -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Thursday, January 21, 2016 2:39 PM To: general@incubator.apache.org Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal Cool ! I added you on the proposal. Regards JB On 01/21/2016 12:20 AM, ksobkowiak wrote: It's a great news the project is going to move to Apache. I'd be interested in contributing too Regards Krzysztof -- View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html Sent from the Apache Incubator - General mailing list archive at Nabble.com. - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Nice proposal, exactly matches with what we wanna do in some projects, interested to contribute. Regards, Hao On Thu, Jan 21, 2016 at 12:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: > Hello everyone, > > Attached to this message is a proposed new project - Apache Dataflow, a > unified programming model for data processing and integration. > > The text of the proposal is included below. Additionally, the proposal is > in draft form on the wiki where we will make any required changes: > > https://wiki.apache.org/incubator/DataflowProposal > > We look forward to your feedback and input. > > Best, > > James > > > > = Apache Dataflow = > > == Abstract == > > Dataflow is an open source, unified model and set of language-specific SDKs > for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > the mechanics of large-scale batch and streaming data processing and can > run on a number of runtimes like Apache Flink, Apache Spark, and Google > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > languages, allowing users to easily implement their data integration > processes. > > == Proposal == > > Dataflow is a simple, flexible, and powerful system for distributed data > processing at any scale. Dataflow provides a unified programming model, a > software development kit to define and construct data processing pipelines, > and runners to execute Dataflow pipelines in several runtime engines, like > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used > for a variety of streaming or batch data processing goals including ETL, > stream analysis, and aggregate computation. The underlying programming > model for Dataflow provides MapReduce-like parallelism, combined with > support for powerful data windowing, and fine-grained correctness control. 
> > == Background == > > Dataflow started as a set of Google projects focused on making data > processing easier, faster, and less costly. The Dataflow model is a > successor to MapReduce, FlumeJava, and Millwheel inside Google and is > focused on providing a unified solution for batch and stream processing. > These projects on which Dataflow is based have been published in several > papers made available to the public: > > * MapReduce - http://research.google.com/archive/mapreduce.html > > * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf > > * MillWheel - http://research.google.com/pubs/pub41378.html > > Dataflow was designed from the start to provide a portable programming > layer. When you define a data processing pipeline with the Dataflow model, > you are creating a job which is capable of being processed by any number of > Dataflow processing engines. Several engines have been developed to run > Dataflow pipelines in other open source runtimes, including a Dataflow > runner for Apache Flink and Apache Spark. There is also a “direct runner”, > for execution on the developer machine (mainly for dev/debug purposes). > Another runner allows a Dataflow program to run on a managed service, > Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is > already available on GitHub, and independent from the Google Cloud Dataflow > service. Another Python SDK is currently in active development. > > In this proposal, the Dataflow SDKs, model, and a set of runners will be > submitted as an OSS project under the ASF. The runners which are a part of > this proposal include those for Spark (from Cloudera), Flink (from data > Artisans), and local development (from Google); the Google Cloud Dataflow > service runner is not included in this proposal. 
Further references to > Dataflow will refer to the Dataflow model, SDKs, and runners which are a > part of this proposal (Apache Dataflow) only. The initial submission will > contain the already-released Java SDK; Google intends to submit the Python > SDK later in the incubation process. The Google Cloud Dataflow service will > continue to be one of many runners for Dataflow, built on Google Cloud > Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will > develop against the Apache project additions, updates, and changes. Google > Cloud Dataflow will become one user of Apache Dataflow and will participate > in the project openly and publicly. > > The Dataflow programming model has been designed with simplicity, > scalability, and speed as key tenets. In the Dataflow model, you only need > to think about four top-level concepts when constructing your data > processing job: > > * Pipelines - The data processing job made of a series of computations > including input, processing, and output > > * PCollections - Bounded (or unbounded) datasets which represent the input, > intermediate and output data in pipelin
RE: [DISCUSS] Apache Dataflow Incubator Proposal
Pls add me to "Additional Interested Contributors" section too. :-) -- Best Regards, Edward J. Yoon -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Thursday, January 21, 2016 2:39 PM To: general@incubator.apache.org Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal Cool ! I added you on the proposal. Regards JB On 01/21/2016 12:20 AM, ksobkowiak wrote: > It's a great news the project is going to move to Apache. I'd be interested > in contributing too > > Regards > Krzysztof > > > > -- > View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html > Sent from the Apache Incubator - General mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Cool ! I added you on the proposal. Regards JB On 01/21/2016 12:20 AM, ksobkowiak wrote: It's a great news the project is going to move to Apache. I'd be interested in contributing too Regards Krzysztof -- View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html Sent from the Apache Incubator - General mailing list archive at Nabble.com. - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Hugo, I added you on the proposal Thanks Regards JB On 01/21/2016 12:54 AM, Hugo Louro wrote: Hello everyone, Very compelling proposal; congrats! I would be interested in contributing to this project from the beginning. Looking forward to it. Best, Hugo On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran < pjayachand...@hortonworks.com> wrote: Hi Jean I’d be interested in contributing as well. Thanks Prasanth Jayachandran On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote: It's a great news the project is going to move to Apache. I'd be interested in contributing too Regards Krzysztof -- View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html Sent from the Apache Incubator - General mailing list archive at Nabble.com. - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Awesome: you are in the proposal ;) Regards JB On 01/21/2016 12:55 AM, Johan Edstrom wrote: Looking forward, also interested in contributing. On Jan 20, 2016, at 4:54 PM, Hugo Louro wrote: Hello everyone, Very compelling proposal; congrats! I would be interested in contributing to this project from the beginning. Looking forward to it. Best, Hugo On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran < pjayachand...@hortonworks.com> wrote: Hi Jean I’d be interested in contributing as well. Thanks Prasanth Jayachandran On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote: It's a great news the project is going to move to Apache. I'd be interested in contributing too Regards Krzysztof -- View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html Sent from the Apache Incubator - General mailing list archive at Nabble.com. - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi, great: I added you in the proposal. Thanks ! Regards JB On 01/21/2016 12:24 AM, Prasanth Jayachandran wrote: Hi Jean I’d be interested in contributing as well. Thanks Prasanth Jayachandran On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote: It's a great news the project is going to move to Apache. I'd be interested in contributing too Regards Krzysztof -- View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html Sent from the Apache Incubator - General mailing list archive at Nabble.com. - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
RE: [DISCUSS] Apache Dataflow Incubator Proposal
Wow .. great news! -- Best Regards, Edward J. Yoon -Original Message- From: Johan Edstrom [mailto:seij...@gmail.com] Sent: Thursday, January 21, 2016 8:56 AM To: general@incubator.apache.org Subject: Re: [DISCUSS] Apache Dataflow Incubator Proposal Looking forward, also interested in contributing. > On Jan 20, 2016, at 4:54 PM, Hugo Louro wrote: > > Hello everyone, > > Very compelling proposal; congrats! I would be interested in contributing > to this project from the beginning. > > Looking forward to it. > > Best, > Hugo > > On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran < > pjayachand...@hortonworks.com> wrote: > >> Hi Jean >> >> I’d be interested in contributing as well. >> >> Thanks >> Prasanth Jayachandran >> >>> On Jan 20, 2016, at 5:20 PM, ksobkowiak >> wrote: >>> >>> It's a great news the project is going to move to Apache. I'd be >> interested >>> in contributing too >>> >>> Regards >>> Krzysztof >>> >>> >>> >>> -- >>> View this message in context: >> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html >>> Sent from the Apache Incubator - General mailing list archive at >> Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org >>> For additional commands, e-mail: general-h...@incubator.apache.org >>> >>> >> >> - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Looking forward, also interested in contributing. > On Jan 20, 2016, at 4:54 PM, Hugo Louro wrote: > > Hello everyone, > > Very compelling proposal; congrats! I would be interested in contributing > to this project from the beginning. > > Looking forward to it. > > Best, > Hugo > > On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran < > pjayachand...@hortonworks.com> wrote: > >> Hi Jean >> >> I’d be interested in contributing as well. >> >> Thanks >> Prasanth Jayachandran >> >>> On Jan 20, 2016, at 5:20 PM, ksobkowiak >> wrote: >>> >>> It's a great news the project is going to move to Apache. I'd be >> interested >>> in contributing too >>> >>> Regards >>> Krzysztof >>> >>> >>> >>> -- >>> View this message in context: >> http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html >>> Sent from the Apache Incubator - General mailing list archive at >> Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org >>> For additional commands, e-mail: general-h...@incubator.apache.org >>> >>> >> >> - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hello everyone, Very compelling proposal; congrats! I would be interested in contributing to this project from the beginning. Looking forward to it. Best, Hugo On Wed, Jan 20, 2016 at 3:24 PM, Prasanth Jayachandran < pjayachand...@hortonworks.com> wrote: > Hi Jean > > I’d be interested in contributing as well. > > Thanks > Prasanth Jayachandran > > > On Jan 20, 2016, at 5:20 PM, ksobkowiak > wrote: > > > > It's a great news the project is going to move to Apache. I'd be > interested > > in contributing too > > > > Regards > > Krzysztof > > > > > > > > -- > > View this message in context: > http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html > > Sent from the Apache Incubator - General mailing list archive at > Nabble.com. > > > > - > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > > For additional commands, e-mail: general-h...@incubator.apache.org > > > > > >
Re: [DISCUSS] Apache Dataflow Incubator Proposal
> > > I don't see anything in the proposal about Google ceasing the use of the > brand > "Google Cloud Dataflow". Yet the co-existence of "Google Cloud Dataflow" > and > "Apache Dataflow" would conflict with Apache requirements for vendor > neutrality and project independence. > > The issue seems similar to the recent proposal to incubate "Apache > OpenMiracl" > while allowing the "Miracl" company to continue distribution of the > "Miracl" > project. That situation was resolved by renaming the Apache project to > "Milagro", allowing the Miracl company to continue benefitting from the > brand > they had invested in so heavily. > Apologies for my delay in responding to the feedback about naming! We anticipated there might be some concerns about the naming. The project members also want to confront those concerns head-on so any issues related to naming don't take away from the technical merit of the proposal. We're open to coming up with a new name and renaming the proposed project if it's prudent or required. To that end, I have a question about the order of operations. If we need to rename, we would ideally choose a new name, change the project name at that time, and start our refactoring with that new name. Is it acceptable for us to flag a name change as something we need to do as a near-term (1st month) item in incubation (if accepted)? If a rename is required I'd like to add it to our to-do roadmap but also not block our proposal on a renaming. I ask so we can address this concern in the best way possible. > http://markmail.org/message/tpiphl55rcyezcvd > > Marvin Humphrey > > - > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > >
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Jean I’d be interested in contributing as well. Thanks Prasanth Jayachandran > On Jan 20, 2016, at 5:20 PM, ksobkowiak wrote: > > It's a great news the project is going to move to Apache. I'd be interested > in contributing too > > Regards > Krzysztof > > > > -- > View this message in context: > http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html > Sent from the Apache Incubator - General mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > >
Re: [DISCUSS] Apache Dataflow Incubator Proposal
It's great news that the project is going to move to Apache. I'd be interested in contributing too.

Regards
Krzysztof

--
View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48025.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Cool: you are on the proposal

Regards
JB

On 01/20/2016 09:52 PM, Vaibhav Gumashta wrote:

Hi Jean,

I'd like to contribute as well.

Thanks,
-Vaibhav

On 1/20/16, 11:19 AM, "Jean-Baptiste Onofré" wrote:

Hey James,

you are on the proposal ;)

Thanks !
Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy to contribute.

On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote:

Hi Sean,

It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?

As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James

= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs).
Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also provides DSLs in different languages, allowing users to easily implement their data integration processes.

== Proposal ==

Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control.

== Background ==

Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and MillWheel inside Google and is focused on providing a unified solution for batch and stream processing. The projects on which Dataflow is based have been described in several publicly available papers:

 * MapReduce - http://research.google.com/archive/mapreduce.html
 * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
 * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
 * MillWheel - http://research.google.com/pubs/pub41378.html

Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines.
Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Great: you are on the proposal.

Regards
JB

On 01/20/2016 09:00 PM, Joe Witt wrote:

Hello

This is a very interesting proposal and concept. I'd like to contribute.

Thanks
Joe

On Wed, Jan 20, 2016 at 2:50 PM, James Carman wrote:

Of course! I'd be happy to help

On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré wrote:

Hi James,

Can I add you to the proposal?

Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy to contribute.

On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote:

Hi Sean,

It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?

As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Jean,

I'd like to contribute as well.

Thanks,
-Vaibhav

On 1/20/16, 11:19 AM, "Jean-Baptiste Onofré" wrote:

> Hey James,
>
> you are on the proposal ;)
>
> Thanks !
> Regards
> JB
>
> On 01/20/2016 07:20 PM, James Carman wrote:
>> Well, I for one would be very interested in this project and would be happy to contribute.
>>
>> On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote:
>>
>>> Hi Sean,
>>>
>>> It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/20/2016 05:55 PM, Sean Busbey wrote:
>>>> Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?
>>>>
>>>> As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.
>>>>
>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
>>>>> Hello everyone,
>>>>>
>>>>> Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.
>>>>>
>>>>> The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:
>>>>>
>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>
>>>>> We look forward to your feedback and input.
>>>>>
>>>>> Best,
>>>>>
>>>>> James
>>>>>
>>>>> = Apache Dataflow =
>>>>>
>>>>> == Abstract ==
>>>>>
>>>>> Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs).
Re: [DISCUSS] Apache Dataflow Incubator Proposal
> On Jan 20, 2016, at 8:32 AM, James Malone wrote:
>
> The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenants.

s/tenants/tenets?

david jencks
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hello

This is a very interesting proposal and concept. I'd like to contribute.

Thanks
Joe

On Wed, Jan 20, 2016 at 2:50 PM, James Carman wrote:

> Of course! I'd be happy to help
>
> On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré wrote:
>
>> Hi James,
>>
>> Can I add you to the proposal?
>>
>> Regards
>> JB
>>
>> On 01/20/2016 07:20 PM, James Carman wrote:
>>> Well, I for one would be very interested in this project and would be happy to contribute.
>>>
>>> On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/20/2016 05:55 PM, Sean Busbey wrote:
>>>>> Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?
>>>>>
>>>>> As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.
>>>>>
>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.
>>>>>>
>>>>>> The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:
>>>>>>
>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>
>>>>>> We look forward to your feedback and input.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Of course! I'd be happy to help

On Wed, Jan 20, 2016 at 2:02 PM Jean-Baptiste Onofré wrote:

> Hi James,
>
> Can I add you to the proposal?
>
> Regards
> JB
>
> On 01/20/2016 07:20 PM, James Carman wrote:
>> Well, I for one would be very interested in this project and would be happy to contribute.
>>
>> On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote:
>>
>>> Hi Sean,
>>>
>>> It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/20/2016 05:55 PM, Sean Busbey wrote:
>>>> Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?
>>>>
>>>> As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.
>>>>
>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:
>>>>> Hello everyone,
>>>>>
>>>>> Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.
>>>>>
>>>>> The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:
>>>>>
>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>
>>>>> We look forward to your feedback and input.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Good proposal. I look forward to contributing to this project.

--
View this message in context: http://apache-incubator-general.996316.n3.nabble.com/DISCUSS-Apache-Dataflow-Incubator-Proposal-tp47985p48014.html
Sent from the Apache Incubator - General mailing list archive at Nabble.com.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
You are on the proposal ;)

Thanks !
Regards
JB

On 01/20/2016 08:04 PM, P. Taylor Goetz wrote:

Nice proposal. I'd be interested in contributing as well.

I'm about at my mentor limit with projects, but I'd be willing to contribute in other/similar ways.

-Taylor

On Jan 20, 2016, at 12:46 PM, Jean-Baptiste Onofré wrote:

Great, I'll add you to the initial committer list then ;)

I quickly discussed this with James; we are going to create a section for additional people, as proposed by Sean.

Thanks !
Regards
JB

On 01/20/2016 06:33 PM, Debo Dutta (dedutta) wrote:

Hi JB

Would love to join now.

regards
debo

On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré" wrote:

Hi Debo,

Awesome: do you want to join now (in the initial committer list) or once we are in incubation?

Let me know, I can update the proposal.

Regards
JB

On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote:

+1

Proposal looks good. Also a small section on relationships with Apache Storm and Apache Samza would be great.

I would like to sign up, to help/contribute.

debo

On 1/20/16, 8:55 AM, "Sean Busbey" wrote:

Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?

As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.

The text of the proposal is included below.
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hey James,

you are on the proposal ;)

Thanks !
Regards
JB

On 01/20/2016 07:20 PM, James Carman wrote:

Well, I for one would be very interested in this project and would be happy to contribute.

On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote:

Hi Sean,

It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section.

Regards
JB

On 01/20/2016 05:55 PM, Sean Busbey wrote:

Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption?

As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list.

On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid> wrote:

Hello everyone,

Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration.

The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes:

https://wiki.apache.org/incubator/DataflowProposal

We look forward to your feedback and input.

Best,

James

= Apache Dataflow =

== Abstract ==

Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service).
Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). 
Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow p
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Nice proposal. I’d be interested in contributing as well. I’m about at my mentor limit with projects, but I’d be willing to contribute in other/similar ways. -Taylor > On Jan 20, 2016, at 12:46 PM, Jean-Baptiste Onofré wrote: > > Great, I add you in the initial committer list then ;) > > I quickly discussed with James, we gonna create a section for additional > people as proposed by Sean. > > Thanks ! > Regards > JB > > On 01/20/2016 06:33 PM, Debo Dutta (dedutta) wrote: >> Hi JB >> >> Would love to join now. >> >> regards >> debo >> >> On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré" wrote: >> >>> Hi Debo, >>> >>> Awesome: do you want to join now (in the initial committer list) and >>> once we are in the incubation ? >>> >>> Let me know, I can update the proposal. >>> >>> Regards >>> JB >>> >>> On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote: +1 Proposal looks good. Also a small section on relationships with Apache Storm and Apache Samza would be great. I would like to sign up, to help/contribute. debo On 1/20/16, 8:55 AM, "Sean Busbey" wrote: > Great proposal. I like that your proposal includes a well presented > roadmap, but I don't see any goals that directly address building a > larger > community. Y'all have any ideas around outreach that will help with > adoption? > > As a start, I recommend y'all add a section to the proposal on the wiki > page for "Additional Interested Contributors" so that folks who want to > sign up to participate in the project can do so without requesting > additions to the initial committer list. > > On Wed, Jan 20, 2016 at 10:32 AM, James Malone < > jamesmal...@google.com.invalid> wrote: > >> Hello everyone, >> >> Attached to this message is a proposed new project - Apache Dataflow, >> a >> unified programming model for data processing and integration. >> >> The text of the proposal is included below. 
Additionally, the proposal >> is >> in draft form on the wiki where we will make any required changes: >> >> https://wiki.apache.org/incubator/DataflowProposal >> >> We look forward to your feedback and input. >> >> Best, >> >> James >> >> >> >> = Apache Dataflow = >> >> == Abstract == >> >> Dataflow is an open source, unified model and set of language-specific >> SDKs >> for defining and executing data processing workflows, and also data >> ingestion and integration flows, supporting Enterprise Integration >> Patterns >> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines >> simplify >> the mechanics of large-scale batch and streaming data processing and >> can >> run on a number of runtimes like Apache Flink, Apache Spark, and >> Google >> Cloud Dataflow (a cloud service). Dataflow also brings DSL in >> different >> languages, allowing users to easily implement their data integration >> processes. >> >> == Proposal == >> >> Dataflow is a simple, flexible, and powerful system for distributed >> data >> processing at any scale. Dataflow provides a unified programming >> model, >> a >> software development kit to define and construct data processing >> pipelines, >> and runners to execute Dataflow pipelines in several runtime engines, >> like >> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be >> used >> for a variety of streaming or batch data processing goals including >> ETL, >> stream analysis, and aggregate computation. The underlying programming >> model for Dataflow provides MapReduce-like parallelism, combined with >> support for powerful data windowing, and fine-grained correctness >> control. >> >> == Background == >> >> Dataflow started as a set of Google projects focused on making data >> processing easier, faster, and less costly. The Dataflow model is a >> successor to MapReduce, FlumeJava, and Millwheel inside Google and is >> focused on providing a unified solution for batch and stream >> processing. 
>> These projects on which Dataflow is based have been published in >> several >> papers made available to the public: >> >> * MapReduce - http://research.google.com/archive/mapreduce.html >> >> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >> >> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf >> >> * MillWheel - http://research.google.com/pubs/pub41378.html >> >> Dataflow was designed from the start to provide a portable programming >> layer. When you define a data processing pipeline with the Dataflow >> model, >> you are creating a job which is capable of being processed by any >> number of >> Dataflow pro
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi James, Can I add you to the proposal ? Regards JB On 01/20/2016 07:20 PM, James Carman wrote: Well, I for one would be very interested in this project and would be happy to contribute. On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote: Hi Sean, It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section. Regards JB On 01/20/2016 05:55 PM, Sean Busbey wrote: Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). 
Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). 
Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipel
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Well, I for one would be very interested in this project and would be happy to contribute. On Wed, Jan 20, 2016 at 12:09 PM Jean-Baptiste Onofré wrote: > Hi Sean, > > It's a fair point, but not present in most of the proposals. It's > something that we can address in the "Community" section. > > Regards > JB > > On 01/20/2016 05:55 PM, Sean Busbey wrote: > > Great proposal. I like that your proposal includes a well presented > > roadmap, but I don't see any goals that directly address building a > larger > > community. Y'all have any ideas around outreach that will help with > > adoption? > > > > As a start, I recommend y'all add a section to the proposal on the wiki > > page for "Additional Interested Contributors" so that folks who want to > > sign up to participate in the project can do so without requesting > > additions to the initial committer list. > > > > On Wed, Jan 20, 2016 at 10:32 AM, James Malone < > > jamesmal...@google.com.invalid> wrote: > > > >> Hello everyone, > >> > >> Attached to this message is a proposed new project - Apache Dataflow, a > >> unified programming model for data processing and integration. > >> > >> The text of the proposal is included below. Additionally, the proposal > is > >> in draft form on the wiki where we will make any required changes: > >> > >> https://wiki.apache.org/incubator/DataflowProposal > >> > >> We look forward to your feedback and input. > >> > >> Best, > >> > >> James > >> > >> > >> > >> = Apache Dataflow = > >> > >> == Abstract == > >> > >> Dataflow is an open source, unified model and set of language-specific > SDKs > >> for defining and executing data processing workflows, and also data > >> ingestion and integration flows, supporting Enterprise Integration > Patterns > >> (EIPs) and Domain Specific Languages (DSLs). 
Dataflow pipelines simplify > >> the mechanics of large-scale batch and streaming data processing and can > >> run on a number of runtimes like Apache Flink, Apache Spark, and Google > >> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > >> languages, allowing users to easily implement their data integration > >> processes. > >> > >> == Proposal == > >> > >> Dataflow is a simple, flexible, and powerful system for distributed data > >> processing at any scale. Dataflow provides a unified programming model, > a > >> software development kit to define and construct data processing > pipelines, > >> and runners to execute Dataflow pipelines in several runtime engines, > like > >> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be > used > >> for a variety of streaming or batch data processing goals including ETL, > >> stream analysis, and aggregate computation. The underlying programming > >> model for Dataflow provides MapReduce-like parallelism, combined with > >> support for powerful data windowing, and fine-grained correctness > control. > >> > >> == Background == > >> > >> Dataflow started as a set of Google projects focused on making data > >> processing easier, faster, and less costly. The Dataflow model is a > >> successor to MapReduce, FlumeJava, and Millwheel inside Google and is > >> focused on providing a unified solution for batch and stream processing. > >> These projects on which Dataflow is based have been published in several > >> papers made available to the public: > >> > >> * MapReduce - http://research.google.com/archive/mapreduce.html > >> > >> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > >> > >> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf > >> > >> * MillWheel - http://research.google.com/pubs/pub41378.html > >> > >> Dataflow was designed from the start to provide a portable programming > >> layer. 
When you define a data processing pipeline with the Dataflow > model, > >> you are creating a job which is capable of being processed by any > number of > >> Dataflow processing engines. Several engines have been developed to run > >> Dataflow pipelines in other open source runtimes, including a Dataflow > >> runner for Apache Flink and Apache Spark. There is also a “direct > runner”, > >> for execution on the developer machine (mainly for dev/debug purposes). > >> Another runner allows a Dataflow program to run on a managed service, > >> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK > is > >> already available on GitHub, and independent from the Google Cloud > Dataflow > >> service. Another Python SDK is currently in active development. > >> > >> In this proposal, the Dataflow SDKs, model, and a set of runners will be > >> submitted as an OSS project under the ASF. The runners which are a part > of > >> this proposal include those for Spark (from Cloudera), Flink (from data > >> Artisans), and local development (from Google); the Google Cloud > Dataflow > >> service runner is not included in this proposal. Further references to > >> Dataflow will refer to the Dataflow model, SDKs, and runners which are a
Re: [DISCUSS] Apache Dataflow Incubator Proposal
This is a great proposal. I have been working with Apache Flink for a while, so I would love to help with this project. - Henry On Wed, Jan 20, 2016 at 8:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: > Hello everyone, > > Attached to this message is a proposed new project - Apache Dataflow, a > unified programming model for data processing and integration. > > The text of the proposal is included below. Additionally, the proposal is > in draft form on the wiki where we will make any required changes: > > https://wiki.apache.org/incubator/DataflowProposal > > We look forward to your feedback and input. > > Best, > > James > > > > = Apache Dataflow = > > == Abstract == > > Dataflow is an open source, unified model and set of language-specific SDKs > for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > the mechanics of large-scale batch and streaming data processing and can > run on a number of runtimes like Apache Flink, Apache Spark, and Google > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > languages, allowing users to easily implement their data integration > processes. > > == Proposal == > > Dataflow is a simple, flexible, and powerful system for distributed data > processing at any scale. Dataflow provides a unified programming model, a > software development kit to define and construct data processing pipelines, > and runners to execute Dataflow pipelines in several runtime engines, like > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used > for a variety of streaming or batch data processing goals including ETL, > stream analysis, and aggregate computation. The underlying programming > model for Dataflow provides MapReduce-like parallelism, combined with > support for powerful data windowing, and fine-grained correctness control. 
> > == Background == > > Dataflow started as a set of Google projects focused on making data > processing easier, faster, and less costly. The Dataflow model is a > successor to MapReduce, FlumeJava, and Millwheel inside Google and is > focused on providing a unified solution for batch and stream processing. > These projects on which Dataflow is based have been published in several > papers made available to the public: > > * MapReduce - http://research.google.com/archive/mapreduce.html > > * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf > > * MillWheel - http://research.google.com/pubs/pub41378.html > > Dataflow was designed from the start to provide a portable programming > layer. When you define a data processing pipeline with the Dataflow model, > you are creating a job which is capable of being processed by any number of > Dataflow processing engines. Several engines have been developed to run > Dataflow pipelines in other open source runtimes, including a Dataflow > runner for Apache Flink and Apache Spark. There is also a “direct runner”, > for execution on the developer machine (mainly for dev/debug purposes). > Another runner allows a Dataflow program to run on a managed service, > Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is > already available on GitHub, and independent from the Google Cloud Dataflow > service. Another Python SDK is currently in active development. > > In this proposal, the Dataflow SDKs, model, and a set of runners will be > submitted as an OSS project under the ASF. The runners which are a part of > this proposal include those for Spark (from Cloudera), Flink (from data > Artisans), and local development (from Google); the Google Cloud Dataflow > service runner is not included in this proposal. 
Further references to > Dataflow will refer to the Dataflow model, SDKs, and runners which are a > part of this proposal (Apache Dataflow) only. The initial submission will > contain the already-released Java SDK; Google intends to submit the Python > SDK later in the incubation process. The Google Cloud Dataflow service will > continue to be one of many runners for Dataflow, built on Google Cloud > Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will > develop against the Apache project additions, updates, and changes. Google > Cloud Dataflow will become one user of Apache Dataflow and will participate > in the project openly and publicly. > > The Dataflow programming model has been designed with simplicity, > scalability, and speed as key tenets. In the Dataflow model, you only need > to think about four top-level concepts when constructing your data > processing job: > > * Pipelines - The data processing job made of a series of computations > including input, processing, and output > > * PCollections - Bounded (or unbounded) datasets which represent the input, > intermediate and output data in pipeli
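The portability idea in the quoted proposal (a pipeline is defined once and can then be handed to any number of runners) can be illustrated with a small toy sketch. This is not the Dataflow SDK: `Pipeline`, `map_step`, `filter_step`, `direct_runner`, and `chunked_runner` are hypothetical names invented for this illustration, and the "runners" here are plain in-process Python functions rather than Flink, Spark, or Cloud Dataflow.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Pipeline:
    """Hypothetical stand-in for a pipeline: an ordered list of element-wise steps."""
    steps: List[Callable[[Iterable], Iterable]] = field(default_factory=list)

    def apply(self, step: Callable[[Iterable], Iterable]) -> "Pipeline":
        self.steps.append(step)
        return self  # returning self allows chained .apply() calls


def map_step(fn):
    """Build a step that transforms each element independently."""
    return lambda elements: (fn(e) for e in elements)


def filter_step(pred):
    """Build a step that keeps only elements matching the predicate."""
    return lambda elements: (e for e in elements if pred(e))


def direct_runner(pipeline: Pipeline, source: Iterable) -> list:
    """Toy local runner: pushes the whole input through every step in-process."""
    data = source
    for step in pipeline.steps:
        data = step(data)
    return list(data)


def chunked_runner(pipeline: Pipeline, source: Iterable, chunk_size: int = 2) -> list:
    """Toy second engine: runs the same steps over fixed-size chunks of the input.

    This is only valid because every step in this sketch is element-wise.
    """
    items = list(source)
    out = []
    for start in range(0, len(items), chunk_size):
        out.extend(direct_runner(pipeline, items[start:start + chunk_size]))
    return out


# The pipeline is defined once, with no reference to any runner.
pipeline = (
    Pipeline()
    .apply(map_step(lambda x: x * 10))
    .apply(filter_step(lambda x: x > 10))
)
records = [1, 2, 3, 4, 5]

print(direct_runner(pipeline, records))   # [20, 30, 40, 50]
print(chunked_runner(pipeline, records))  # [20, 30, 40, 50]
```

Both toy runners produce the same result from the same pipeline definition, which is the property the proposal attributes to the real model. The windowing and correctness controls mentioned in the proposal are outside the scope of this sketch.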
Re: [DISCUSS] Apache Dataflow Incubator Proposal
As suggested, I added "Additional Interested Contributors" section, and already added Debo. Thanks ! Regards JB On 01/20/2016 05:55 PM, Sean Busbey wrote: Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. 
Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. 
Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly. The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenant
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Yes, you are right, we also know that other companies use dataflow wording (it's the case at Hortonworks for instance). We are going to start a thread to propose alternative names. Regards JB On 01/20/2016 06:41 PM, Gregory Chase wrote: This is also a similarly named Spring project: http://cloud.spring.io/spring-cloud-dataflow/ On Wed, Jan 20, 2016 at 9:40 AM, Marvin Humphrey wrote: On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré wrote: We're proposing Apache Dataflow naming because Google Cloud Dataflow is an already known name and "brand". I don't see anything in the proposal about Google ceasing the use of the brand "Google Cloud Dataflow". Yet the co-existence of "Google Cloud Dataflow" and "Apache Dataflow" would conflict with Apache requirements for vendor neutrality and project independence. The issue seems similar to the recent proposal to incubate "Apache OpenMiracl" while allowing the "Miracl" company to continue distribution of the "Miracl" project. That situation was resolved by renaming the Apache project to "Milagro", allowing the Miracl company to continue benefitting from the brand they had invested in so heavily. http://markmail.org/message/tpiphl55rcyezcvd Marvin Humphrey -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Great, I add you in the initial committer list then ;) I quickly discussed with James, we gonna create a section for additional people as proposed by Sean. Thanks ! Regards JB On 01/20/2016 06:33 PM, Debo Dutta (dedutta) wrote: Hi JB Would love to join now. regards debo On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré" wrote: Hi Debo, Awesome: do you want to join now (in the initial committer list) and once we are in the incubation ? Let me know, I can update the proposal. Regards JB On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote: +1 Proposal looks good. Also a small section on relationships with Apache Storm and Apache Samza would be great. I would like to sign up, to help/contribute. debo On 1/20/16, 8:55 AM, "Sean Busbey" wrote: Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). 
Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. 
Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The i
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Good point Marvin. The Google Dataflow SDK will "disappear" in favor of Apache Dataflow, but you are right, the overall Dataflow "brand" will stay at Google. Let me double check with James and the team. Regards JB On 01/20/2016 06:40 PM, Marvin Humphrey wrote: On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré wrote: We're proposing Apache Dataflow naming because Google Cloud Dataflow is an already known name and "brand". I don't see anything in the proposal about Google ceasing the use of the brand "Google Cloud Dataflow". Yet the co-existence of "Google Cloud Dataflow" and "Apache Dataflow" would conflict with Apache requirements for vendor neutrality and project independence. The issue seems similar to the recent proposal to incubate "Apache OpenMiracl" while allowing the "Miracl" company to continue distribution of the "Miracl" project. That situation was resolved by renaming the Apache project to "Milagro", allowing the Miracl company to continue benefitting from the brand they had invested in so heavily. http://markmail.org/message/tpiphl55rcyezcvd Marvin Humphrey -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: [DISCUSS] Apache Dataflow Incubator Proposal
There is also a similarly named Spring project: http://cloud.spring.io/spring-cloud-dataflow/ On Wed, Jan 20, 2016 at 9:40 AM, Marvin Humphrey wrote: > On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré > wrote: > > > We're proposing Apache Dataflow naming because Google Cloud Dataflow is > an > > already known name and "brand". > > I don't see anything in the proposal about Google ceasing the use of the > brand > "Google Cloud Dataflow". Yet the co-existence of "Google Cloud Dataflow" > and > "Apache Dataflow" would conflict with Apache requirements for vendor > neutrality and project independence. > > The issue seems similar to the recent proposal to incubate "Apache > OpenMiracl" > while allowing the "Miracl" company to continue distribution of the > "Miracl" > project. That situation was resolved by renaming the Apache project to > "Milagro", allowing the Miracl company to continue benefitting from the > brand > they had invested in so heavily. > > http://markmail.org/message/tpiphl55rcyezcvd > > Marvin Humphrey > -- Greg Chase Director of Big Data Communities http://www.pivotal.io/big-data Pivotal Software http://www.pivotal.io/ 650-215-0477 @GregChase Blog: http://geekmarketing.biz/
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Wed, Jan 20, 2016 at 9:17 AM, Jean-Baptiste Onofré wrote: > We're proposing Apache Dataflow naming because Google Cloud Dataflow is an > already known name and "brand". I don't see anything in the proposal about Google ceasing the use of the brand "Google Cloud Dataflow". Yet the co-existence of "Google Cloud Dataflow" and "Apache Dataflow" would conflict with Apache requirements for vendor neutrality and project independence. The issue seems similar to the recent proposal to incubate "Apache OpenMiracl" while allowing the "Miracl" company to continue distribution of the "Miracl" project. That situation was resolved by renaming the Apache project to "Milagro", allowing the Miracl company to continue benefitting from the brand they had invested in so heavily. http://markmail.org/message/tpiphl55rcyezcvd Marvin Humphrey
Re: [DISCUSS] Apache Dataflow Incubator Proposal
> Great proposal. I like that your proposal includes a well presented > roadmap, but I don't see any goals that directly address building a larger > community. Y'all have any ideas around outreach that will help with > adoption? > Thank you and fair point. We have a few additional ideas which we can put into the Community section. > > As a start, I recommend y'all add a section to the proposal on the wiki > page for "Additional Interested Contributors" so that folks who want to > sign up to participate in the project can do so without requesting > additions to the initial committer list. > > This is a great idea and I think it makes a lot of sense to add an "Additional Interested Contributors" section to the proposal. > On Wed, Jan 20, 2016 at 10:32 AM, James Malone < > jamesmal...@google.com.invalid> wrote: > > > Hello everyone, > > > > Attached to this message is a proposed new project - Apache Dataflow, a > > unified programming model for data processing and integration. > > > > The text of the proposal is included below. Additionally, the proposal is > > in draft form on the wiki where we will make any required changes: > > > > https://wiki.apache.org/incubator/DataflowProposal > > > > We look forward to your feedback and input. > > > > Best, > > > > James > > > > > > > > = Apache Dataflow = > > > > == Abstract == > > > > Dataflow is an open source, unified model and set of language-specific > SDKs > > for defining and executing data processing workflows, and also data > > ingestion and integration flows, supporting Enterprise Integration > Patterns > > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > > the mechanics of large-scale batch and streaming data processing and can > > run on a number of runtimes like Apache Flink, Apache Spark, and Google > > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > > languages, allowing users to easily implement their data integration > > processes. 
> > > > == Proposal == > > > > Dataflow is a simple, flexible, and powerful system for distributed data > > processing at any scale. Dataflow provides a unified programming model, a > > software development kit to define and construct data processing > pipelines, > > and runners to execute Dataflow pipelines in several runtime engines, > like > > Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be > used > > for a variety of streaming or batch data processing goals including ETL, > > stream analysis, and aggregate computation. The underlying programming > > model for Dataflow provides MapReduce-like parallelism, combined with > > support for powerful data windowing, and fine-grained correctness > control. > > > > == Background == > > > > Dataflow started as a set of Google projects focused on making data > > processing easier, faster, and less costly. The Dataflow model is a > > successor to MapReduce, FlumeJava, and Millwheel inside Google and is > > focused on providing a unified solution for batch and stream processing. > > These projects on which Dataflow is based have been published in several > > papers made available to the public: > > > > * MapReduce - http://research.google.com/archive/mapreduce.html > > > > * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > > > > * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf > > > > * MillWheel - http://research.google.com/pubs/pub41378.html > > > > Dataflow was designed from the start to provide a portable programming > > layer. When you define a data processing pipeline with the Dataflow > model, > > you are creating a job which is capable of being processed by any number > of > > Dataflow processing engines. Several engines have been developed to run > > Dataflow pipelines in other open source runtimes, including a Dataflow > > runner for Apache Flink and Apache Spark. 
There is also a “direct > runner”, > > for execution on the developer machine (mainly for dev/debug purposes). > > Another runner allows a Dataflow program to run on a managed service, > > Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is > > already available on GitHub, and independent from the Google Cloud > Dataflow > > service. Another Python SDK is currently in active development. > > > > In this proposal, the Dataflow SDKs, model, and a set of runners will be > > submitted as an OSS project under the ASF. The runners which are a part > of > > this proposal include those for Spark (from Cloudera), Flink (from data > > Artisans), and local development (from Google); the Google Cloud Dataflow > > service runner is not included in this proposal. Further references to > > Dataflow will refer to the Dataflow model, SDKs, and runners which are a > > part of this proposal (Apache Dataflow) only. The initial submission will > > contain the already-released Java SDK; Google intends to submit the > Python > > SDK later in the incubation process. The Google Cloud Dataflow service >
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi JB Would love to join now. regards debo On 1/20/16, 9:31 AM, "Jean-Baptiste Onofré" wrote: >Hi Debo, > >Awesome: do you want to join now (in the initial committer list) and >once we are in the incubation ? > >Let me know, I can update the proposal. > >Regards >JB > >On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote: >> +1 >> >> Proposal looks good. Also a small section on relationships with Apache >> Storm and Apache Samza would be great. >> >> I would like to sign up, to help/contribute. >> >> debo >> >> On 1/20/16, 8:55 AM, "Sean Busbey" wrote: >> >>> Great proposal. I like that your proposal includes a well presented >>> roadmap, but I don't see any goals that directly address building a >>>larger >>> community. Y'all have any ideas around outreach that will help with >>> adoption? >>> >>> As a start, I recommend y'all add a section to the proposal on the wiki >>> page for "Additional Interested Contributors" so that folks who want to >>> sign up to participate in the project can do so without requesting >>> additions to the initial committer list. >>> >>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone < >>> jamesmal...@google.com.invalid> wrote: >>> Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). 
Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. 
Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of thi
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Debo, Awesome: do you want to join now (in the initial committer list) and once we are in the incubation ? Let me know, I can update the proposal. Regards JB On 01/20/2016 06:23 PM, Debo Dutta (dedutta) wrote: +1 Proposal looks good. Also a small section on relationships with Apache Storm and Apache Samza would be great. I would like to sign up, to help/contribute. debo On 1/20/16, 8:55 AM, "Sean Busbey" wrote: Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). 
Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). 
Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against t
Re: [DISCUSS] Apache Dataflow Incubator Proposal
+1 Proposal looks good. Also a small section on relationships with Apache Storm and Apache Samza would be great. I would like to sign up, to help/contribute. debo On 1/20/16, 8:55 AM, "Sean Busbey" wrote: >Great proposal. I like that your proposal includes a well presented >roadmap, but I don't see any goals that directly address building a larger >community. Y'all have any ideas around outreach that will help with >adoption? > >As a start, I recommend y'all add a section to the proposal on the wiki >page for "Additional Interested Contributors" so that folks who want to >sign up to participate in the project can do so without requesting >additions to the initial committer list. > >On Wed, Jan 20, 2016 at 10:32 AM, James Malone < >jamesmal...@google.com.invalid> wrote: > >> Hello everyone, >> >> Attached to this message is a proposed new project - Apache Dataflow, a >> unified programming model for data processing and integration. >> >> The text of the proposal is included below. Additionally, the proposal >>is >> in draft form on the wiki where we will make any required changes: >> >> https://wiki.apache.org/incubator/DataflowProposal >> >> We look forward to your feedback and input. >> >> Best, >> >> James >> >> >> >> = Apache Dataflow = >> >> == Abstract == >> >> Dataflow is an open source, unified model and set of language-specific >>SDKs >> for defining and executing data processing workflows, and also data >> ingestion and integration flows, supporting Enterprise Integration >>Patterns >> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify >> the mechanics of large-scale batch and streaming data processing and can >> run on a number of runtimes like Apache Flink, Apache Spark, and Google >> Cloud Dataflow (a cloud service). Dataflow also brings DSL in different >> languages, allowing users to easily implement their data integration >> processes. 
>> >> == Proposal == >> >> Dataflow is a simple, flexible, and powerful system for distributed data >> processing at any scale. Dataflow provides a unified programming model, >>a >> software development kit to define and construct data processing >>pipelines, >> and runners to execute Dataflow pipelines in several runtime engines, >>like >> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be >>used >> for a variety of streaming or batch data processing goals including ETL, >> stream analysis, and aggregate computation. The underlying programming >> model for Dataflow provides MapReduce-like parallelism, combined with >> support for powerful data windowing, and fine-grained correctness >>control. >> >> == Background == >> >> Dataflow started as a set of Google projects focused on making data >> processing easier, faster, and less costly. The Dataflow model is a >> successor to MapReduce, FlumeJava, and Millwheel inside Google and is >> focused on providing a unified solution for batch and stream processing. >> These projects on which Dataflow is based have been published in several >> papers made available to the public: >> >> * MapReduce - http://research.google.com/archive/mapreduce.html >> >> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf >> >> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf >> >> * MillWheel - http://research.google.com/pubs/pub41378.html >> >> Dataflow was designed from the start to provide a portable programming >> layer. When you define a data processing pipeline with the Dataflow >>model, >> you are creating a job which is capable of being processed by any >>number of >> Dataflow processing engines. Several engines have been developed to run >> Dataflow pipelines in other open source runtimes, including a Dataflow >> runner for Apache Flink and Apache Spark. There is also a “direct >>runner”, >> for execution on the developer machine (mainly for dev/debug purposes). 
>> Another runner allows a Dataflow program to run on a managed service, >> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK >>is >> already available on GitHub, and independent from the Google Cloud >>Dataflow >> service. Another Python SDK is currently in active development. >> >> In this proposal, the Dataflow SDKs, model, and a set of runners will be >> submitted as an OSS project under the ASF. The runners which are a part >>of >> this proposal include those for Spark (from Cloudera), Flink (from data >> Artisans), and local development (from Google); the Google Cloud >>Dataflow >> service runner is not included in this proposal. Further references to >> Dataflow will refer to the Dataflow model, SDKs, and runners which are a >> part of this proposal (Apache Dataflow) only. The initial submission >>will >> contain the already-released Java SDK; Google intends to submit the >>Python >> SDK later in the incubation process. The Google Cloud Dataflow service >>will >> continue to be one of many runners for Dataflow, built on Google Cloud >> Platform, to run Dataflow pip
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Marvin, you raise a point that we had somewhat anticipated ;) We're proposing Apache Dataflow naming because Google Cloud Dataflow is an already known name and "brand". The naming is not directly related to dataflow programming: it's more representative of the data flowing inside a pipeline. Regards JB On 01/20/2016 06:12 PM, Marvin Humphrey wrote: On Wed, Jan 20, 2016 at 8:32 AM, James Malone wrote: == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. In general this seems like an excellent project and a well-thought-through and viable proposal -- I certainly anticipate that it will be accepted for incubation in one form or another. However, how does this "Dataflow" project relate to the programming paradigm of "dataflow programming"? https://en.wikipedia.org/wiki/Dataflow_programming Besides the potential for confusion, it seems like the proposed project name would be tough to defend as a trademark. With respect to trademark rights, Google does not hold a trademark on the phrase “Dataflow.” Based on feedback and guidance we receive during the incubation process, we are open to renaming the project if necessary for trademark or other concerns. If a renaming is going to happen, there are advantages to renaming sooner rather than later and sparing the community additional disruption. 
Marvin Humphrey -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Re: [DISCUSS] Apache Dataflow Incubator Proposal
On Wed, Jan 20, 2016 at 8:32 AM, James Malone wrote: > == Abstract == > > Dataflow is an open source, unified model and set of language-specific SDKs > for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > the mechanics of large-scale batch and streaming data processing and can > run on a number of runtimes like Apache Flink, Apache Spark, and Google > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > languages, allowing users to easily implement their data integration > processes. In general this seems like an excellent project and a well-thought-through and viable proposal -- I certainly anticipate that it will be accepted for incubation in one form or another. However, how does this "Dataflow" project relate to the programming paradigm of "dataflow programming"? https://en.wikipedia.org/wiki/Dataflow_programming Besides the potential for confusion, it seems like the proposed project name would be tough to defend as a trademark. > With respect to trademark rights, Google does not hold a trademark on the > phrase “Dataflow.” Based on feedback and guidance we receive during the > incubation process, we are open to renaming the project if necessary for > trademark or other concerns. If a renaming is going to happen, there are advantages to renaming sooner rather than later and sparing the community additional disruption. Marvin Humphrey
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi Sean, It's a fair point, but not present in most of the proposals. It's something that we can address in the "Community" section. Regards JB On 01/20/2016 05:55 PM, Sean Busbey wrote: Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. 
Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including a Dataflow runner for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and independent from the Google Cloud Dataflow service. 
Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly. The Dataflow programming model has been designed with simplicity, sc
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Great proposal. I like that your proposal includes a well presented roadmap, but I don't see any goals that directly address building a larger community. Y'all have any ideas around outreach that will help with adoption? As a start, I recommend y'all add a section to the proposal on the wiki page for "Additional Interested Contributors" so that folks who want to sign up to participate in the project can do so without requesting additions to the initial committer list. On Wed, Jan 20, 2016 at 10:32 AM, James Malone < jamesmal...@google.com.invalid> wrote: > Hello everyone, > > Attached to this message is a proposed new project - Apache Dataflow, a > unified programming model for data processing and integration. > > The text of the proposal is included below. Additionally, the proposal is > in draft form on the wiki where we will make any required changes: > > https://wiki.apache.org/incubator/DataflowProposal > > We look forward to your feedback and input. > > Best, > > James > > > > = Apache Dataflow = > > == Abstract == > > Dataflow is an open source, unified model and set of language-specific SDKs > for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > the mechanics of large-scale batch and streaming data processing and can > run on a number of runtimes like Apache Flink, Apache Spark, and Google > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > languages, allowing users to easily implement their data integration > processes. > > == Proposal == > > Dataflow is a simple, flexible, and powerful system for distributed data > processing at any scale. 
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi, This is a cool idea. It's like Apache TinkerPop (http://tinkerpop.incubator.apache.org/) but for data flow/stream systems as opposed to graph systems. Our tag line is "the JDBC for graphs." You would be "the JDBC for streams." :) You might be interested in looking at TinkerPop's Gremlin language, as it's a data flow language. http://tinkerpop.apache.org/docs/3.1.0-incubating/#traversal Moreover, it's a virtual machine that allows other languages to compile to it. http://www.datastax.com/dev/blog/the-benefits-of-the-gremlin-graph-traversal-machine https://github.com/dkuppitz/sparql-gremlin https://github.com/twilmes/sql-gremlin Do you have a URL to any documentation on Apache Dataflow's DSL? Perhaps there are ideas we can steal! Take care, Marko. http://markorodriguez.com On Jan 20, 2016, at 9:32 AM, James Malone wrote: > Hello everyone, > > Attached to this message is a proposed new project - Apache Dataflow, a > unified programming model for data processing and integration. > > The text of the proposal is included below. Additionally, the proposal is > in draft form on the wiki where we will make any required changes: > > https://wiki.apache.org/incubator/DataflowProposal > > We look forward to your feedback and input. > > Best, > > James > > > > = Apache Dataflow = > > == Abstract == > > Dataflow is an open source, unified model and set of language-specific SDKs > for defining and executing data processing workflows, and also data > ingestion and integration flows, supporting Enterprise Integration Patterns > (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify > the mechanics of large-scale batch and streaming data processing and can > run on a number of runtimes like Apache Flink, Apache Spark, and Google > Cloud Dataflow (a cloud service). Dataflow also brings DSL in different > languages, allowing users to easily implement their data integration > processes. 
Re: [DISCUSS] Apache Dataflow Incubator Proposal
Hi all, I second James there, and I'm really excited to be champion of the project (and to work on the codebase as well). I blogged a quick technical introduction to Dataflow: http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/ Thanks, James! We are looking forward to your feedback. Regards JB On 01/20/2016 05:32 PM, James Malone wrote: Hello everyone, Attached to this message is a proposed new project - Apache Dataflow, a unified programming model for data processing and integration. The text of the proposal is included below. Additionally, the proposal is in draft form on the wiki where we will make any required changes: https://wiki.apache.org/incubator/DataflowProposal We look forward to your feedback and input. Best, James = Apache Dataflow = == Abstract == Dataflow is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Dataflow also brings DSL in different languages, allowing users to easily implement their data integration processes. == Proposal == Dataflow is a simple, flexible, and powerful system for distributed data processing at any scale. Dataflow provides a unified programming model, a software development kit to define and construct data processing pipelines, and runners to execute Dataflow pipelines in several runtime engines, like Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation. 
The underlying programming model for Dataflow provides MapReduce-like parallelism, combined with support for powerful data windowing, and fine-grained correctness control. == Background == Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and MillWheel inside Google and is focused on providing a unified solution for batch and stream processing. These projects on which Dataflow is based have been published in several papers made available to the public: * MapReduce - http://research.google.com/archive/mapreduce.html * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf * MillWheel - http://research.google.com/pubs/pub41378.html Dataflow was designed from the start to provide a portable programming layer. When you define a data processing pipeline with the Dataflow model, you are creating a job which is capable of being processed by any number of Dataflow processing engines. Several engines have been developed to run Dataflow pipelines in other open source runtimes, including Dataflow runners for Apache Flink and Apache Spark. There is also a “direct runner”, for execution on the developer machine (mainly for dev/debug purposes). Another runner allows a Dataflow program to run on a managed service, Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already available on GitHub, and is independent of the Google Cloud Dataflow service. Another Python SDK is currently in active development. In this proposal, the Dataflow SDKs, model, and a set of runners will be submitted as an OSS project under the ASF. The runners which are a part of this proposal include those for Spark (from Cloudera), Flink (from data Artisans), and local development (from Google); the Google Cloud Dataflow service runner is not included in this proposal. 
Further references to Dataflow will refer to the Dataflow model, SDKs, and runners which are a part of this proposal (Apache Dataflow) only. The initial submission will contain the already-released Java SDK; Google intends to submit the Python SDK later in the incubation process. The Google Cloud Dataflow service will continue to be one of many runners for Dataflow, built on Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow will develop against the Apache project additions, updates, and changes. Google Cloud Dataflow will become one user of Apache Dataflow and will participate in the project openly and publicly. The Dataflow programming model has been designed with simplicity, scalability, and speed as key tenets. In the Dataflow model, you only need to think about four top-level concepts when constructing your data processing job: * Pipelines - The data processing job made of a series of computations including input, processing, and output * PCollections - Bounded (or unbounded) datasets which represent the input, intermediate and output data in pipelines * PTransforms -