Crunch started as a clone of FlumeJava, which was internal to Google. In the meantime, inside Google, FlumeJava evolved into Dataflow, so all three share a number of concepts such as PCollections, ParDo, DoFn, etc. However, Dataflow adds a number of new things, the biggest being unified batch/streaming semantics built on concepts like Windowing and Triggers. Tyler Akidau's O'Reilly post has a really nice explanation: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
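To make the shared concepts a bit more concrete, here is a toy sketch in plain Python. It is not the Crunch, FlumeJava, or Dataflow API: every name below (DoFn, par_do, fixed_window, count_per_window) is a made-up stand-in, a PCollection is modelled as a plain list of (event timestamp, value) pairs, and triggers are left out entirely. It only illustrates the idea that a ParDo applies a DoFn to each element independently, and that windowing groups elements by event time before aggregating.

```python
from collections import defaultdict


class DoFn:
    """Hypothetical stand-in for a DoFn: maps one element to zero or more outputs."""

    def process(self, element):
        raise NotImplementedError


class ExtractWords(DoFn):
    """Splits a timestamped line into timestamped words."""

    def process(self, element):
        timestamp, line = element
        for word in line.split():
            yield (timestamp, word)


def par_do(pcollection, do_fn):
    """Toy ParDo: apply do_fn to every element independently and flatten the outputs."""
    return [out for element in pcollection for out in do_fn.process(element)]


def fixed_window(timestamp, size):
    """Assign an event timestamp to the fixed window [start, start + size)."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def count_per_window(pcollection, size):
    """Count occurrences of each value within each fixed event-time window."""
    counts = defaultdict(int)
    for timestamp, word in pcollection:
        counts[(fixed_window(timestamp, size), word)] += 1
    return dict(counts)


# A tiny "PCollection" of (event timestamp, line) pairs.
lines = [(1, "a b"), (4, "a"), (12, "b b")]
words = par_do(lines, ExtractWords())
result = count_per_window(words, size=10)
```

The ExtractWords/par_do part is roughly what Crunch, FlumeJava, and Dataflow have in common. The window assignment is the kind of thing Dataflow adds: because grouping is keyed by event-time window, the same logic stays well-defined whether the input is a finite batch or a slice of an unbounded stream. Deciding when a window's result gets emitted is the job of triggers, which this sketch does not model.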
On Fri, Jan 22, 2016 at 10:42 AM, Ashish <paliwalash...@gmail.com> wrote: > Crunch has Spark pipelines, but not sure about the runner abstraction. > > May be Josh Wills or Tom White can provide more insight on this topic. > They are core devs for both projects :) > > On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > > Hi, > > > > I don't know deeply Crunch, but AFAIK, Crunch creates MapReduce > pipeline, it > > doesn't provide runner abstraction. It's based on FlumeJava. > > > > The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm > > wrong, but Crunch started after Google Dataflow, especially because > Dataflow > > was not opensourced at that time. > > > > So, I agree it's very similar/close. > > > > Regards > > JB > > > > > > On 01/22/2016 05:51 PM, Ashish wrote: > >> > >> Hi JB, > >> > >> Curious to know about how it compares to Apache Crunch? Constructs > >> looks very familiar (had used Crunch long ago) > >> > >> Thoughts? > >> > >> - Ashish > >> > >> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net> > >> wrote: > >>> > >>> Hi Seshu, > >>> > >>> I blogged about Apache Dataflow proposal: > >>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/ > >>> > >>> You can see in the "what's next ?" section that new runners, skins and > >>> sources are on our roadmap. Definitely, a storm runner could be part of > >>> this. > >>> > >>> Regards > >>> JB > >>> > >>> > >>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote: > >>>> > >>>> > >>>> Awesome to see CloudDataFlow coming to Apache. The Stream Processing > >>>> area > >>>> has been in general fragmented with a variety of solutions, hoping the > >>>> community galvanizes around Apache Data Flow. > >>>> > >>>> We are still in the "Apache Storm" world, Any chance for folks > building > >>>> a > >>>> "Storm Runner"?
> >>>> > >>>> > >>>> On 1/20/16, 9:39 AM, "James Malone" <jamesmal...@google.com.INVALID> > >>>> wrote: > >>>> > >>>>>> Great proposal. I like that your proposal includes a well presented > >>>>>> roadmap, but I don't see any goals that directly address building a > >>>>>> larger > >>>>>> community. Y'all have any ideas around outreach that will help with > >>>>>> adoption? > >>>>>> > >>>>> > >>>>> Thank you and fair point. We have a few additional ideas which we can > >>>>> put > >>>>> into the Community section. > >>>>> > >>>>> > >>>>>> > >>>>>> As a start, I recommend y'all add a section to the proposal on the > >>>>>> wiki > >>>>>> page for "Additional Interested Contributors" so that folks who want > >>>>>> to > >>>>>> sign up to participate in the project can do so without requesting > >>>>>> additions to the initial committer list. > >>>>>> > >>>>>> > >>>>> This is a great idea and I think it makes a lot of sense to add an > >>>>> "Additional > >>>>> Interested Contributors" section to the proposal. > >>>>> > >>>>> > >>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone < > >>>>>> jamesmal...@google.com.invalid> wrote: > >>>>>> > >>>>>>> Hello everyone, > >>>>>>> > >>>>>>> Attached to this message is a proposed new project - Apache > Dataflow, > >>>>>> > >>>>>> > >>>>>> a > >>>>>>> > >>>>>>> > >>>>>>> unified programming model for data processing and integration. > >>>>>>> > >>>>>>> The text of the proposal is included below. Additionally, the > >>>>>> > >>>>>> > >>>>>> proposal is > >>>>>>> > >>>>>>> > >>>>>>> in draft form on the wiki where we will make any required changes: > >>>>>>> > >>>>>>> https://wiki.apache.org/incubator/DataflowProposal > >>>>>>> > >>>>>>> We look forward to your feedback and input. 
> >>>>>>> > >>>>>>> Best, > >>>>>>> > >>>>>>> James > >>>>>>> > >>>>>>> ---- > >>>>>>> > >>>>>>> = Apache Dataflow = > >>>>>>> > >>>>>>> == Abstract == > >>>>>>> > >>>>>>> Dataflow is an open source, unified model and set of > >>>>>>> language-specific > >>>>>> > >>>>>> > >>>>>> SDKs > >>>>>>> > >>>>>>> > >>>>>>> for defining and executing data processing workflows, and also data > >>>>>>> ingestion and integration flows, supporting Enterprise Integration > >>>>>> > >>>>>> > >>>>>> Patterns > >>>>>>> > >>>>>>> > >>>>>>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines > >>>>>> > >>>>>> > >>>>>> simplify > >>>>>>> > >>>>>>> > >>>>>>> the mechanics of large-scale batch and streaming data processing > and > >>>>>> > >>>>>> > >>>>>> can > >>>>>>> > >>>>>>> > >>>>>>> run on a number of runtimes like Apache Flink, Apache Spark, and > >>>>>> > >>>>>> > >>>>>> Google > >>>>>>> > >>>>>>> > >>>>>>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in > >>>>>> > >>>>>> > >>>>>> different > >>>>>>> > >>>>>>> > >>>>>>> languages, allowing users to easily implement their data > integration > >>>>>>> processes. > >>>>>>> > >>>>>>> == Proposal == > >>>>>>> > >>>>>>> Dataflow is a simple, flexible, and powerful system for distributed > >>>>>> > >>>>>> > >>>>>> data > >>>>>>> > >>>>>>> > >>>>>>> processing at any scale. Dataflow provides a unified programming > >>>>>> > >>>>>> > >>>>>> model, a > >>>>>>> > >>>>>>> > >>>>>>> software development kit to define and construct data processing > >>>>>> > >>>>>> > >>>>>> pipelines, > >>>>>>> > >>>>>>> > >>>>>>> and runners to execute Dataflow pipelines in several runtime > engines, > >>>>>> > >>>>>> > >>>>>> like > >>>>>>> > >>>>>>> > >>>>>>> Apache Spark, Apache Flink, or Google Cloud Dataflow. 
Dataflow can > be > >>>>>> > >>>>>> > >>>>>> used > >>>>>>> > >>>>>>> > >>>>>>> for a variety of streaming or batch data processing goals including > >>>>>> > >>>>>> > >>>>>> ETL, > >>>>>>> > >>>>>>> > >>>>>>> stream analysis, and aggregate computation. The underlying > >>>>>>> programming > >>>>>>> model for Dataflow provides MapReduce-like parallelism, combined > with > >>>>>>> support for powerful data windowing, and fine-grained correctness > >>>>>> > >>>>>> > >>>>>> control. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> == Background == > >>>>>>> > >>>>>>> Dataflow started as a set of Google projects focused on making data > >>>>>>> processing easier, faster, and less costly. The Dataflow model is a > >>>>>>> successor to MapReduce, FlumeJava, and Millwheel inside Google and > is > >>>>>>> focused on providing a unified solution for batch and stream > >>>>>> > >>>>>> > >>>>>> processing. > >>>>>>> > >>>>>>> > >>>>>>> These projects on which Dataflow is based have been published in > >>>>>> > >>>>>> > >>>>>> several > >>>>>>> > >>>>>>> > >>>>>>> papers made available to the public: > >>>>>>> > >>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html > >>>>>>> > >>>>>>> * Dataflow model - > http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf > >>>>>>> > >>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf > >>>>>>> > >>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html > >>>>>>> > >>>>>>> Dataflow was designed from the start to provide a portable > >>>>>>> programming > >>>>>>> layer. When you define a data processing pipeline with the Dataflow > >>>>>> > >>>>>> > >>>>>> model, > >>>>>>> > >>>>>>> > >>>>>>> you are creating a job which is capable of being processed by any > >>>>>> > >>>>>> > >>>>>> number > >>>>>> of > >>>>>>> > >>>>>>> > >>>>>>> Dataflow processing engines. 
Several engines have been developed to > >>>>>> > >>>>>> > >>>>>> run > >>>>>>> > >>>>>>> > >>>>>>> Dataflow pipelines in other open source runtimes, including a > >>>>>>> Dataflow > >>>>>>> runner for Apache Flink and Apache Spark. There is also a "direct > >>>>>> > >>>>>> > >>>>>> runner", > >>>>>>> > >>>>>>> > >>>>>>> for execution on the developer machine (mainly for dev/debug > >>>>>> > >>>>>> > >>>>>> purposes). > >>>>>>> > >>>>>>> > >>>>>>> Another runner allows a Dataflow program to run on a managed > service, > >>>>>>> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java > >>>>>> > >>>>>> > >>>>>> SDK is > >>>>>>> > >>>>>>> > >>>>>>> already available on GitHub, and independent from the Google Cloud > >>>>>> > >>>>>> > >>>>>> Dataflow > >>>>>>> > >>>>>>> > >>>>>>> service. Another Python SDK is currently in active development. > >>>>>>> > >>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners > will > >>>>>> > >>>>>> > >>>>>> be > >>>>>>> > >>>>>>> > >>>>>>> submitted as an OSS project under the ASF. The runners which are a > >>>>>> > >>>>>> > >>>>>> part > >>>>>> of > >>>>>>> > >>>>>>> > >>>>>>> this proposal include those for Spark (from Cloudera), Flink (from > >>>>>> > >>>>>> > >>>>>> data > >>>>>>> > >>>>>>> > >>>>>>> Artisans), and local development (from Google); the Google Cloud > >>>>>> > >>>>>> > >>>>>> Dataflow > >>>>>>> > >>>>>>> > >>>>>>> service runner is not included in this proposal. Further references > >>>>>>> to > >>>>>>> Dataflow will refer to the Dataflow model, SDKs, and runners which > >>>>>> > >>>>>> > >>>>>> are a > >>>>>>> > >>>>>>> > >>>>>>> part of this proposal (Apache Dataflow) only. The initial > submission > >>>>>> > >>>>>> > >>>>>> will > >>>>>>> > >>>>>>> > >>>>>>> contain the already-released Java SDK; Google intends to submit the > >>>>>> > >>>>>> > >>>>>> Python > >>>>>>> > >>>>>>> > >>>>>>> SDK later in the incubation process. 
The Google Cloud Dataflow > >>>>>>> service > >>>>>> > >>>>>> > >>>>>> will > >>>>>>> > >>>>>>> > >>>>>>> continue to be one of many runners for Dataflow, built on Google > >>>>>>> Cloud > >>>>>>> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow > will > >>>>>>> develop against the Apache project additions, updates, and changes. > >>>>>> > >>>>>> > >>>>>> Google > >>>>>>> > >>>>>>> > >>>>>>> Cloud Dataflow will become one user of Apache Dataflow and will > >>>>>> > >>>>>> > >>>>>> participate > >>>>>>> > >>>>>>> > >>>>>>> in the project openly and publicly. > >>>>>>> > >>>>>>> The Dataflow programming model has been designed with simplicity, > >>>>>>> scalability, and speed as key tenants. In the Dataflow model, you > >>>>>>> only > >>>>>> > >>>>>> > >>>>>> need > >>>>>>> > >>>>>>> > >>>>>>> to think about four top-level concepts when constructing your data > >>>>>>> processing job: > >>>>>>> > >>>>>>> * Pipelines - The data processing job made of a series of > >>>>>>> computations > >>>>>>> including input, processing, and output > >>>>>>> > >>>>>>> * PCollections - Bounded (or unbounded) datasets which represent > the > >>>>>> > >>>>>> > >>>>>> input, > >>>>>>> > >>>>>>> > >>>>>>> intermediate and output data in pipelines > >>>>>>> > >>>>>>> * PTransforms - A data processing step in a pipeline in which one > or > >>>>>> > >>>>>> > >>>>>> more > >>>>>>> > >>>>>>> > >>>>>>> PCollections are an input and output > >>>>>>> > >>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which > are > >>>>>> > >>>>>> > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> roots and endpoints of the pipeline > >>>>>>> > >>>>>>> == Rationale == > >>>>>>> > >>>>>>> With Dataflow, Google intended to develop a framework which allowed > >>>>>>> developers to be maximally productive in defining the processing, > and > >>>>>> > >>>>>> > >>>>>> then > >>>>>>> > >>>>>>> > >>>>>>> be able to execute the program at various levels of > >>>>>>> latency/cost/completeness 
without re-architecting or re-writing it. > >>>>>> > >>>>>> > >>>>>> This > >>>>>>> > >>>>>>> > >>>>>>> goal was informed by Google's past experience developing several > >>>>>> > >>>>>> > >>>>>> models, > >>>>>>> > >>>>>>> > >>>>>>> frameworks, and tools useful for large-scale and distributed data > >>>>>>> processing. While Google has previously published papers describing > >>>>>> > >>>>>> > >>>>>> some > >>>>>> of > >>>>>>> > >>>>>>> > >>>>>>> its technologies, Google decided to take a different approach with > >>>>>>> Dataflow. Google open-sourced the SDK and model alongside > >>>>>> > >>>>>> > >>>>>> commercialization > >>>>>>> > >>>>>>> > >>>>>>> of the idea and ahead of publishing papers on the topic. As a > result, > >>>>>> > >>>>>> > >>>>>> a > >>>>>>> > >>>>>>> > >>>>>>> number of open source runtimes exist for Dataflow, such as the > Apache > >>>>>> > >>>>>> > >>>>>> Flink > >>>>>>> > >>>>>>> > >>>>>>> and Apache Spark runners. > >>>>>>> > >>>>>>> We believe that submitting Dataflow as an Apache project will > provide > >>>>>> > >>>>>> > >>>>>> an > >>>>>>> > >>>>>>> > >>>>>>> immediate, worthwhile, and substantial contribution to the open > >>>>>>> source > >>>>>>> community. As an incubating project, we believe Dataflow will have > a > >>>>>> > >>>>>> > >>>>>> better > >>>>>>> > >>>>>>> > >>>>>>> opportunity to provide a meaningful contribution to OSS and also > >>>>>> > >>>>>> > >>>>>> integrate > >>>>>>> > >>>>>>> > >>>>>>> with other Apache projects. > >>>>>>> > >>>>>>> In the long term, we believe Dataflow can be a powerful abstraction > >>>>>> > >>>>>> > >>>>>> layer > >>>>>>> > >>>>>>> > >>>>>>> for data processing. 
By providing an abstraction layer for data > >>>>>> > >>>>>> > >>>>>> pipelines > >>>>>>> > >>>>>>> > >>>>>>> and processing, data workflows can be increasingly portable, > >>>>>> > >>>>>> > >>>>>> resilient to > >>>>>>> > >>>>>>> > >>>>>>> breaking changes in tooling, and compatible across many execution > >>>>>> > >>>>>> > >>>>>> engines, > >>>>>>> > >>>>>>> > >>>>>>> runtimes, and open source projects. > >>>>>>> > >>>>>>> == Initial Goals == > >>>>>>> > >>>>>>> We are breaking our initial goals into immediate (< 2 months), > >>>>>> > >>>>>> > >>>>>> short-term > >>>>>>> > >>>>>>> > >>>>>>> (2-4 months), and intermediate-term (> 4 months). > >>>>>>> > >>>>>>> Our immediate goals include the following: > >>>>>>> > >>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners > into > >>>>>> > >>>>>> > >>>>>> one > >>>>>>> > >>>>>>> > >>>>>>> project > >>>>>>> > >>>>>>> * Plan for refactoring the existing Java SDK for better > extensibility > >>>>>> > >>>>>> > >>>>>> by > >>>>>>> > >>>>>>> > >>>>>>> SDK and runner writers > >>>>>>> > >>>>>>> * Validating all dependencies are ASL 2.0 or compatible > >>>>>>> > >>>>>>> * Understanding and adapting to the Apache development process > >>>>>>> > >>>>>>> Our short-term goals include: > >>>>>>> > >>>>>>> * Moving the newly-merged lists, and build utilities to Apache > >>>>>>> > >>>>>>> * Start refactoring codebase and move code to Apache Git repo > >>>>>>> > >>>>>>> * Continue development of new features, functions, and fixes in the > >>>>>>> Dataflow Java SDK, and Dataflow runners > >>>>>>> > >>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and > >>>>>>> plan > >>>>>> > >>>>>> > >>>>>> for > >>>>>>> > >>>>>>> > >>>>>>> how to include new major ideas, modules, and runtimes > >>>>>>> > >>>>>>> * Establishment of easy and clear build/test framework for Dataflow > >>>>>> > >>>>>> > >>>>>> and > >>>>>>> > >>>>>>> > >>>>>>> associated runtimes; creation of testing, rollback, and validation > 
>>>>>> > >>>>>> > >>>>>> policy > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> * Analysis and design for work needed to make Dataflow a better > data > >>>>>>> processing abstraction layer for multiple open source frameworks > and > >>>>>>> environments > >>>>>>> > >>>>>>> Finally, we have a number of intermediate-term goals: > >>>>>>> > >>>>>>> * Roadmapping, planning, and execution of integrations with other > OSS > >>>>>> > >>>>>> > >>>>>> and > >>>>>>> > >>>>>>> > >>>>>>> non-OSS projects/products > >>>>>>> > >>>>>>> * Inclusion of additional SDK for Python, which is under active > >>>>>> > >>>>>> > >>>>>> development > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> == Current Status == > >>>>>>> > >>>>>>> === Meritocracy === > >>>>>>> > >>>>>>> Dataflow was initially developed based on ideas from many employees > >>>>>> > >>>>>> > >>>>>> within > >>>>>>> > >>>>>>> > >>>>>>> Google. As an ASL OSS project on GitHub, the Dataflow SDK has > >>>>>>> received > >>>>>>> contributions from data Artisans, Cloudera Labs, and other > individual > >>>>>>> developers. As a project under incubation, we are committed to > >>>>>> > >>>>>> > >>>>>> expanding > >>>>>>> > >>>>>>> > >>>>>>> our effort to build an environment which supports a meritocracy. We > >>>>>> > >>>>>> > >>>>>> are > >>>>>>> > >>>>>>> > >>>>>>> focused on engaging the community and other related projects for > >>>>>> > >>>>>> > >>>>>> support > >>>>>>> > >>>>>>> > >>>>>>> and contributions. Moreover, we are committed to ensure > contributors > >>>>>> > >>>>>> > >>>>>> and > >>>>>>> > >>>>>>> > >>>>>>> committers to Dataflow come from a broad mix of organizations > through > >>>>>> > >>>>>> > >>>>>> a > >>>>>>> > >>>>>>> > >>>>>>> merit-based decision process during incubation. We believe strongly > >>>>>>> in > >>>>>> > >>>>>> > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> Dataflow model and are committed to growing an inclusive community > of > >>>>>>> Dataflow contributors. 
> >>>>>>> > >>>>>>> === Community === > >>>>>>> > >>>>>>> The core of the Dataflow Java SDK has been developed by Google for > >>>>>>> use > >>>>>> > >>>>>> > >>>>>> with > >>>>>>> > >>>>>>> > >>>>>>> Google Cloud Dataflow. Google has active community engagement in > the > >>>>>> > >>>>>> > >>>>>> SDK > >>>>>>> > >>>>>>> > >>>>>>> GitHub repository ( > >>>>>> > >>>>>> > >>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK > >>>>>>> > >>>>>>> > >>>>>>> ), > >>>>>>> on Stack Overflow ( > >>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow) > and > >>>>>> > >>>>>> > >>>>>> has > >>>>>>> > >>>>>>> > >>>>>>> had contributions from a number of organizations and indivuduals. > >>>>>>> > >>>>>>> Everyday, Cloud Dataflow is actively used by a number of > >>>>>>> organizations > >>>>>> > >>>>>> > >>>>>> and > >>>>>>> > >>>>>>> > >>>>>>> institutions for batch and stream processing of data. We believe > >>>>>> > >>>>>> > >>>>>> acceptance > >>>>>>> > >>>>>>> > >>>>>>> will allow us to consolidate existing Dataflow-related work, grow > the > >>>>>>> Dataflow community, and deepen connections between Dataflow and > other > >>>>>> > >>>>>> > >>>>>> open > >>>>>>> > >>>>>>> > >>>>>>> source projects. 
> >>>>>>> > >>>>>>> === Core Developers === > >>>>>>> > >>>>>>> The core developers for Dataflow and the Dataflow runners are: > >>>>>>> > >>>>>>> * Frances Perry > >>>>>>> > >>>>>>> * Tyler Akidau > >>>>>>> > >>>>>>> * Davor Bonaci > >>>>>>> > >>>>>>> * Luke Cwik > >>>>>>> > >>>>>>> * Ben Chambers > >>>>>>> > >>>>>>> * Kenn Knowles > >>>>>>> > >>>>>>> * Dan Halperin > >>>>>>> > >>>>>>> * Daniel Mills > >>>>>>> > >>>>>>> * Mark Shields > >>>>>>> > >>>>>>> * Craig Chambers > >>>>>>> > >>>>>>> * Maximilian Michels > >>>>>>> > >>>>>>> * Tom White > >>>>>>> > >>>>>>> * Josh Wills > >>>>>>> > >>>>>>> === Alignment === > >>>>>>> > >>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can > >>>>>>> be > >>>>>>> executed on Apache Spark or Apache Flink. Dataflow is also related > to > >>>>>> > >>>>>> > >>>>>> other > >>>>>>> > >>>>>>> > >>>>>>> Apache projects, such as Apache Crunch. We plan on expanding > >>>>>> > >>>>>> > >>>>>> functionality > >>>>>>> > >>>>>>> > >>>>>>> for Dataflow runners, support for additional domain specific > >>>>>> > >>>>>> > >>>>>> languages, > >>>>>> and > >>>>>>> > >>>>>>> > >>>>>>> increased portability so Dataflow is a powerful abstraction layer > for > >>>>>> > >>>>>> > >>>>>> data > >>>>>>> > >>>>>>> > >>>>>>> processing. > >>>>>>> > >>>>>>> == Known Risks == > >>>>>>> > >>>>>>> === Orphaned Products === > >>>>>>> > >>>>>>> The Dataflow SDK is presently used by several organizations, from > >>>>>> > >>>>>> > >>>>>> small > >>>>>>> > >>>>>>> > >>>>>>> startups to Fortune 100 companies, to construct production > pipelines > >>>>>> > >>>>>> > >>>>>> which > >>>>>>> > >>>>>>> > >>>>>>> are executed in Google Cloud Dataflow. Google has a long-term > >>>>>> > >>>>>> > >>>>>> commitment > >>>>>> to > >>>>>>> > >>>>>>> > >>>>>>> advance the Dataflow SDK; moreover, Dataflow is seeing increasing > >>>>>> > >>>>>> > >>>>>> interest, > >>>>>>> > >>>>>>> > >>>>>>> development, and adoption from organizations outside of Google. 
> >>>>>>> > >>>>>>> === Inexperience with Open Source === > >>>>>>> > >>>>>>> Google believes strongly in open source and the exchange of > >>>>>> > >>>>>> > >>>>>> information > >>>>>> to > >>>>>>> > >>>>>>> > >>>>>>> advance new ideas and work. Examples of this commitment are active > >>>>>>> OSS > >>>>>>> projects such as Chromium (https://www.chromium.org) and > Kubernetes ( > >>>>>>> http://kubernetes.io/). With Dataflow, we have tried to be > >>>>>> > >>>>>> > >>>>>> increasingly > >>>>>>> > >>>>>>> > >>>>>>> open and forward-looking; we have published a paper in the VLDB > >>>>>> > >>>>>> > >>>>>> conference > >>>>>>> > >>>>>>> > >>>>>>> describing the Dataflow model ( > >>>>>>> http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to > >>>>>> > >>>>>> > >>>>>> release > >>>>>>> > >>>>>>> > >>>>>>> the Dataflow SDK as open source software with the launch of Cloud > >>>>>> > >>>>>> > >>>>>> Dataflow. > >>>>>>> > >>>>>>> > >>>>>>> Our submission to the Apache Software Foundation is a logical > >>>>>> > >>>>>> > >>>>>> extension > >>>>>> of > >>>>>>> > >>>>>>> > >>>>>>> our commitment to open source software. > >>>>>>> > >>>>>>> === Homogeneous Developers === > >>>>>>> > >>>>>>> The majority of committers in this proposal belong to Google due to > >>>>>> > >>>>>> > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> fact that Dataflow has emerged from several internal Google > projects. > >>>>>> > >>>>>> > >>>>>> This > >>>>>>> > >>>>>>> > >>>>>>> proposal also includes committers outside of Google who are > actively > >>>>>>> involved with other Apache projects, such as Hadoop, Flink, and > >>>>>>> Spark. > >>>>>> > >>>>>> > >>>>>> We > >>>>>>> > >>>>>>> > >>>>>>> expect our entry into incubation will allow us to expand the number > >>>>>>> of > >>>>>>> individuals and organizations participating in Dataflow > development. 
> >>>>>>> Additionally, separation of the Dataflow SDK from Google Cloud > >>>>>> > >>>>>> > >>>>>> Dataflow > >>>>>>> > >>>>>>> > >>>>>>> allows us to focus on the open source SDK and model and do what is > >>>>>> > >>>>>> > >>>>>> best > >>>>>> for > >>>>>>> > >>>>>>> > >>>>>>> this project. > >>>>>>> > >>>>>>> === Reliance on Salaried Developers === > >>>>>>> > >>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily > >>>>>>> by > >>>>>>> salaried developers supporting the Google Cloud Dataflow project. > >>>>>> > >>>>>> > >>>>>> While > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> Dataflow SDK and Cloud Dataflow have been developed by different > >>>>>>> teams > >>>>>> > >>>>>> > >>>>>> (and > >>>>>>> > >>>>>>> > >>>>>>> this proposal would reinforce that separation) we expect our > initial > >>>>>> > >>>>>> > >>>>>> set > >>>>>> of > >>>>>>> > >>>>>>> > >>>>>>> developers will still primarily be salaried. Contribution has not > >>>>>>> been > >>>>>>> exclusively from salaried developers, however. For example, the > >>>>>> > >>>>>> > >>>>>> contrib > >>>>>>> > >>>>>>> > >>>>>>> directory of the Dataflow SDK ( > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contri > >>>>>> b > >>>>>>> > >>>>>>> > >>>>>>> ) > >>>>>>> contains items from free-time contributors. Moreover, seperate > >>>>>> > >>>>>> > >>>>>> projects, > >>>>>>> > >>>>>>> > >>>>>>> such as ScalaFlow (https://github.com/darkjh/scalaflow) have been > >>>>>> > >>>>>> > >>>>>> created > >>>>>>> > >>>>>>> > >>>>>>> around the Dataflow model and SDK. We expect our reliance on > salaried > >>>>>>> developers will decrease over time during incubation. > >>>>>>> > >>>>>>> === Relationship with other Apache products === > >>>>>>> > >>>>>>> Dataflow directly interoperates with or utilizes several existing > >>>>>> > >>>>>> > >>>>>> Apache > >>>>>>> > >>>>>>> > >>>>>>> projects. 
> >>>>>>> > >>>>>>> * Build > >>>>>>> > >>>>>>> ** Apache Maven > >>>>>>> > >>>>>>> * Data I/O, Libraries > >>>>>>> > >>>>>>> ** Apache Avro > >>>>>>> > >>>>>>> ** Apache Commons > >>>>>>> > >>>>>>> * Dataflow runners > >>>>>>> > >>>>>>> ** Apache Flink > >>>>>>> > >>>>>>> ** Apache Spark > >>>>>>> > >>>>>>> Dataflow when used in batch mode shares similarities with Apache > >>>>>> > >>>>>> > >>>>>> Crunch; > >>>>>>> > >>>>>>> > >>>>>>> however, Dataflow is focused on a model, SDK, and abstraction layer > >>>>>> > >>>>>> > >>>>>> beyond > >>>>>>> > >>>>>>> > >>>>>>> Spark and Hadoop (MapReduce.) One key goal of Dataflow is to > provide > >>>>>> > >>>>>> > >>>>>> an > >>>>>>> > >>>>>>> > >>>>>>> intermediate abstraction layer which can easily be implemented and > >>>>>> > >>>>>> > >>>>>> utilized > >>>>>>> > >>>>>>> > >>>>>>> across several different processing frameworks. > >>>>>>> > >>>>>>> === An excessive fascination with the Apache brand === > >>>>>>> > >>>>>>> With this proposal we are not seeking attention or publicity. > Rather, > >>>>>> > >>>>>> > >>>>>> we > >>>>>>> > >>>>>>> > >>>>>>> firmly believe in the Dataflow model, SDK, and the ability to make > >>>>>> > >>>>>> > >>>>>> Dataflow > >>>>>>> > >>>>>>> > >>>>>>> a powerful yet simple framework for data processing. While the > >>>>>> > >>>>>> > >>>>>> Dataflow > >>>>>> SDK > >>>>>>> > >>>>>>> > >>>>>>> and model have been open source, we believe putting code on GitHub > >>>>>>> can > >>>>>> > >>>>>> > >>>>>> only > >>>>>>> > >>>>>>> > >>>>>>> go so far. We see the Apache community, processes, and mission as > >>>>>> > >>>>>> > >>>>>> critical > >>>>>>> > >>>>>>> > >>>>>>> for ensuring the Dataflow SDK and model are truly community-driven, > >>>>>>> positively impactful, and innovative open source software. 
While > >>>>>> > >>>>>> > >>>>>> Google > >>>>>> has > >>>>>>> > >>>>>>> > >>>>>>> taken a number of steps to advance its various open source > projects, > >>>>>> > >>>>>> > >>>>>> we > >>>>>>> > >>>>>>> > >>>>>>> believe Dataflow is a great fit for the Apache Software Foundation > >>>>>> > >>>>>> > >>>>>> due to > >>>>>>> > >>>>>>> > >>>>>>> its focus on data processing and its relationships to existing ASF > >>>>>>> projects. > >>>>>>> > >>>>>>> == Documentation == > >>>>>>> > >>>>>>> The following documentation is relevant to this proposal. Relevant > >>>>>> > >>>>>> > >>>>>> portion > >>>>>>> > >>>>>>> > >>>>>>> of the documentation will be contributed to the Apache Dataflow > >>>>>> > >>>>>> > >>>>>> project. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> * Dataflow website: https://cloud.google.com/dataflow > >>>>>>> > >>>>>>> * Dataflow programming model: > >>>>>>> https://cloud.google.com/dataflow/model/programming-model > >>>>>>> > >>>>>>> * Codebases > >>>>>>> > >>>>>>> ** Dataflow Java SDK: > >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK > >>>>>>> > >>>>>>> ** Flink Dataflow runner: > >>>>>> > >>>>>> > >>>>>> https://github.com/dataArtisans/flink-dataflow > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> ** Spark Dataflow runner: > https://github.com/cloudera/spark-dataflow > >>>>>>> > >>>>>>> * Dataflow Java SDK issue tracker: > >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues > >>>>>>> > >>>>>>> * google-cloud-dataflow tag on Stack Overflow: > >>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow > >>>>>>> > >>>>>>> == Initial Source == > >>>>>>> > >>>>>>> The initial source for Dataflow which we will submit to the Apache > >>>>>>> Foundation will include several related projects which are > currently > >>>>>> > >>>>>> > >>>>>> hosted > >>>>>>> > >>>>>>> > >>>>>>> on the GitHub repositories: > >>>>>>> > >>>>>>> * Dataflow Java SDK ( > >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK) > >>>>>>> > 
>>>>>>> * Flink Dataflow runner > >>>>>> > >>>>>> > >>>>>> (https://github.com/dataArtisans/flink-dataflow) > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> * Spark Dataflow runner ( > https://github.com/cloudera/spark-dataflow) > >>>>>>> > >>>>>>> These projects have always been Apache 2.0 licensed. We intend to > >>>>>> > >>>>>> > >>>>>> bundle > >>>>>>> > >>>>>>> > >>>>>>> all of these repositories since they are all complimentary and > should > >>>>>> > >>>>>> > >>>>>> be > >>>>>>> > >>>>>>> > >>>>>>> maintained in one project. Prior to our submission, we will combine > >>>>>> > >>>>>> > >>>>>> all > >>>>>> of > >>>>>>> > >>>>>>> > >>>>>>> these projects into a new git repository. > >>>>>>> > >>>>>>> == Source and Intellectual Property Submission Plan == > >>>>>>> > >>>>>>> The source for the Dataflow SDK and the three runners (Spark, > Flink, > >>>>>> > >>>>>> > >>>>>> Google > >>>>>>> > >>>>>>> > >>>>>>> Cloud Dataflow) are already licensed under an Apache 2 license. > >>>>>>> > >>>>>>> * Dataflow SDK - > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENS > >>>>>> E > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> * Flink runner - > >>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE > >>>>>>> > >>>>>>> * Spark runner - > >>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE > >>>>>>> > >>>>>>> Contributors to the Dataflow SDK have also signed the Google > >>>>>> > >>>>>> > >>>>>> Individual > >>>>>>> > >>>>>>> > >>>>>>> Contributor License Agreement ( > >>>>>>> https://cla.developers.google.com/about/google-individual) in > order > >>>>>>> to > >>>>>>> contribute to the project. 
> >>>>>>> > >>>>>>> With respect to trademark rights, Google does not hold a trademark > on > >>>>>> > >>>>>> > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> phrase "Dataflow." Based on feedback and guidance we receive during > >>>>>> > >>>>>> > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> incubation process, we are open to renaming the project if > necessary > >>>>>> > >>>>>> > >>>>>> for > >>>>>>> > >>>>>>> > >>>>>>> trademark or other concerns. > >>>>>>> > >>>>>>> == External Dependencies == > >>>>>>> > >>>>>>> All external dependencies are licensed under an Apache 2.0 or > >>>>>>> Apache-compatible license. As we grow the Dataflow community we > will > >>>>>>> configure our build process to require and validate all > contributions > >>>>>> > >>>>>> > >>>>>> and > >>>>>>> > >>>>>>> > >>>>>>> dependencies are licensed under the Apache 2.0 license or are under > >>>>>>> an > >>>>>>> Apache-compatible license. > >>>>>>> > >>>>>>> == Required Resources == > >>>>>>> > >>>>>>> === Mailing Lists === > >>>>>>> > >>>>>>> We currently use a mix of mailing lists. We will migrate our > existing > >>>>>>> mailing lists to the following: > >>>>>>> > >>>>>>> * d...@dataflow.incubator.apache.org > >>>>>>> > >>>>>>> * u...@dataflow.incubator.apache.org > >>>>>>> > >>>>>>> * priv...@dataflow.incubator.apache.org > >>>>>>> > >>>>>>> * comm...@dataflow.incubator.apache.org > >>>>>>> > >>>>>>> === Source Control === > >>>>>>> > >>>>>>> The Dataflow team currently uses Git and would like to continue to > do > >>>>>> > >>>>>> > >>>>>> so. > >>>>>>> > >>>>>>> > >>>>>>> We request a Git repository for Dataflow with mirroring to GitHub > >>>>>> > >>>>>> > >>>>>> enabled. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> === Issue Tracking === > >>>>>>> > >>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow > >>>>>> > >>>>>> > >>>>>> project is > >>>>>>> > >>>>>>> > >>>>>>> currently using both a public GitHub issue tracker and internal > >>>>>>> Google > >>>>>>> issue tracking. 
We will migrate and combine from these two sources > to > >>>>>> > >>>>>> > >>>>>> the > >>>>>>> > >>>>>>> > >>>>>>> Apache JIRA. > >>>>>>> > >>>>>>> == Initial Committers == > >>>>>>> > >>>>>>> * Aljoscha Krettek [aljos...@apache.org] > >>>>>>> > >>>>>>> * Amit Sela [amitsel...@gmail.com] > >>>>>>> > >>>>>>> * Ben Chambers [bchamb...@google.com] > >>>>>>> > >>>>>>> * Craig Chambers [chamb...@google.com] > >>>>>>> > >>>>>>> * Dan Halperin [dhalp...@google.com] > >>>>>>> > >>>>>>> * Davor Bonaci [da...@google.com] > >>>>>>> > >>>>>>> * Frances Perry [f...@google.com] > >>>>>>> > >>>>>>> * James Malone [jamesmal...@google.com] > >>>>>>> > >>>>>>> * Jean-Baptiste Onofré [jbono...@apache.org] > >>>>>>> > >>>>>>> * Josh Wills [jwi...@apache.org] > >>>>>>> > >>>>>>> * Kostas Tzoumas [kos...@data-artisans.com] > >>>>>>> > >>>>>>> * Kenneth Knowles [k...@google.com] > >>>>>>> > >>>>>>> * Luke Cwik [lc...@google.com] > >>>>>>> > >>>>>>> * Maximilian Michels [m...@apache.org] > >>>>>>> > >>>>>>> * Stephan Ewen [step...@data-artisans.com] > >>>>>>> > >>>>>>> * Tom White [t...@cloudera.com] > >>>>>>> > >>>>>>> * Tyler Akidau [taki...@google.com] > >>>>>>> > >>>>>>> == Affiliations == > >>>>>>> > >>>>>>> The initial committers are from six organizations. Google developed > >>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink > >>>>>> > >>>>>> > >>>>>> runner, > >>>>>>> > >>>>>>> > >>>>>>> and Cloudera (Labs) developed the Spark runner. 
> >>>>>>> > >>>>>>> * Cloudera > >>>>>>> > >>>>>>> ** Tom White > >>>>>>> > >>>>>>> * Data Artisans > >>>>>>> > >>>>>>> ** Aljoscha Krettek > >>>>>>> > >>>>>>> ** Kostas Tzoumas > >>>>>>> > >>>>>>> ** Maximilian Michels > >>>>>>> > >>>>>>> ** Stephan Ewen > >>>>>>> > >>>>>>> * Google > >>>>>>> > >>>>>>> ** Ben Chambers > >>>>>>> > >>>>>>> ** Dan Halperin > >>>>>>> > >>>>>>> ** Davor Bonaci > >>>>>>> > >>>>>>> ** Frances Perry > >>>>>>> > >>>>>>> ** James Malone > >>>>>>> > >>>>>>> ** Kenneth Knowles > >>>>>>> > >>>>>>> ** Luke Cwik > >>>>>>> > >>>>>>> ** Tyler Akidau > >>>>>>> > >>>>>>> * PayPal > >>>>>>> > >>>>>>> ** Amit Sela > >>>>>>> > >>>>>>> * Slack > >>>>>>> > >>>>>>> ** Josh Wills > >>>>>>> > >>>>>>> * Talend > >>>>>>> > >>>>>>> ** Jean-Baptiste Onofré > >>>>>>> > >>>>>>> == Sponsors == > >>>>>>> > >>>>>>> === Champion === > >>>>>>> > >>>>>>> * Jean-Baptiste Onofre [jbono...@apache.org] > >>>>>>> > >>>>>>> === Nominated Mentors === > >>>>>>> > >>>>>>> * Jim Jagielski [j...@apache.org] > >>>>>>> > >>>>>>> * Venkatesh Seetharam [venkat...@apache.org] > >>>>>>> > >>>>>>> * Bertrand Delacretaz [bdelacre...@apache.org] > >>>>>>> > >>>>>>> * Ted Dunning [tdunn...@apache.org] > >>>>>>> > >>>>>>> === Sponsoring Entity === > >>>>>>> > >>>>>>> The Apache Incubator > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> Sean > >>>>>> > >>>> > >>>> > >>>> --------------------------------------------------------------------- > >>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > >>>> For additional commands, e-mail: general-h...@incubator.apache.org > >>>> > >>> > >>> -- > >>> Jean-Baptiste Onofré > >>> jbono...@apache.org > >>> http://blog.nanthrax.net > >>> Talend - http://www.talend.com > >>> > >>> --------------------------------------------------------------------- > >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > >>> For additional commands, e-mail: general-h...@incubator.apache.org > >>> > >> > >> > >> > > > > -- > > 
Jean-Baptiste Onofré > > jbono...@apache.org > > http://blog.nanthrax.net > > Talend - http://www.talend.com > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > > For additional commands, e-mail: general-h...@incubator.apache.org > > > > > > -- > thanks > ashish > > Blog: http://www.ashishpaliwal.com/blog > My Photo Galleries: http://www.pbase.com/ashishpaliwal > > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > >