Re: [DISCUSS] Apache Dataflow Incubator Proposal

Frances Perry Fri, 22 Jan 2016 11:00:08 -0800

Crunch started as a clone of FlumeJava, which was Google internal. In the
meantime inside Google, FlumeJava evolved into Dataflow. So all three share
a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
adds a number of new things -- the biggest being a unified batch/streaming
semantics using concepts like Windowing and Triggers. Tyler Akidau's
O'Reilly post has a really nice explanation:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
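
To make the windowing idea concrete, here is a minimal sketch in plain Python. It is not Dataflow SDK code: the function name fixed_windows and the sample events are assumptions made up for illustration. Each element is assigned to a fixed window by its own event timestamp, so the per-window sums come out the same whether the data arrives as a complete batch or out of order as a stream:

```python
from collections import defaultdict


def fixed_windows(events, size):
    """Sum values per fixed event-time window of `size` seconds.

    `events` is an iterable of (event_time_seconds, value) pairs.
    Hypothetical helper for illustration; not a Dataflow SDK API.
    """
    sums = defaultdict(int)
    for event_time, value in events:
        # The window is chosen from the event's own timestamp,
        # not from the order or time at which it arrives.
        window_start = (event_time // size) * size
        sums[window_start] += value
    return dict(sums)


# The same events, once in timestamp order ("batch")
# and once shuffled ("stream").
in_order = [(1, 10), (4, 5), (61, 7), (65, 1), (130, 2)]
out_of_order = [(65, 1), (1, 10), (130, 2), (61, 7), (4, 5)]

batch_result = fixed_windows(in_order, size=60)
stream_result = fixed_windows(out_of_order, size=60)
```

Triggers, which decide when a window's result is emitted, are left out of this sketch.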


On Fri, Jan 22, 2016 at 10:42 AM, Ashish <paliwalash...@gmail.com> wrote:

> Crunch has Spark pipelines, but not sure about the runner abstraction.
>
> Maybe Josh Wills or Tom White can provide more insight on this topic.
> They are core devs for both projects :)
>
> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> > Hi,
> >
> > I don't know Crunch deeply, but AFAIK Crunch creates MapReduce
> > pipelines; it doesn't provide a runner abstraction. It's based on FlumeJava.
> >
> > The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
> > wrong, but Crunch started after Google Dataflow, especially because
> Dataflow
> > was not opensourced at that time.
> >
> > So, I agree it's very similar/close.
> >
> > Regards
> > JB
> >
> >
> > On 01/22/2016 05:51 PM, Ashish wrote:
> >>
> >> Hi JB,
> >>
> >> Curious to know how it compares to Apache Crunch? The constructs
> >> look very familiar (I used Crunch long ago).
> >>
> >> Thoughts?
> >>
> >> - Ashish
> >>
> >> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >>>
> >>> Hi Seshu,
> >>>
> >>> I blogged about the Apache Dataflow proposal:
> >>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
> >>>
> >>> You can see in the "what's next ?" section that new runners, skins and
> >>> sources are on our roadmap. Definitely, a storm runner could be part of
> >>> this.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
> >>>>
> >>>>
> >>>> Awesome to see CloudDataFlow coming to Apache. The Stream Processing
> >>>> area
> >>>> has been in general fragmented with a variety of solutions, hoping the
> >>>> community galvanizes around Apache Data Flow.
> >>>>
> >>>> We are still in the "Apache Storm" world. Any chance of folks building
> >>>> a "Storm Runner"?
> >>>>
> >>>>
> >>>> On 1/20/16, 9:39 AM, "James Malone" <jamesmal...@google.com.INVALID>
> >>>> wrote:
> >>>>
> >>>>>> Great proposal. I like that your proposal includes a well presented
> >>>>>> roadmap, but I don't see any goals that directly address building a
> >>>>>> larger
> >>>>>> community. Y'all have any ideas around outreach that will help with
> >>>>>> adoption?
> >>>>>>
> >>>>>
> >>>>> Thank you and fair point. We have a few additional ideas which we can
> >>>>> put
> >>>>> into the Community section.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> As a start, I recommend y'all add a section to the proposal on the
> >>>>>> wiki
> >>>>>> page for "Additional Interested Contributors" so that folks who want
> >>>>>> to
> >>>>>> sign up to participate in the project can do so without requesting
> >>>>>> additions to the initial committer list.
> >>>>>>
> >>>>>>
> >>>>> This is a great idea and I think it makes a lot of sense to add an
> >>>>> "Additional
> >>>>> Interested Contributors" section to the proposal.
> >>>>>
> >>>>>
> >>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
> >>>>>> jamesmal...@google.com.invalid> wrote:
> >>>>>>
> >>>>>>> Hello everyone,
> >>>>>>>
> >>>>>>> Attached to this message is a proposed new project - Apache
> Dataflow,
> >>>>>>
> >>>>>>
> >>>>>> a
> >>>>>>>
> >>>>>>>
> >>>>>>> unified programming model for data processing and integration.
> >>>>>>>
> >>>>>>> The text of the proposal is included below. Additionally, the
> >>>>>>
> >>>>>>
> >>>>>> proposal is
> >>>>>>>
> >>>>>>>
> >>>>>>> in draft form on the wiki where we will make any required changes:
> >>>>>>>
> >>>>>>> https://wiki.apache.org/incubator/DataflowProposal
> >>>>>>>
> >>>>>>> We look forward to your feedback and input.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> James
> >>>>>>>
> >>>>>>> ----
> >>>>>>>
> >>>>>>> = Apache Dataflow =
> >>>>>>>
> >>>>>>> == Abstract ==
> >>>>>>>
> >>>>>>> Dataflow is an open source, unified model and set of
> >>>>>>> language-specific
> >>>>>>
> >>>>>>
> >>>>>> SDKs
> >>>>>>>
> >>>>>>>
> >>>>>>> for defining and executing data processing workflows, and also data
> >>>>>>> ingestion and integration flows, supporting Enterprise Integration
> >>>>>>
> >>>>>>
> >>>>>> Patterns
> >>>>>>>
> >>>>>>>
> >>>>>>> (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines
> >>>>>>
> >>>>>>
> >>>>>> simplify
> >>>>>>>
> >>>>>>>
> >>>>>>> the mechanics of large-scale batch and streaming data processing
> and
> >>>>>>
> >>>>>>
> >>>>>> can
> >>>>>>>
> >>>>>>>
> >>>>>>> run on a number of runtimes like Apache Flink, Apache Spark, and
> >>>>>>
> >>>>>>
> >>>>>> Google
> >>>>>>>
> >>>>>>>
> >>>>>>> Cloud Dataflow (a cloud service). Dataflow also brings DSL in
> >>>>>>
> >>>>>>
> >>>>>> different
> >>>>>>>
> >>>>>>>
> >>>>>>> languages, allowing users to easily implement their data
> integration
> >>>>>>> processes.
> >>>>>>>
> >>>>>>> == Proposal ==
> >>>>>>>
> >>>>>>> Dataflow is a simple, flexible, and powerful system for distributed
> >>>>>>
> >>>>>>
> >>>>>> data
> >>>>>>>
> >>>>>>>
> >>>>>>> processing at any scale. Dataflow provides a unified programming
> >>>>>>
> >>>>>>
> >>>>>> model, a
> >>>>>>>
> >>>>>>>
> >>>>>>> software development kit to define and construct data processing
> >>>>>>
> >>>>>>
> >>>>>> pipelines,
> >>>>>>>
> >>>>>>>
> >>>>>>> and runners to execute Dataflow pipelines in several runtime
> engines,
> >>>>>>
> >>>>>>
> >>>>>> like
> >>>>>>>
> >>>>>>>
> >>>>>>> Apache Spark, Apache Flink, or Google Cloud Dataflow. Dataflow can
> be
> >>>>>>
> >>>>>>
> >>>>>> used
> >>>>>>>
> >>>>>>>
> >>>>>>> for a variety of streaming or batch data processing goals including
> >>>>>>
> >>>>>>
> >>>>>> ETL,
> >>>>>>>
> >>>>>>>
> >>>>>>> stream analysis, and aggregate computation. The underlying
> >>>>>>> programming
> >>>>>>> model for Dataflow provides MapReduce-like parallelism, combined
> with
> >>>>>>> support for powerful data windowing, and fine-grained correctness
> >>>>>>
> >>>>>>
> >>>>>> control.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> == Background ==
> >>>>>>>
> >>>>>>> Dataflow started as a set of Google projects focused on making data
> >>>>>>> processing easier, faster, and less costly. The Dataflow model is a
> >>>>>>> successor to MapReduce, FlumeJava, and MillWheel inside Google and
> is
> >>>>>>> focused on providing a unified solution for batch and stream
> >>>>>>
> >>>>>>
> >>>>>> processing.
> >>>>>>>
> >>>>>>>
> >>>>>>> These projects on which Dataflow is based have been published in
> >>>>>>
> >>>>>>
> >>>>>> several
> >>>>>>>
> >>>>>>>
> >>>>>>> papers made available to the public:
> >>>>>>>
> >>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
> >>>>>>>
> >>>>>>> * Dataflow model  -
> http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
> >>>>>>>
> >>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
> >>>>>>>
> >>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
> >>>>>>>
> >>>>>>> Dataflow was designed from the start to provide a portable
> >>>>>>> programming
> >>>>>>> layer. When you define a data processing pipeline with the Dataflow
> >>>>>>
> >>>>>>
> >>>>>> model,
> >>>>>>>
> >>>>>>>
> >>>>>>> you are creating a job which is capable of being processed by any
> >>>>>>
> >>>>>>
> >>>>>> number
> >>>>>> of
> >>>>>>>
> >>>>>>>
> >>>>>>> Dataflow processing engines. Several engines have been developed to
> >>>>>>
> >>>>>>
> >>>>>> run
> >>>>>>>
> >>>>>>>
> >>>>>>> Dataflow pipelines in other open source runtimes, including a
> >>>>>>> Dataflow
> >>>>>>> runner for Apache Flink and Apache Spark. There is also a "direct
> >>>>>>> runner", for execution on the developer machine (mainly for dev/debug
> >>>>>>> purposes).
> >>>>>>>
> >>>>>>>
> >>>>>>> Another runner allows a Dataflow program to run on a managed
> service,
> >>>>>>> Google Cloud Dataflow, in Google Cloud Platform. The Dataflow Java
> >>>>>>
> >>>>>>
> >>>>>> SDK is
> >>>>>>>
> >>>>>>>
> >>>>>>> already available on GitHub, and independent from the Google Cloud
> >>>>>>
> >>>>>>
> >>>>>> Dataflow
> >>>>>>>
> >>>>>>>
> >>>>>>> service. Another Python SDK is currently in active development.
> >>>>>>>
> >>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners
> will
> >>>>>>
> >>>>>>
> >>>>>> be
> >>>>>>>
> >>>>>>>
> >>>>>>> submitted as an OSS project under the ASF. The runners which are a
> >>>>>>
> >>>>>>
> >>>>>> part
> >>>>>> of
> >>>>>>>
> >>>>>>>
> >>>>>>> this proposal include those for Spark (from Cloudera), Flink (from
> >>>>>>
> >>>>>>
> >>>>>> data
> >>>>>>>
> >>>>>>>
> >>>>>>> Artisans), and local development (from Google); the Google Cloud
> >>>>>>
> >>>>>>
> >>>>>> Dataflow
> >>>>>>>
> >>>>>>>
> >>>>>>> service runner is not included in this proposal. Further references
> >>>>>>> to
> >>>>>>> Dataflow will refer to the Dataflow model, SDKs, and runners which
> >>>>>>
> >>>>>>
> >>>>>> are a
> >>>>>>>
> >>>>>>>
> >>>>>>> part of this proposal (Apache Dataflow) only. The initial
> submission
> >>>>>>
> >>>>>>
> >>>>>> will
> >>>>>>>
> >>>>>>>
> >>>>>>> contain the already-released Java SDK; Google intends to submit the
> >>>>>>
> >>>>>>
> >>>>>> Python
> >>>>>>>
> >>>>>>>
> >>>>>>> SDK later in the incubation process. The Google Cloud Dataflow
> >>>>>>> service
> >>>>>>
> >>>>>>
> >>>>>> will
> >>>>>>>
> >>>>>>>
> >>>>>>> continue to be one of many runners for Dataflow, built on Google
> >>>>>>> Cloud
> >>>>>>> Platform, to run Dataflow pipelines. Necessarily, Cloud Dataflow
> will
> >>>>>>> develop against the Apache project additions, updates, and changes.
> >>>>>>
> >>>>>>
> >>>>>> Google
> >>>>>>>
> >>>>>>>
> >>>>>>> Cloud Dataflow will become one user of Apache Dataflow and will
> >>>>>>
> >>>>>>
> >>>>>> participate
> >>>>>>>
> >>>>>>>
> >>>>>>> in the project openly and publicly.
> >>>>>>>
> >>>>>>> The Dataflow programming model has been designed with simplicity,
> >>>>>>> scalability, and speed as key tenets. In the Dataflow model, you
> >>>>>>> only
> >>>>>>
> >>>>>>
> >>>>>> need
> >>>>>>>
> >>>>>>>
> >>>>>>> to think about four top-level concepts when constructing your data
> >>>>>>> processing job:
> >>>>>>>
> >>>>>>> * Pipelines - The data processing job made of a series of
> >>>>>>> computations
> >>>>>>> including input, processing, and output
> >>>>>>>
> >>>>>>> * PCollections - Bounded (or unbounded) datasets which represent
> the
> >>>>>>
> >>>>>>
> >>>>>> input,
> >>>>>>>
> >>>>>>>
> >>>>>>> intermediate and output data in pipelines
> >>>>>>>
> >>>>>>> * PTransforms - A data processing step in a pipeline in which one
> or
> >>>>>>
> >>>>>>
> >>>>>> more
> >>>>>>>
> >>>>>>>
> >>>>>>> PCollections are an input and output
> >>>>>>>
> >>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which
> are
> >>>>>>
> >>>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> roots and endpoints of the pipeline
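
The four concepts listed above can be sketched with a toy in-memory model. This is plain Python written for illustration only, not the Dataflow SDK: the class names mirror the concepts, but every method and signature here (read, write, apply, expand, Map) is an assumption of the sketch.

```python
class PCollection:
    """A dataset flowing through the pipeline (bounded, in this toy version)."""

    def __init__(self, elements):
        self.elements = list(elements)

    def apply(self, transform):
        # A PTransform takes a PCollection in and gives a PCollection out.
        return transform.expand(self)


class Map:
    """A toy PTransform that applies a function to every element."""

    def __init__(self, fn):
        self.fn = fn

    def expand(self, pcoll):
        return PCollection(self.fn(x) for x in pcoll.elements)


class Pipeline:
    """The whole job: read from a source, transform, write to a sink."""

    def read(self, source):
        # Source: any iterable acts as the root of the pipeline here.
        return PCollection(source)

    def write(self, pcoll, sink):
        # Sink: a plain list acts as the endpoint of the pipeline here.
        sink.extend(pcoll.elements)


pipeline = Pipeline()
sink = []
lines = pipeline.read(["apache flink", "apache spark"])
upper = lines.apply(Map(str.upper))
lengths = upper.apply(Map(len))
pipeline.write(lengths, sink)
```

A real runner would distribute each step across machines; the toy version only shows how the four pieces fit together.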
> >>>>>>>
> >>>>>>> == Rationale ==
> >>>>>>>
> >>>>>>> With Dataflow, Google intended to develop a framework which allowed
> >>>>>>> developers to be maximally productive in defining the processing,
> and
> >>>>>>
> >>>>>>
> >>>>>> then
> >>>>>>>
> >>>>>>>
> >>>>>>> be able to execute the program at various levels of
> >>>>>>> latency/cost/completeness without re-architecting or re-writing it.
> >>>>>>
> >>>>>>
> >>>>>> This
> >>>>>>>
> >>>>>>>
> >>>>>>> goal was informed by Google's past experience developing several
> >>>>>>
> >>>>>>
> >>>>>> models,
> >>>>>>>
> >>>>>>>
> >>>>>>> frameworks, and tools useful for large-scale and distributed data
> >>>>>>> processing. While Google has previously published papers describing
> >>>>>>
> >>>>>>
> >>>>>> some
> >>>>>> of
> >>>>>>>
> >>>>>>>
> >>>>>>> its technologies, Google decided to take a different approach with
> >>>>>>> Dataflow. Google open-sourced the SDK and model alongside
> >>>>>>
> >>>>>>
> >>>>>> commercialization
> >>>>>>>
> >>>>>>>
> >>>>>>> of the idea and ahead of publishing papers on the topic. As a
> result,
> >>>>>>
> >>>>>>
> >>>>>> a
> >>>>>>>
> >>>>>>>
> >>>>>>> number of open source runtimes exist for Dataflow, such as the
> Apache
> >>>>>>
> >>>>>>
> >>>>>> Flink
> >>>>>>>
> >>>>>>>
> >>>>>>> and Apache Spark runners.
> >>>>>>>
> >>>>>>> We believe that submitting Dataflow as an Apache project will
> provide
> >>>>>>
> >>>>>>
> >>>>>> an
> >>>>>>>
> >>>>>>>
> >>>>>>> immediate, worthwhile, and substantial contribution to the open
> >>>>>>> source
> >>>>>>> community. As an incubating project, we believe Dataflow will have
> a
> >>>>>>
> >>>>>>
> >>>>>> better
> >>>>>>>
> >>>>>>>
> >>>>>>> opportunity to provide a meaningful contribution to OSS and also
> >>>>>>
> >>>>>>
> >>>>>> integrate
> >>>>>>>
> >>>>>>>
> >>>>>>> with other Apache projects.
> >>>>>>>
> >>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
> >>>>>>
> >>>>>>
> >>>>>> layer
> >>>>>>>
> >>>>>>>
> >>>>>>> for data processing. By providing an abstraction layer for data
> >>>>>>
> >>>>>>
> >>>>>> pipelines
> >>>>>>>
> >>>>>>>
> >>>>>>> and processing, data workflows can be increasingly portable,
> >>>>>>
> >>>>>>
> >>>>>> resilient to
> >>>>>>>
> >>>>>>>
> >>>>>>> breaking changes in tooling, and compatible across many execution
> >>>>>>
> >>>>>>
> >>>>>> engines,
> >>>>>>>
> >>>>>>>
> >>>>>>> runtimes, and open source projects.
> >>>>>>>
> >>>>>>> == Initial Goals ==
> >>>>>>>
> >>>>>>> We are breaking our initial goals into immediate (< 2 months),
> >>>>>>
> >>>>>>
> >>>>>> short-term
> >>>>>>>
> >>>>>>>
> >>>>>>> (2-4 months), and intermediate-term (> 4 months).
> >>>>>>>
> >>>>>>> Our immediate goals include the following:
> >>>>>>>
> >>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners
> into
> >>>>>>
> >>>>>>
> >>>>>> one
> >>>>>>>
> >>>>>>>
> >>>>>>> project
> >>>>>>>
> >>>>>>> * Plan for refactoring the existing Java SDK for better
> extensibility
> >>>>>>
> >>>>>>
> >>>>>> by
> >>>>>>>
> >>>>>>>
> >>>>>>> SDK and runner writers
> >>>>>>>
> >>>>>>> * Validating all dependencies are ASL 2.0 or compatible
> >>>>>>>
> >>>>>>> * Understanding and adapting to the Apache development process
> >>>>>>>
> >>>>>>> Our short-term goals include:
> >>>>>>>
> >>>>>>> * Moving the newly-merged lists, and build utilities to Apache
> >>>>>>>
> >>>>>>> * Start refactoring codebase and move code to Apache Git repo
> >>>>>>>
> >>>>>>> * Continue development of new features, functions, and fixes in the
> >>>>>>> Dataflow Java SDK, and Dataflow runners
> >>>>>>>
> >>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and
> >>>>>>> plan
> >>>>>>
> >>>>>>
> >>>>>> for
> >>>>>>>
> >>>>>>>
> >>>>>>> how to include new major ideas, modules, and runtimes
> >>>>>>>
> >>>>>>> * Establishment of easy and clear build/test framework for Dataflow
> >>>>>>
> >>>>>>
> >>>>>> and
> >>>>>>>
> >>>>>>>
> >>>>>>> associated runtimes; creation of testing, rollback, and validation
> >>>>>>
> >>>>>>
> >>>>>> policy
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> * Analysis and design for work needed to make Dataflow a better
> data
> >>>>>>> processing abstraction layer for multiple open source frameworks
> and
> >>>>>>> environments
> >>>>>>>
> >>>>>>> Finally, we have a number of intermediate-term goals:
> >>>>>>>
> >>>>>>> * Roadmapping, planning, and execution of integrations with other
> OSS
> >>>>>>
> >>>>>>
> >>>>>> and
> >>>>>>>
> >>>>>>>
> >>>>>>> non-OSS projects/products
> >>>>>>>
> >>>>>>> * Inclusion of additional SDK for Python, which is under active
> >>>>>>
> >>>>>>
> >>>>>> development
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> == Current Status ==
> >>>>>>>
> >>>>>>> === Meritocracy ===
> >>>>>>>
> >>>>>>> Dataflow was initially developed based on ideas from many employees
> >>>>>>
> >>>>>>
> >>>>>> within
> >>>>>>>
> >>>>>>>
> >>>>>>> Google. As an ASL OSS project on GitHub, the Dataflow SDK has
> >>>>>>> received
> >>>>>>> contributions from data Artisans, Cloudera Labs, and other
> individual
> >>>>>>> developers. As a project under incubation, we are committed to
> >>>>>>
> >>>>>>
> >>>>>> expanding
> >>>>>>>
> >>>>>>>
> >>>>>>> our effort to build an environment which supports a meritocracy. We
> >>>>>>
> >>>>>>
> >>>>>> are
> >>>>>>>
> >>>>>>>
> >>>>>>> focused on engaging the community and other related projects for
> >>>>>>
> >>>>>>
> >>>>>> support
> >>>>>>>
> >>>>>>>
> >>>>>>> and contributions. Moreover, we are committed to ensuring
> contributors
> >>>>>>
> >>>>>>
> >>>>>> and
> >>>>>>>
> >>>>>>>
> >>>>>>> committers to Dataflow come from a broad mix of organizations
> through
> >>>>>>
> >>>>>>
> >>>>>> a
> >>>>>>>
> >>>>>>>
> >>>>>>> merit-based decision process during incubation. We believe strongly
> >>>>>>> in
> >>>>>>
> >>>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> Dataflow model and are committed to growing an inclusive community
> of
> >>>>>>> Dataflow contributors.
> >>>>>>>
> >>>>>>> === Community ===
> >>>>>>>
> >>>>>>> The core of the Dataflow Java SDK has been developed by Google for
> >>>>>>> use
> >>>>>>
> >>>>>>
> >>>>>> with
> >>>>>>>
> >>>>>>>
> >>>>>>> Google Cloud Dataflow. Google has active community engagement in
> the
> >>>>>>
> >>>>>>
> >>>>>> SDK
> >>>>>>>
> >>>>>>>
> >>>>>>> GitHub repository (
> >>>>>>
> >>>>>>
> >>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
> >>>>>>>
> >>>>>>>
> >>>>>>> ),
> >>>>>>> on Stack Overflow (
> >>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow)
> and
> >>>>>>
> >>>>>>
> >>>>>> has
> >>>>>>>
> >>>>>>>
> >>>>>>> had contributions from a number of organizations and individuals.
> >>>>>>>
> >>>>>>> Every day, Cloud Dataflow is actively used by a number of
> >>>>>>> organizations
> >>>>>>
> >>>>>>
> >>>>>> and
> >>>>>>>
> >>>>>>>
> >>>>>>> institutions for batch and stream processing of data. We believe
> >>>>>>
> >>>>>>
> >>>>>> acceptance
> >>>>>>>
> >>>>>>>
> >>>>>>> will allow us to consolidate existing Dataflow-related work, grow
> the
> >>>>>>> Dataflow community, and deepen connections between Dataflow and
> other
> >>>>>>
> >>>>>>
> >>>>>> open
> >>>>>>>
> >>>>>>>
> >>>>>>> source projects.
> >>>>>>>
> >>>>>>> === Core Developers ===
> >>>>>>>
> >>>>>>> The core developers for Dataflow and the Dataflow runners are:
> >>>>>>>
> >>>>>>> * Frances Perry
> >>>>>>>
> >>>>>>> * Tyler Akidau
> >>>>>>>
> >>>>>>> * Davor Bonaci
> >>>>>>>
> >>>>>>> * Luke Cwik
> >>>>>>>
> >>>>>>> * Ben Chambers
> >>>>>>>
> >>>>>>> * Kenn Knowles
> >>>>>>>
> >>>>>>> * Dan Halperin
> >>>>>>>
> >>>>>>> * Daniel Mills
> >>>>>>>
> >>>>>>> * Mark Shields
> >>>>>>>
> >>>>>>> * Craig Chambers
> >>>>>>>
> >>>>>>> * Maximilian Michels
> >>>>>>>
> >>>>>>> * Tom White
> >>>>>>>
> >>>>>>> * Josh Wills
> >>>>>>>
> >>>>>>> === Alignment ===
> >>>>>>>
> >>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can
> >>>>>>> be
> >>>>>>> executed on Apache Spark or Apache Flink. Dataflow is also related
> to
> >>>>>>
> >>>>>>
> >>>>>> other
> >>>>>>>
> >>>>>>>
> >>>>>>> Apache projects, such as Apache Crunch. We plan on expanding
> >>>>>>
> >>>>>>
> >>>>>> functionality
> >>>>>>>
> >>>>>>>
> >>>>>>> for Dataflow runners, support for additional domain specific
> >>>>>>
> >>>>>>
> >>>>>> languages,
> >>>>>> and
> >>>>>>>
> >>>>>>>
> >>>>>>> increased portability so Dataflow is a powerful abstraction layer
> for
> >>>>>>
> >>>>>>
> >>>>>> data
> >>>>>>>
> >>>>>>>
> >>>>>>> processing.
> >>>>>>>
> >>>>>>> == Known Risks ==
> >>>>>>>
> >>>>>>> === Orphaned Products ===
> >>>>>>>
> >>>>>>> The Dataflow SDK is presently used by several organizations, from
> >>>>>>
> >>>>>>
> >>>>>> small
> >>>>>>>
> >>>>>>>
> >>>>>>> startups to Fortune 100 companies, to construct production
> pipelines
> >>>>>>
> >>>>>>
> >>>>>> which
> >>>>>>>
> >>>>>>>
> >>>>>>> are executed in Google Cloud Dataflow. Google has a long-term
> >>>>>>
> >>>>>>
> >>>>>> commitment
> >>>>>> to
> >>>>>>>
> >>>>>>>
> >>>>>>> advance the Dataflow SDK; moreover, Dataflow is seeing increasing
> >>>>>>
> >>>>>>
> >>>>>> interest,
> >>>>>>>
> >>>>>>>
> >>>>>>> development, and adoption from organizations outside of Google.
> >>>>>>>
> >>>>>>> === Inexperience with Open Source ===
> >>>>>>>
> >>>>>>> Google believes strongly in open source and the exchange of
> >>>>>>
> >>>>>>
> >>>>>> information
> >>>>>> to
> >>>>>>>
> >>>>>>>
> >>>>>>> advance new ideas and work. Examples of this commitment are active
> >>>>>>> OSS
> >>>>>>> projects such as Chromium (https://www.chromium.org) and
> Kubernetes (
> >>>>>>> http://kubernetes.io/). With Dataflow, we have tried to be
> >>>>>>
> >>>>>>
> >>>>>> increasingly
> >>>>>>>
> >>>>>>>
> >>>>>>> open and forward-looking; we have published a paper in the VLDB
> >>>>>>
> >>>>>>
> >>>>>> conference
> >>>>>>>
> >>>>>>>
> >>>>>>> describing the Dataflow model (
> >>>>>>> http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to
> >>>>>>
> >>>>>>
> >>>>>> release
> >>>>>>>
> >>>>>>>
> >>>>>>> the Dataflow SDK as open source software with the launch of Cloud
> >>>>>>
> >>>>>>
> >>>>>> Dataflow.
> >>>>>>>
> >>>>>>>
> >>>>>>> Our submission to the Apache Software Foundation is a logical
> >>>>>>
> >>>>>>
> >>>>>> extension
> >>>>>> of
> >>>>>>>
> >>>>>>>
> >>>>>>> our commitment to open source software.
> >>>>>>>
> >>>>>>> === Homogeneous Developers ===
> >>>>>>>
> >>>>>>> The majority of committers in this proposal belong to Google due to
> >>>>>>
> >>>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> fact that Dataflow has emerged from several internal Google
> projects.
> >>>>>>
> >>>>>>
> >>>>>> This
> >>>>>>>
> >>>>>>>
> >>>>>>> proposal also includes committers outside of Google who are
> actively
> >>>>>>> involved with other Apache projects, such as Hadoop, Flink, and
> >>>>>>> Spark.
> >>>>>>
> >>>>>>
> >>>>>> We
> >>>>>>>
> >>>>>>>
> >>>>>>> expect our entry into incubation will allow us to expand the number
> >>>>>>> of
> >>>>>>> individuals and organizations participating in Dataflow
> development.
> >>>>>>> Additionally, separation of the Dataflow SDK from Google Cloud
> >>>>>>
> >>>>>>
> >>>>>> Dataflow
> >>>>>>>
> >>>>>>>
> >>>>>>> allows us to focus on the open source SDK and model and do what is
> >>>>>>
> >>>>>>
> >>>>>> best
> >>>>>> for
> >>>>>>>
> >>>>>>>
> >>>>>>> this project.
> >>>>>>>
> >>>>>>> === Reliance on Salaried Developers ===
> >>>>>>>
> >>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily
> >>>>>>> by
> >>>>>>> salaried developers supporting the Google Cloud Dataflow project.
> >>>>>>
> >>>>>>
> >>>>>> While
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> Dataflow SDK and Cloud Dataflow have been developed by different
> >>>>>>> teams
> >>>>>>
> >>>>>>
> >>>>>> (and
> >>>>>>>
> >>>>>>>
> >>>>>>> this proposal would reinforce that separation) we expect our
> initial
> >>>>>>
> >>>>>>
> >>>>>> set
> >>>>>> of
> >>>>>>>
> >>>>>>>
> >>>>>>> developers will still primarily be salaried. Contribution has not
> >>>>>>> been
> >>>>>>> exclusively from salaried developers, however. For example, the
> >>>>>>
> >>>>>>
> >>>>>> contrib
> >>>>>>>
> >>>>>>>
> >>>>>>> directory of the Dataflow SDK (
> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib)
> >>>>>>> contains items from free-time contributors. Moreover, separate
> >>>>>>
> >>>>>>
> >>>>>> projects,
> >>>>>>>
> >>>>>>>
> >>>>>>> such as ScalaFlow (https://github.com/darkjh/scalaflow) have been
> >>>>>>
> >>>>>>
> >>>>>> created
> >>>>>>>
> >>>>>>>
> >>>>>>> around the Dataflow model and SDK. We expect our reliance on
> salaried
> >>>>>>> developers will decrease over time during incubation.
> >>>>>>>
> >>>>>>> === Relationship with other Apache products ===
> >>>>>>>
> >>>>>>> Dataflow directly interoperates with or utilizes several existing
> >>>>>>
> >>>>>>
> >>>>>> Apache
> >>>>>>>
> >>>>>>>
> >>>>>>> projects.
> >>>>>>>
> >>>>>>> * Build
> >>>>>>>
> >>>>>>> ** Apache Maven
> >>>>>>>
> >>>>>>> * Data I/O, Libraries
> >>>>>>>
> >>>>>>> ** Apache Avro
> >>>>>>>
> >>>>>>> ** Apache Commons
> >>>>>>>
> >>>>>>> * Dataflow runners
> >>>>>>>
> >>>>>>> ** Apache Flink
> >>>>>>>
> >>>>>>> ** Apache Spark
> >>>>>>>
> >>>>>>> Dataflow when used in batch mode shares similarities with Apache
> >>>>>>
> >>>>>>
> >>>>>> Crunch;
> >>>>>>>
> >>>>>>>
> >>>>>>> however, Dataflow is focused on a model, SDK, and abstraction layer
> >>>>>>
> >>>>>>
> >>>>>> beyond
> >>>>>>>
> >>>>>>>
> >>>>>>> Spark and Hadoop (MapReduce). One key goal of Dataflow is to
> provide
> >>>>>>
> >>>>>>
> >>>>>> an
> >>>>>>>
> >>>>>>>
> >>>>>>> intermediate abstraction layer which can easily be implemented and
> >>>>>>
> >>>>>>
> >>>>>> utilized
> >>>>>>>
> >>>>>>>
> >>>>>>> across several different processing frameworks.
> >>>>>>>
> >>>>>>> === An excessive fascination with the Apache brand ===
> >>>>>>>
> >>>>>>> With this proposal we are not seeking attention or publicity.
> Rather,
> >>>>>>
> >>>>>>
> >>>>>> we
> >>>>>>>
> >>>>>>>
> >>>>>>> firmly believe in the Dataflow model, SDK, and the ability to make
> >>>>>>
> >>>>>>
> >>>>>> Dataflow
> >>>>>>>
> >>>>>>>
> >>>>>>> a powerful yet simple framework for data processing. While the
> >>>>>>
> >>>>>>
> >>>>>> Dataflow
> >>>>>> SDK
> >>>>>>>
> >>>>>>>
> >>>>>>> and model have been open source, we believe putting code on GitHub
> >>>>>>> can
> >>>>>>
> >>>>>>
> >>>>>> only
> >>>>>>>
> >>>>>>>
> >>>>>>> go so far. We see the Apache community, processes, and mission as
> >>>>>>
> >>>>>>
> >>>>>> critical
> >>>>>>>
> >>>>>>>
> >>>>>>> for ensuring the Dataflow SDK and model are truly community-driven,
> >>>>>>> positively impactful, and innovative open source software. While
> >>>>>>
> >>>>>>
> >>>>>> Google
> >>>>>> has
> >>>>>>>
> >>>>>>>
> >>>>>>> taken a number of steps to advance its various open source
> projects,
> >>>>>>
> >>>>>>
> >>>>>> we
> >>>>>>>
> >>>>>>>
> >>>>>>> believe Dataflow is a great fit for the Apache Software Foundation
> >>>>>>
> >>>>>>
> >>>>>> due to
> >>>>>>>
> >>>>>>>
> >>>>>>> its focus on data processing and its relationships to existing ASF
> >>>>>>> projects.
> >>>>>>>
> >>>>>>> == Documentation ==
> >>>>>>>
> >>>>>>> The following documentation is relevant to this proposal. Relevant
> >>>>>>> portions of the documentation will be contributed to the Apache Dataflow
> >>>>>>
> >>>>>>
> >>>>>> project.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> * Dataflow website: https://cloud.google.com/dataflow
> >>>>>>>
> >>>>>>> * Dataflow programming model:
> >>>>>>> https://cloud.google.com/dataflow/model/programming-model
> >>>>>>>
> >>>>>>> * Codebases
> >>>>>>>
> >>>>>>> ** Dataflow Java SDK:
> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
> >>>>>>>
> >>>>>>> ** Flink Dataflow runner:
> >>>>>>
> >>>>>>
> >>>>>> https://github.com/dataArtisans/flink-dataflow
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ** Spark Dataflow runner:
> https://github.com/cloudera/spark-dataflow
> >>>>>>>
> >>>>>>> * Dataflow Java SDK issue tracker:
> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
> >>>>>>>
> >>>>>>> * google-cloud-dataflow tag on Stack Overflow:
> >>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow
> >>>>>>>
> >>>>>>> == Initial Source ==
> >>>>>>>
> >>>>>>> The initial source for Dataflow which we will submit to the Apache
> >>>>>>> Foundation will include several related projects which are
> currently
> >>>>>>
> >>>>>>
> >>>>>> hosted
> >>>>>>>
> >>>>>>>
> >>>>>>> on the GitHub repositories:
> >>>>>>>
> >>>>>>> * Dataflow Java SDK (
> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
> >>>>>>>
> >>>>>>> * Flink Dataflow runner
> >>>>>>
> >>>>>>
> >>>>>> (https://github.com/dataArtisans/flink-dataflow)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> * Spark Dataflow runner (
> https://github.com/cloudera/spark-dataflow)
> >>>>>>>
> >>>>>>> These projects have always been Apache 2.0 licensed. We intend to
> >>>>>>
> >>>>>>
> >>>>>> bundle
> >>>>>>>
> >>>>>>>
> >>>>>>> all of these repositories since they are all complementary and
> should
> >>>>>>
> >>>>>>
> >>>>>> be
> >>>>>>>
> >>>>>>>
> >>>>>>> maintained in one project. Prior to our submission, we will combine
> >>>>>>
> >>>>>>
> >>>>>> all
> >>>>>> of
> >>>>>>>
> >>>>>>>
> >>>>>>> these projects into a new git repository.
> >>>>>>>
> >>>>>>> == Source and Intellectual Property Submission Plan ==
> >>>>>>>
> >>>>>>> The source for the Dataflow SDK and the three runners (Spark,
> Flink,
> >>>>>>
> >>>>>>
> >>>>>> Google
> >>>>>>>
> >>>>>>>
> >>>>>>> Cloud Dataflow) are already licensed under an Apache 2 license.
> >>>>>>>
> >>>>>>> * Dataflow SDK -
> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> * Flink runner -
> >>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
> >>>>>>>
> >>>>>>> * Spark runner -
> >>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
> >>>>>>>
> >>>>>>> Contributors to the Dataflow SDK have also signed the Google
> >>>>>>
> >>>>>>
> >>>>>> Individual
> >>>>>>>
> >>>>>>>
> >>>>>>> Contributor License Agreement (
> >>>>>>> https://cla.developers.google.com/about/google-individual) in
> order
> >>>>>>> to
> >>>>>>> contribute to the project.
> >>>>>>>
> >>>>>>> With respect to trademark rights, Google does not hold a trademark
> on
> >>>>>>
> >>>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> phrase "Dataflow." Based on feedback and guidance we receive during
> >>>>>>
> >>>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> incubation process, we are open to renaming the project if
> necessary
> >>>>>>
> >>>>>>
> >>>>>> for
> >>>>>>>
> >>>>>>>
> >>>>>>> trademark or other concerns.
> >>>>>>>
> >>>>>>> == External Dependencies ==
> >>>>>>>
> >>>>>>> All external dependencies are licensed under an Apache 2.0 or
> >>>>>>> Apache-compatible license. As we grow the Dataflow community we
> will
> >>>>>>> configure our build process to require and validate all
> contributions
> >>>>>>
> >>>>>>
> >>>>>> and
> >>>>>>>
> >>>>>>>
> >>>>>>> dependencies are licensed under the Apache 2.0 license or are under
> >>>>>>> an
> >>>>>>> Apache-compatible license.
> >>>>>>>
> >>>>>>> == Required Resources ==
> >>>>>>>
> >>>>>>> === Mailing Lists ===
> >>>>>>>
> >>>>>>> We currently use a mix of mailing lists. We will migrate our
> existing
> >>>>>>> mailing lists to the following:
> >>>>>>>
> >>>>>>> * d...@dataflow.incubator.apache.org
> >>>>>>>
> >>>>>>> * u...@dataflow.incubator.apache.org
> >>>>>>>
> >>>>>>> * priv...@dataflow.incubator.apache.org
> >>>>>>>
> >>>>>>> * comm...@dataflow.incubator.apache.org
> >>>>>>>
> >>>>>>> === Source Control ===
> >>>>>>>
> >>>>>>> The Dataflow team currently uses Git and would like to continue to
> do
> >>>>>>
> >>>>>>
> >>>>>> so.
> >>>>>>>
> >>>>>>>
> >>>>>>> We request a Git repository for Dataflow with mirroring to GitHub
> >>>>>>
> >>>>>>
> >>>>>> enabled.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> === Issue Tracking ===
> >>>>>>>
> >>>>>>> We request the creation of an Apache-hosted JIRA. The Dataflow
> >>>>>>
> >>>>>>
> >>>>>> project is
> >>>>>>>
> >>>>>>>
> >>>>>>> currently using both a public GitHub issue tracker and internal
> >>>>>>> Google
> >>>>>>> issue tracking. We will migrate and combine from these two sources
> to
> >>>>>>
> >>>>>>
> >>>>>> the
> >>>>>>>
> >>>>>>>
> >>>>>>> Apache JIRA.
> >>>>>>>
> >>>>>>> == Initial Committers ==
> >>>>>>>
> >>>>>>> * Aljoscha Krettek     [aljos...@apache.org]
> >>>>>>>
> >>>>>>> * Amit Sela            [amitsel...@gmail.com]
> >>>>>>>
> >>>>>>> * Ben Chambers         [bchamb...@google.com]
> >>>>>>>
> >>>>>>> * Craig Chambers       [chamb...@google.com]
> >>>>>>>
> >>>>>>> * Dan Halperin         [dhalp...@google.com]
> >>>>>>>
> >>>>>>> * Davor Bonaci         [da...@google.com]
> >>>>>>>
> >>>>>>> * Frances Perry        [f...@google.com]
> >>>>>>>
> >>>>>>> * James Malone         [jamesmal...@google.com]
> >>>>>>>
> >>>>>>> * Jean-Baptiste Onofré [jbono...@apache.org]
> >>>>>>>
> >>>>>>> * Josh Wills           [jwi...@apache.org]
> >>>>>>>
> >>>>>>> * Kostas Tzoumas       [kos...@data-artisans.com]
> >>>>>>>
> >>>>>>> * Kenneth Knowles      [k...@google.com]
> >>>>>>>
> >>>>>>> * Luke Cwik            [lc...@google.com]
> >>>>>>>
> >>>>>>> * Maximilian Michels   [m...@apache.org]
> >>>>>>>
> >>>>>>> * Stephan Ewen         [step...@data-artisans.com]
> >>>>>>>
> >>>>>>> * Tom White            [t...@cloudera.com]
> >>>>>>>
> >>>>>>> * Tyler Akidau         [taki...@google.com]
> >>>>>>>
> >>>>>>> == Affiliations ==
> >>>>>>>
> >>>>>>> The initial committers are from six organizations. Google developed
> >>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink runner,
> >>>>>>> and Cloudera (Labs) developed the Spark runner.
> >>>>>>>
> >>>>>>> * Cloudera
> >>>>>>>
> >>>>>>> ** Tom White
> >>>>>>>
> >>>>>>> * Data Artisans
> >>>>>>>
> >>>>>>> ** Aljoscha Krettek
> >>>>>>>
> >>>>>>> ** Kostas Tzoumas
> >>>>>>>
> >>>>>>> ** Maximilian Michels
> >>>>>>>
> >>>>>>> ** Stephan Ewen
> >>>>>>>
> >>>>>>> * Google
> >>>>>>>
> >>>>>>> ** Ben Chambers
> >>>>>>>
> >>>>>>> ** Dan Halperin
> >>>>>>>
> >>>>>>> ** Davor Bonaci
> >>>>>>>
> >>>>>>> ** Frances Perry
> >>>>>>>
> >>>>>>> ** James Malone
> >>>>>>>
> >>>>>>> ** Kenneth Knowles
> >>>>>>>
> >>>>>>> ** Luke Cwik
> >>>>>>>
> >>>>>>> ** Tyler Akidau
> >>>>>>>
> >>>>>>> * PayPal
> >>>>>>>
> >>>>>>> ** Amit Sela
> >>>>>>>
> >>>>>>> * Slack
> >>>>>>>
> >>>>>>> ** Josh Wills
> >>>>>>>
> >>>>>>> * Talend
> >>>>>>>
> >>>>>>> ** Jean-Baptiste Onofré
> >>>>>>>
> >>>>>>> == Sponsors ==
> >>>>>>>
> >>>>>>> === Champion ===
> >>>>>>>
> >>>>>>> * Jean-Baptiste Onofré      [jbono...@apache.org]
> >>>>>>>
> >>>>>>> === Nominated Mentors ===
> >>>>>>>
> >>>>>>> * Jim Jagielski           [j...@apache.org]
> >>>>>>>
> >>>>>>> * Venkatesh Seetharam     [venkat...@apache.org]
> >>>>>>>
> >>>>>>> * Bertrand Delacretaz     [bdelacre...@apache.org]
> >>>>>>>
> >>>>>>> * Ted Dunning             [tdunn...@apache.org]
> >>>>>>>
> >>>>>>> === Sponsoring Entity ===
> >>>>>>>
> >>>>>>> The Apache Incubator
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Sean
> >>>>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >>>> For additional commands, e-mail: general-h...@incubator.apache.org
> >>>>
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>
> >>
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
