Great proposal. I would also like to contribute to the project, especially
the Python SDK, if possible.

Cheers
Ajay Yadava

On Sun, Jan 24, 2016 at 1:25 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Seshu,
>
> it does both: streaming and batch data processing.
>
> Regards
> JB
>
> On 01/23/2016 03:01 PM, Adunuthula, Seshu wrote:
>
>> Did not get a chance to play with it yet. Within Google, is it used more
>> as a MR replacement or a stream processing engine? Or does it do both of
>> them fantastically well?
>>
>>
>> On 1/22/16, 10:58 AM, "Frances Perry" <f...@google.com.INVALID> wrote:
>>
>>> Crunch started as a clone of FlumeJava, which was Google internal. In the
>>> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
>>> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
>>> adds a number of new things -- the biggest being unified batch/streaming
>>> semantics using concepts like Windowing and Triggers. Tyler Akidau's
>>> O'Reilly post has a really nice explanation:
>>> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
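>>>
>>> For a concrete flavor, a minimal sketch of windowed aggregation with the
>>> Dataflow Java SDK (illustrative only; the unbounded input and the trigger
>>> settings here are assumptions):
>>>
>>> // assumes com.google.cloud.dataflow.sdk.transforms.Count,
>>> // com.google.cloud.dataflow.sdk.transforms.windowing.{Window, FixedWindows, AfterWatermark},
>>> // com.google.cloud.dataflow.sdk.values.{PCollection, KV}, and org.joda.time.Duration
>>> PCollection<String> events = ...;  // an unbounded input, e.g. from a streaming source
>>>
>>> PCollection<KV<String, Long>> counts = events
>>>     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
>>>         .triggering(AfterWatermark.pastEndOfWindow())
>>>         .withAllowedLateness(Duration.standardMinutes(5))
>>>         .discardingFiredPanes())
>>>     .apply(Count.<String>perElement());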
>>>
>>> On Fri, Jan 22, 2016 at 10:42 AM, Ashish <paliwalash...@gmail.com>
>>> wrote:
>>>
>>> Crunch has Spark pipelines, but not sure about the runner abstraction.
>>>>
>>>> Maybe Josh Wills or Tom White can provide more insight on this topic.
>>>> They are core devs for both projects :)
>>>>
>>>> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I don't know Crunch deeply, but AFAIK Crunch creates MapReduce
>>>>> pipelines; it doesn't provide a runner abstraction. It's based on
>>>>> FlumeJava.
>>>>>
>>>>> The logic is very similar (with DoFns, pipelines, ...). Correct me if
>>>>> I'm wrong, but Crunch started after Google Dataflow, especially because
>>>>> Dataflow was not open-sourced at that time.
>>>>>
>>>>> So, I agree it's very similar/close.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>>
>>>>> On 01/22/2016 05:51 PM, Ashish wrote:
>>>>>
>>>>>>
>>>>>> Hi JB,
>>>>>>
>>>>>> Curious to know how it compares to Apache Crunch? The constructs
>>>>>> look very familiar (I had used Crunch long ago).
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> - Ashish
>>>>>>
>>>>>> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi Seshu,
>>>>>>>
>>>>>>> I blogged about the Apache Dataflow proposal:
>>>>>>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>>>>>>>
>>>>>>> You can see in the "what's next?" section that new runners, skins,
>>>>>>> and sources are on our roadmap. Definitely, a Storm runner could be
>>>>>>> part of this.
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>>
>>>>>>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Awesome to see Cloud Dataflow coming to Apache. The stream
>>>>>>>> processing area has in general been fragmented with a variety of
>>>>>>>> solutions; hoping the community galvanizes around Apache Dataflow.
>>>>>>>>
>>>>>>>> We are still in the "Apache Storm" world. Any chance for folks
>>>>>>>> building a "Storm Runner"?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/20/16, 9:39 AM, "James Malone" <jamesmal...@google.com.INVALID>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> Great proposal. I like that your proposal includes a well
>>>>>>>>>> presented roadmap, but I don't see any goals that directly address
>>>>>>>>>> building a larger community. Y'all have any ideas around outreach
>>>>>>>>>> that will help with adoption?
>>>>>>>>>
>>>>>>>>> Thank you and fair point. We have a few additional ideas which we
>>>>>>>>> can put into the Community section.
>>>>>>>>>
>>>>>>>>>> As a start, I recommend y'all add a section to the proposal on the
>>>>>>>>>> wiki page for "Additional Interested Contributors" so that folks
>>>>>>>>>> who want to sign up to participate in the project can do so without
>>>>>>>>>> requesting additions to the initial committer list.
>>>>>>>>>
>>>>>>>>> This is a great idea and I think it makes a lot of sense to add an
>>>>>>>>> "Additional Interested Contributors" section to the proposal.
>>>>>>>>>
>>>>>>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <jamesmal...@google.com.invalid>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>
>>>>>>>>>>> Attached to this message is a proposed new project - Apache Dataflow,
>>>>>>>>>>> a unified programming model for data processing and integration.
>>>>>>>>>>>
>>>>>>>>>>> The text of the proposal is included below. Additionally, the
>>>>>>>>>>> proposal is in draft form on the wiki where we will make any
>>>>>>>>>>> required changes:
>>>>>>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>>>>>>>>>>>
>>>>>>>>>>> We look forward to your feedback and input.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> James
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>>
>>>>>>>>>>> = Apache Dataflow =
>>>>>>>>>>>
>>>>>>>>>>> == Abstract ==
>>>>>>>>>>>
>>>>>>>>>>> Dataflow is an open source, unified model and set of
>>>>>>>>>>> language-specific SDKs for defining and executing data processing
>>>>>>>>>>> workflows, and also data ingestion and integration flows, supporting
>>>>>>>>>>> Enterprise Integration Patterns (EIPs) and Domain Specific Languages
>>>>>>>>>>> (DSLs). Dataflow pipelines simplify the mechanics of large-scale
>>>>>>>>>>> batch and streaming data processing and can run on a number of
>>>>>>>>>>> runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow
>>>>>>>>>>> (a cloud service). Dataflow also brings DSLs in different languages,
>>>>>>>>>>> allowing users to easily implement their data integration processes.
>>>>>>>>>>>
>>>>>>>>>>> == Proposal ==
>>>>>>>>>>>
>>>>>>>>>>> Dataflow is a simple, flexible, and powerful system for distributed
>>>>>>>>>>> data processing at any scale. Dataflow provides a unified
>>>>>>>>>>> programming model, a software development kit to define and
>>>>>>>>>>> construct data processing pipelines, and runners to execute Dataflow
>>>>>>>>>>> pipelines in several runtime engines, like Apache Spark, Apache
>>>>>>>>>>> Flink, or Google Cloud Dataflow. Dataflow can be used for a variety
>>>>>>>>>>> of streaming or batch data processing goals including ETL, stream
>>>>>>>>>>> analysis, and aggregate computation. The underlying programming
>>>>>>>>>>> model for Dataflow provides MapReduce-like parallelism, combined
>>>>>>>>>>> with support for powerful data windowing and fine-grained
>>>>>>>>>>> correctness control.
>>>>>>>>>>>
>>>>>>>>>>> == Background ==
>>>>>>>>>>>
>>>>>>>>>>> Dataflow started as a set of Google projects focused on making data
>>>>>>>>>>> processing easier, faster, and less costly. The Dataflow model is a
>>>>>>>>>>> successor to MapReduce, FlumeJava, and MillWheel inside Google and
>>>>>>>>>>> is focused on providing a unified solution for batch and stream
>>>>>>>>>>> processing. These projects on which Dataflow is based have been
>>>>>>>>>>> published in several papers made available to the public:
>>>>>>>>>>>
>>>>>>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>>>>>>>>>>>
>>>>>>>>>>> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>>>>>>>>>>>
>>>>>>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>>>>>>>>>>>
>>>>>>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>>>>>>>>>>>
>>>>>>>>>>> Dataflow was designed from the start to provide a portable
>>>>>>>>>>> programming layer. When you define a data processing pipeline with
>>>>>>>>>>> the Dataflow model, you are creating a job which is capable of being
>>>>>>>>>>> processed by any number of Dataflow processing engines. Several
>>>>>>>>>>> engines have been developed to run Dataflow pipelines in other open
>>>>>>>>>>> source runtimes, including a Dataflow runner for Apache Flink and
>>>>>>>>>>> Apache Spark. There is also a "direct runner" for execution on the
>>>>>>>>>>> developer machine (mainly for dev/debug purposes). Another runner
>>>>>>>>>>> allows a Dataflow program to run on a managed service, Google Cloud
>>>>>>>>>>> Dataflow, in Google Cloud Platform. The Dataflow Java SDK is already
>>>>>>>>>>> available on GitHub, and independent from the Google Cloud Dataflow
>>>>>>>>>>> service. A Python SDK is currently in active development.
>>>>>>>>>>>
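>>>>>>>>>>> As a rough sketch of that portability (illustrative only, assuming
>>>>>>>>>>> the Pipeline, PipelineOptions, and DirectPipelineRunner classes from
>>>>>>>>>>> the current Java SDK; the Flink and Spark runner classes ship with
>>>>>>>>>>> their respective projects and are not shown):
>>>>>>>>>>>
>>>>>>>>>>> PipelineOptions options = PipelineOptionsFactory.create();
>>>>>>>>>>> // Local execution for development and debugging; swapping in the
>>>>>>>>>>> // Flink, Spark, or Cloud Dataflow runner class changes where the
>>>>>>>>>>> // same pipeline runs, not how it is written.
>>>>>>>>>>> options.setRunner(DirectPipelineRunner.class);
>>>>>>>>>>> Pipeline p = Pipeline.create(options);
>>>>>>>>>>>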
>>>>>>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners
>>>>>>>>>>> will be submitted as an OSS project under the ASF. The runners which
>>>>>>>>>>> are a part of this proposal include those for Spark (from Cloudera),
>>>>>>>>>>> Flink (from data Artisans), and local development (from Google); the
>>>>>>>>>>> Google Cloud Dataflow service runner is not included in this
>>>>>>>>>>> proposal. Further references to Dataflow will refer to the Dataflow
>>>>>>>>>>> model, SDKs, and runners which are a part of this proposal (Apache
>>>>>>>>>>> Dataflow) only. The initial submission will contain the
>>>>>>>>>>> already-released Java SDK; Google intends to submit the Python SDK
>>>>>>>>>>> later in the incubation process. The Google Cloud Dataflow service
>>>>>>>>>>> will continue to be one of many runners for Dataflow, built on
>>>>>>>>>>> Google Cloud Platform, to run Dataflow pipelines. Necessarily, Cloud
>>>>>>>>>>> Dataflow will develop against the Apache project additions, updates,
>>>>>>>>>>> and changes. Google Cloud Dataflow will become one user of Apache
>>>>>>>>>>> Dataflow and will participate in the project openly and publicly.
>>>>>>>>>>>
>>>>>>>>>>> The Dataflow programming model has been designed with simplicity,
>>>>>>>>>>> scalability, and speed as key tenets. In the Dataflow model, you
>>>>>>>>>>> only need to think about four top-level concepts when constructing
>>>>>>>>>>> your data processing job:
>>>>>>>>>>>
>>>>>>>>>>> * Pipelines - The data processing job made of a series of
>>>>>>>>>>> computations including input, processing, and output
>>>>>>>>>>>
>>>>>>>>>>> * PCollections - Bounded (or unbounded) datasets which represent the
>>>>>>>>>>> input, intermediate, and output data in pipelines
>>>>>>>>>>>
>>>>>>>>>>> * PTransforms - A data processing step in a pipeline in which one or
>>>>>>>>>>> more PCollections are an input and output
>>>>>>>>>>>
>>>>>>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which
>>>>>>>>>>> are the roots and endpoints of the pipeline
>>>>>>>>>>>
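>>>>>>>>>>> For illustration, a minimal sketch tying these four concepts
>>>>>>>>>>> together with the Dataflow Java SDK (1.x API); the class name and
>>>>>>>>>>> file paths are placeholders:
>>>>>>>>>>>
>>>>>>>>>>> import com.google.cloud.dataflow.sdk.Pipeline;
>>>>>>>>>>> import com.google.cloud.dataflow.sdk.io.TextIO;
>>>>>>>>>>> import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
>>>>>>>>>>> import com.google.cloud.dataflow.sdk.transforms.DoFn;
>>>>>>>>>>> import com.google.cloud.dataflow.sdk.transforms.ParDo;
>>>>>>>>>>> import com.google.cloud.dataflow.sdk.values.PCollection;
>>>>>>>>>>>
>>>>>>>>>>> public class MinimalPipeline {
>>>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>>>     // Pipeline: the overall data processing job.
>>>>>>>>>>>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>>>>>>>>>>>
>>>>>>>>>>>     // I/O Source: read lines of text into a PCollection.
>>>>>>>>>>>     PCollection<String> lines = p.apply(TextIO.Read.from("/tmp/input.txt"));
>>>>>>>>>>>
>>>>>>>>>>>     // PTransform: a ParDo whose DoFn maps each element to upper case.
>>>>>>>>>>>     PCollection<String> shouted = lines.apply(ParDo.of(new DoFn<String, String>() {
>>>>>>>>>>>       @Override
>>>>>>>>>>>       public void processElement(ProcessContext c) {
>>>>>>>>>>>         c.output(c.element().toUpperCase());
>>>>>>>>>>>       }
>>>>>>>>>>>     }));
>>>>>>>>>>>
>>>>>>>>>>>     // I/O Sink: write the resulting PCollection back out as text.
>>>>>>>>>>>     shouted.apply(TextIO.Write.to("/tmp/output"));
>>>>>>>>>>>
>>>>>>>>>>>     p.run();
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>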
>>>>>>>>>>> == Rationale ==
>>>>>>>>>>>
>>>>>>>>>>> With Dataflow, Google intended to develop a framework which allowed
>>>>>>>>>>> developers to be maximally productive in defining the processing,
>>>>>>>>>>> and then be able to execute the program at various levels of
>>>>>>>>>>> latency/cost/completeness without re-architecting or re-writing it.
>>>>>>>>>>> This goal was informed by Google's past experience developing
>>>>>>>>>>> several models, frameworks, and tools useful for large-scale and
>>>>>>>>>>> distributed data processing. While Google has previously published
>>>>>>>>>>> papers describing some of its technologies, Google decided to take a
>>>>>>>>>>> different approach with Dataflow. Google open-sourced the SDK and
>>>>>>>>>>> model alongside commercialization of the idea and ahead of
>>>>>>>>>>> publishing papers on the topic. As a result, a number of open source
>>>>>>>>>>> runtimes exist for Dataflow, such as the Apache Flink and Apache
>>>>>>>>>>> Spark runners.
>>>>>>>>>>>
>>>>>>>>>>> We believe that submitting Dataflow as an Apache project will
>>>>>>>>>>> provide an immediate, worthwhile, and substantial contribution to
>>>>>>>>>>> the open source community. As an incubating project, we believe
>>>>>>>>>>> Dataflow will have a better opportunity to provide a meaningful
>>>>>>>>>>> contribution to OSS and also integrate with other Apache projects.
>>>>>>>>>>>
>>>>>>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
>>>>>>>>>>> layer for data processing. By providing an abstraction layer for
>>>>>>>>>>> data pipelines and processing, data workflows can be increasingly
>>>>>>>>>>> portable, resilient to breaking changes in tooling, and compatible
>>>>>>>>>>> across many execution engines, runtimes, and open source projects.
>>>>>>>>>>>
>>>>>>>>>>> == Initial Goals ==
>>>>>>>>>>>
>>>>>>>>>>> We are breaking our initial goals into immediate (< 2 months),
>>>>>>>>>>> short-term (2-4 months), and intermediate-term (> 4 months).
>>>>>>>>>>>
>>>>>>>>>>> Our immediate goals include the following:
>>>>>>>>>>>
>>>>>>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners
>>>>>>>>>>> into one project
>>>>>>>>>>>
>>>>>>>>>>> * Plan for refactoring the existing Java SDK for better
>>>>>>>>>>> extensibility by SDK and runner writers
>>>>>>>>>>>
>>>>>>>>>>> * Validating all dependencies are ASL 2.0 or compatible
>>>>>>>>>>>
>>>>>>>>>>> * Understanding and adapting to the Apache development process
>>>>>>>>>>>
>>>>>>>>>>> Our short-term goals include:
>>>>>>>>>>>
>>>>>>>>>>> * Moving the newly-merged lists and build utilities to Apache
>>>>>>>>>>>
>>>>>>>>>>> * Starting to refactor the codebase and moving code to the Apache
>>>>>>>>>>> Git repo
>>>>>>>>>>>
>>>>>>>>>>> * Continuing development of new features, functions, and fixes in
>>>>>>>>>>> the Dataflow Java SDK and Dataflow runners
>>>>>>>>>>>
>>>>>>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and
>>>>>>>>>>> plan for how to include new major ideas, modules, and runtimes
>>>>>>>>>>>
>>>>>>>>>>> * Establishment of an easy and clear build/test framework for
>>>>>>>>>>> Dataflow and associated runtimes; creation of testing, rollback, and
>>>>>>>>>>> validation policy
>>>>>>>>>>>
>>>>>>>>>>> * Analysis and design for work needed to make Dataflow a better data
>>>>>>>>>>> processing abstraction layer for multiple open source frameworks and
>>>>>>>>>>> environments
>>>>>>>>>>>
>>>>>>>>>>> Finally, we have a number of intermediate-term goals:
>>>>>>>>>>>
>>>>>>>>>>> * Roadmapping, planning, and execution of integrations with other
>>>>>>>>>>> OSS and non-OSS projects/products
>>>>>>>>>>>
>>>>>>>>>>> * Inclusion of an additional SDK for Python, which is under active
>>>>>>>>>>> development
>>>>>>>>>>>
>>>>>>>>>>> == Current Status ==
>>>>>>>>>>>
>>>>>>>>>>> === Meritocracy ===
>>>>>>>>>>>
>>>>>>>>>>> Dataflow was initially developed based on ideas from many employees
>>>>>>>>>>> within Google. As an ASL OSS project on GitHub, the Dataflow SDK has
>>>>>>>>>>> received contributions from data Artisans, Cloudera Labs, and other
>>>>>>>>>>> individual developers. As a project under incubation, we are
>>>>>>>>>>> committed to expanding our effort to build an environment which
>>>>>>>>>>> supports a meritocracy. We are focused on engaging the community and
>>>>>>>>>>> other related projects for support and contributions. Moreover, we
>>>>>>>>>>> are committed to ensuring contributors and committers to Dataflow
>>>>>>>>>>> come from a broad mix of organizations through a merit-based
>>>>>>>>>>> decision process during incubation. We believe strongly in the
>>>>>>>>>>> Dataflow model and are committed to growing an inclusive community
>>>>>>>>>>> of Dataflow contributors.
>>>>>>>>>>>
>>>>>>>>>>> === Community ===
>>>>>>>>>>>
>>>>>>>>>>> The core of the Dataflow Java SDK has been developed by Google for
>>>>>>>>>>> use with Google Cloud Dataflow. Google has active community
>>>>>>>>>>> engagement in the SDK GitHub repository
>>>>>>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK), on Stack
>>>>>>>>>>> Overflow
>>>>>>>>>>> (http://stackoverflow.com/questions/tagged/google-cloud-dataflow),
>>>>>>>>>>> and has had contributions from a number of organizations and
>>>>>>>>>>> individuals. Every day, Cloud Dataflow is actively used by a number
>>>>>>>>>>> of organizations and institutions for batch and stream processing
>>>>>>>>>>> of data. We believe acceptance will allow us to consolidate existing
>>>>>>>>>>> Dataflow-related work, grow the Dataflow community, and deepen
>>>>>>>>>>> connections between Dataflow and other open source projects.
>>>>>>>>>>>
>>>>>>>>>>> === Core Developers ===
>>>>>>>>>>>
>>>>>>>>>>> The core developers for Dataflow and the Dataflow runners are:
>>>>>>>>>>>
>>>>>>>>>>> * Frances Perry
>>>>>>>>>>>
>>>>>>>>>>> * Tyler Akidau
>>>>>>>>>>>
>>>>>>>>>>> * Davor Bonaci
>>>>>>>>>>>
>>>>>>>>>>> * Luke Cwik
>>>>>>>>>>>
>>>>>>>>>>> * Ben Chambers
>>>>>>>>>>>
>>>>>>>>>>> * Kenn Knowles
>>>>>>>>>>>
>>>>>>>>>>> * Dan Halperin
>>>>>>>>>>>
>>>>>>>>>>> * Daniel Mills
>>>>>>>>>>>
>>>>>>>>>>> * Mark Shields
>>>>>>>>>>>
>>>>>>>>>>> * Craig Chambers
>>>>>>>>>>>
>>>>>>>>>>> * Maximilian Michels
>>>>>>>>>>>
>>>>>>>>>>> * Tom White
>>>>>>>>>>>
>>>>>>>>>>> * Josh Wills
>>>>>>>>>>>
>>>>>>>>>>> === Alignment ===
>>>>>>>>>>>
>>>>>>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can
>>>>>>>>>>> be executed on Apache Spark or Apache Flink. Dataflow is also
>>>>>>>>>>> related to other Apache projects, such as Apache Crunch. We plan on
>>>>>>>>>>> expanding functionality for Dataflow runners, support for additional
>>>>>>>>>>> domain specific languages, and increased portability so Dataflow is
>>>>>>>>>>> a powerful abstraction layer for data processing.
>>>>>>>>>>>
>>>>>>>>>>> == Known Risks ==
>>>>>>>>>>>
>>>>>>>>>>> === Orphaned Products ===
>>>>>>>>>>>
>>>>>>>>>>> The Dataflow SDK is presently used by several organizations, from
>>>>>>>>>>> small startups to Fortune 100 companies, to construct production
>>>>>>>>>>> pipelines which are executed in Google Cloud Dataflow. Google has a
>>>>>>>>>>> long-term commitment to advance the Dataflow SDK; moreover, Dataflow
>>>>>>>>>>> is seeing increasing interest, development, and adoption from
>>>>>>>>>>> organizations outside of Google.
>>>>>>>>>>>
>>>>>>>>>>> === Inexperience with Open Source ===
>>>>>>>>>>>
>>>>>>>>>>> Google believes strongly in open source and the exchange of
>>>>>>>>>>> information to advance new ideas and work. Examples of this
>>>>>>>>>>> commitment are active OSS projects such as Chromium
>>>>>>>>>>> (https://www.chromium.org) and Kubernetes (http://kubernetes.io/).
>>>>>>>>>>> With Dataflow, we have tried to be increasingly open and
>>>>>>>>>>> forward-looking; we have published a paper at the VLDB conference
>>>>>>>>>>> describing the Dataflow model
>>>>>>>>>>> (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to
>>>>>>>>>>> release the Dataflow SDK as open source software with the launch of
>>>>>>>>>>> Cloud Dataflow.
>>>>>>>>>>>
>>>>>>>>>>> Our submission to the Apache Software Foundation is a logical
>>>>>>>>>>> extension of our commitment to open source software.
>>>>>>>>>>>
>>>>>>>>>>> === Homogeneous Developers ===
>>>>>>>>>>>
>>>>>>>>>>> The majority of committers in this proposal belong to Google due to
>>>>>>>>>>> the fact that Dataflow has emerged from several internal Google
>>>>>>>>>>> projects. This proposal also includes committers outside of Google
>>>>>>>>>>> who are actively involved with other Apache projects, such as
>>>>>>>>>>> Hadoop, Flink, and
