There have been multiple cases where people changed Beam and ended up
breaking the Dataflow runner because its code lived in a private
repository. I believe that putting the Dataflow runner code in the public
repository will make it easier to change Apache Beam.


On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik <> wrote:

> At Google we have been importing the Apache Beam code base and integrating
> it with the Google portion of the codebase that supports the Dataflow
> worker. This process is painful as we are regularly making breaking API
> changes to support libraries related to running portable pipelines (and
> sometimes in other places as well). This has sometimes made it difficult
> for PRs to land changes without either breaking something for Google or
> waiting for a Googler to make the change internally (e.g. dependency
> updates).
> This code is very similar to the other integrations that exist for runners
> such as Flink/Spark/Apex/Samza. It is an adaptation layer that sits on top of
> an execution engine. There is no super secret awesome stuff as this code
> was already publicly visible in the past when it was part of the Google
> Cloud Dataflow GitHub repo[1].
> Process-wise, the code will need approval from Google to be donated and
> will need to go through the code donation process, but before we attempt
> that, I was wondering whether the community would object to adding this
> code to the master branch?
> The upside is that people can make breaking changes and fix them for all
> runners. It will also help Googlers contribute more to the portability
> story as it will remove the burden of doing the code import (wasted time)
> and it will allow people to develop in master (can have the whole project
> loaded in a single IDE).
> The downside is that this will represent more code and unit tests to
> support.
> 1:
