Thank you all for the comments so far. I will follow the process
suggested by Davor and others in this thread.

Ahmet

On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
wrote:

> Hi
>
> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
> wrote:
> >
> > tl;dr: I would like to start a discussion about merging python-sdk branch
> > to master branch. Python SDK is mature enough and merging it to master
> will
> > accelerate its development and adoption.
> >
>
> Good point, Ahmet!
>
> I've been closely following the development since it was imported in June.
> For the prototypes I've implemented so far it works quite well; I guess we'd
> just need to focus the next months on bringing more runner support.
>
> > With a great effort from a lot of contributors(*), Python SDK [1] is now a
> > mostly complete, tested, performant Python implementation of the Beam
> > model. Since June, when we first started with Python SDK in Apache Beam
> we
> > have been continuously improving it.
> >
>
> I wouldn't merge during the preparation of 0.5.0 release, but after that
> could be a good time to merge back into master.
>
>
> > ** Python SDK currently supports:
> >
> > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> etc.).
> > * IO: There are extensible APIs for writing new bounded sources and
> sinks.
> > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > * Runners: Python SDK has an extensible base runner module that allows
> > building specific runners on top of it. The SDK comes with two pipeline
> > runners: DirectRunner and DataflowRunner; and it is possible to add more.
> > The existing runners are currently limited to bounded execution and
> > otherwise equivalent to their Java SDK counterparts in functionality.
> >
>
> What would be the effort of porting, and maintaining, parallel versions of
> the Java runners? I guess I'd need to dig deeper into the model, but this may
> represent a major effort for the project, right?
>

The effort is somewhat higher for DirectRunner because DirectRunner also
implements the code for execution. It is not that high for DataflowRunner
because the base runner module has a lot of helpers with the right hooks for
implementing a generic runner. I would _expect_ the experience for a new
runner to be similar to the latter.


>
>
>
> > * Testing: Python SDK implements the ValidatesRunner test framework for
> > implementing integration tests for current and future runners. There is
> > unit test coverage for all modules, and a number of integration tests for
> > validating existing runners.
> > * Documentation and examples: Documentation work has started on Python
> SDK.
> > Beam Programming Guide page has been updated to include Python [2]. The
> > code comes with many ready to use examples and we are in a good place to
> > start documenting those on the website.
> >
> > ** We are not done yet, next on the roadmap we have:
> >
> > * Streaming: Both of the existing runners lack support for streaming
> > execution, and currently there is work going on for adding streaming
> > support to DirectRunner [3].
> > * Documentation: Filling in the rest of the Beam documentation with Python
> > SDK specific information and examples.
> > * SDK consistency: Making Python SDK consistent with the Java SDK. We
> have
> > come a long way on this and have only a few items left [4].
> > * Beamifying: We have been working on removing Dataflow-specific
> references
> > both from the documentation and from the code. There is some work left,
> and
> > we are currently working on those as well [5].
> >
> > ** Steps and implications of merging to master:
> >
> > * Master branch is merged to python-sdk branch at regular intervals and
> the
> > last merge was on 12/22. All the past merges were uneventful because
> there
> > is a minimal overlap in modified files between branches. Integrating
> > python-sdk to master will similarly touch a small number of existing
> files.
> >
> > * Python SDK is using the same tools for building and testing. It is
> > already integrated with Maven, Jenkins and Travis. Specifically the
> impact
> > to the testing infrastructure would be:
> > - There will be two additional test configurations in Travis. Since
> Travis
> > runs all configurations in parallel there should not be a noticeable
> change
> > in the Travis run time.
> > - Jenkins pre-commit test will start running the Python SDK tests. It
> will
> > add an additional 5 minutes to the completion time of pre-commit test.
> > Historically Python SDK tests were not flaky and did not cause any random
> > failures.
> > - Jenkins Python post-commit test is already separated from the other
> > post-commit tests and will continue to exist. It would not change the
> > testing time for any other test.
> >
> > * The release process needs to be updated to accommodate releasing Python
> > artifacts. Python SDK would fit in the existing release schedule and
> could
> > be released along with the Java SDK. The additional steps would include:
> > - Generating Python artifacts. This could be done with a single command
> > using Maven today.
> > - Publishing the artifacts to a central repository such as PyPI.
> >
>
> I'm more than happy to help on this. We left on purpose some things open
> when we added Maven support to the Python build.
>

That would be awesome. We can coordinate on that post-merge.


>
>
>
> > - Updating the release guide to reflect the changes above.
> >
> > * Users: There are existing users using the Python SDK. To give a rough
> > estimate, a distribution of the Beam Python SDK had a total of 23K
> > downloads in the past 6 months [6]. Some of those users are already
> engaged
> > with the community (e.g. [7]). There might be an increased amount of
> > engagement from the rest of them after the merge.
> >
>
> Python 3 support is something we definitely need to look ahead to. I'd try
> to make the codebase compatible with both 2.7.x and 3.6.x, rather than
> using other solutions like 2to3.
>

I agree with you. I think it makes more sense to make the codebase compatible
with both. As you mentioned, Python 3 support is not a short-term goal in
the roadmap, and we can discuss it more as we approach that.
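
For illustration only, here is a minimal sketch of what "compatible with both"
could look like in a single source file, as opposed to generating Python 3
code with 2to3. None of this is taken from the Python SDK; the names PY2,
text_type, to_text and word_lengths are hypothetical and exist only for this
example.

```python
# Single-source sketch: the same file is meant to run on 2.7.x and 3.x.
# All names below are hypothetical, not part of the Beam Python SDK.
from __future__ import absolute_import, division, print_function, unicode_literals

import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return value as a text string on either interpreter."""
    if isinstance(value, bytes):
        # On Python 2, bytes is an alias of str, so native strings land here.
        return value.decode(encoding)
    return text_type(value)


def word_lengths(words):
    """Map each word (bytes or text) to its length in characters."""
    return {to_text(w): len(to_text(w)) for w in words}


if __name__ == "__main__":
    # print() is a function and / is true division on both versions,
    # thanks to the __future__ imports at the top.
    print(word_lengths([b"par", "do"]))
    print(7 / 2)
```

The idea is that version differences are isolated in a few small helpers, so
the rest of the code does not need to branch on the interpreter version.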


>
>
> > Looking forward to hearing your thoughts and comments on “graduating”
> > python-sdk to the master.
> >
> > Thank you,
> > Ahmet
> >
> > (*) Python SDK branch currently has a diverse group of contributors.
> > Regular contributors include Charles Chen, Chamikara Jayalath, María
> García
> > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
> > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> > Younghee Kwon.
> >
> > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > [2] https://beam.apache.org/documentation/programming-guide/
> > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > [7] https://issues.apache.org/jira/browse/BEAM-1251
> >
>
>
> Great summary, Ahmet. Thanks.
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>
