Hi,

I haven't tried the Python SDK recently, but you provided a clear "state of the art". Anyway, I'm in favor of merging things as quickly as possible (assuming it's in good shape in terms of build, tests, ...): it could potentially increase "external" contributions.
So +1 from my side.

Regards
JB

On Jan 17, 2017, at 08:22, Ahmet Altay <al...@google.com.INVALID> wrote:
>Hi all,
>
>tl;dr: I would like to start a discussion about merging python-sdk branch to master branch. Python SDK is mature enough and merging it to master will accelerate its development and adoption.
>
>With a great effort from a lot of contributors (*), Python SDK [1] is now a mostly complete, tested, performant Python implementation of the Beam model. Since June, when we first started with Python SDK in Apache Beam, we have been continuously improving it.
>
>** Python SDK currently supports:
>
>* Model: All main concepts are present (ParDo, GroupByKey, Windowing, etc.).
>* IO: There are extensible APIs for writing new bounded sources and sinks. Implementations are provided for Text, Avro, BigQuery, and Datastore.
>* Runners: Python SDK has an extensible base runner module that allows building specific runners on top of it. The SDK comes with two pipeline runners: DirectRunner and DataflowRunner; and it is possible to add more. The existing runners are currently limited to bounded execution and otherwise equivalent to their Java SDK counterparts in functionality.
>* Testing: Python SDK implements the ValidatesRunner test framework for implementing integration tests for current and future runners. There is unit test coverage for all modules, and a number of integration tests for validating existing runners.
>* Documentation and examples: Documentation work has started on Python SDK. The Beam Programming Guide page has been updated to include Python [2]. The code comes with many ready-to-use examples and we are in a good place to start documenting those on the website.
>
>** We are not done yet, next on the roadmap we have:
>
>* Streaming: Both of the existing runners lack support for streaming execution, and currently there is work going on for adding streaming support to DirectRunner [3].
>* Documentation: Filling the rest of the Beam documentation with Python SDK specific information and examples.
>* SDK consistency: Making Python SDK consistent with the Java SDK. We have come a long way on this and have only a few items left [4].
>* Beamifying: We have been working on removing Dataflow-specific references both from the documentation and from the code. There is some work left, and we are currently working on those as well [5].
>
>** Steps and implications of merging to master:
>
>* Master branch is merged to python-sdk branch at regular intervals and the last merge was on 12/22. All the past merges were uneventful because there is a minimal overlap in modified files between branches. Integrating python-sdk to master will similarly touch a small number of existing files.
>
>* Python SDK is using the same tools for building and testing. It is already integrated with Maven, Jenkins and Travis. Specifically, the impact to the testing infrastructure would be:
>- There will be two additional test configurations in Travis. Since Travis runs all configurations in parallel, there should not be a noticeable change in the Travis run time.
>- Jenkins pre-commit test will start running the Python SDK tests. It will add an additional 5 minutes to the completion time of pre-commit test. Historically, Python SDK tests were not flaky and did not cause any random failures.
>- Jenkins Python post-commit test is already separated from the other post-commit tests and will continue to exist. It would not change the testing time for any other test.
>
>* The release process needs to be updated to accommodate releasing Python artifacts. Python SDK would fit in the existing release schedule and could be released along with the Java SDK. The additional steps would include:
>- Generating Python artifacts. This could be done with a single command using Maven today.
>- Publishing the artifacts to a central repository such as PyPI.
>- Updating the release guide to reflect the changes above.
>
>* Users: There are existing users using the Python SDK. To give a rough estimate, a distribution of the Beam Python SDK had a total of 23K downloads in the past 6 months [6]. Some of those users are already engaged with the community (e.g. [7]). There might be an increased amount of engagement from the rest of them after the merge.
>
>Looking forward to hearing your thoughts and comments on “graduating” python-sdk to the master.
>
>Thank you,
>Ahmet
>
>(*) Python SDK branch currently has a diverse group of contributors. Regular contributors include Charles Chen, Chamikara Jayalath, María García Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC), Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and Younghee Kwon.
>
>[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>[2] https://beam.apache.org/documentation/programming-guide/
>[3] https://issues.apache.org/jira/browse/BEAM-1265
>[4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>[5] https://issues.apache.org/jira/browse/BEAM-1218
>[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>[7] https://issues.apache.org/jira/browse/BEAM-1251