Hi

I haven't tried the Python SDK recently, but you provided a clear "state of
the art". Anyway, I'm in favor of merging things as quickly as possible
(assuming it's in good shape in terms of build, tests, ...): it would
potentially increase "external" contributions.

So +1 from my side.

Regards
JB

On Jan 17, 2017, at 08:22, Ahmet Altay <al...@google.com.INVALID> wrote:
>Hi all,
>
>tl;dr: I would like to start a discussion about merging the python-sdk
>branch to the master branch. The Python SDK is mature enough, and
>merging it to master will accelerate its development and adoption.
>
>With a great effort from a lot of contributors(*), Python SDK [1] is
>now a mostly complete, tested, performant Python implementation of the
>Beam model. Since June, when we first started with Python SDK in Apache
>Beam we have been continuously improving it.
>
>** Python SDK currently supports:
>
>* Model: All main concepts are present (ParDo, GroupByKey, Windowing
>etc.).
>* IO: There are extensible APIs for writing new bounded sources and
>sinks. Implementations are provided for Text, Avro, BigQuery, and
>Datastore.
>* Runners: Python SDK has an extensible base runner module that allows
>building specific runners on top of it. The SDK comes with two pipeline
>runners, DirectRunner and DataflowRunner, and it is possible to add
>more. The existing runners are currently limited to bounded execution
>but are otherwise equivalent to their Java SDK counterparts in
>functionality.
>* Testing: Python SDK implements the ValidatesRunner test framework
>for implementing integration tests for current and future runners.
>There is unit test coverage for all modules, and a number of
>integration tests for validating existing runners.
>* Documentation and examples: Documentation work has started on Python
>SDK. The Beam Programming Guide page has been updated to include
>Python [2]. The code comes with many ready-to-use examples and we are
>in a good place to start documenting those on the website.
>
>** We are not done yet, next on the roadmap we have:
>
>* Streaming: Both of the existing runners lack support for streaming
>execution, and currently there is work going on for adding streaming
>support to DirectRunner [3].
>* Documentation: Filling in the rest of the Beam documentation with
>Python SDK specific information and examples.
>* SDK consistency: Making Python SDK consistent with the Java SDK. We
>have come a long way on this and have only a few items left [4].
>* Beamifying: We have been working on removing Dataflow-specific
>references both from the documentation and from the code. There is
>some work left, and we are currently working on it as well [5].
>
>** Steps and implications of merging to master:
>
>* Master branch is merged to python-sdk branch at regular intervals and
>the last merge was on 12/22. All the past merges were uneventful
>because there is a minimal overlap in modified files between branches.
>Integrating python-sdk to master will similarly touch a small number of
>existing files.
>
>* Python SDK is using the same tools for building and testing. It is
>already integrated with Maven, Jenkins and Travis. Specifically the
>impact to the testing infrastructure would be:
>- There will be two additional test configurations in Travis. Since
>Travis runs all configurations in parallel there should not be a
>noticeable change in the Travis run time.
>- Jenkins pre-commit test will start running the Python SDK tests. It
>will add an additional 5 minutes to the completion time of pre-commit
>test. Historically Python SDK tests were not flaky and did not cause
>any random failures.
>- Jenkins Python post-commit test is already separated from the other
>post-commit tests and will continue to exist. It would not change the
>testing time for any other test.
>
>* The release process needs to be updated to accommodate releasing
>Python artifacts. Python SDK would fit in the existing release schedule
>and could be released along with the Java SDK. The additional steps
>would include:
>- Generating Python artifacts. This could be done with a single command
>using Maven today.
>- Publishing the artifacts to a central repository such as PyPI.
>- Updating the release guide to reflect the changes above.
>
>* Users: There are existing users using the Python SDK. To give a rough
>estimate, a distribution of the Beam Python SDK had a total of 23K
>downloads in the past 6 months [6]. Some of those users are already
>engaged with the community (e.g. [7]). There might be increased
>engagement from the rest of them after the merge.
>
>Looking forward to hearing your thoughts and comments on “graduating”
>python-sdk to the master.
>
>Thank you,
>Ahmet
>
>(*) Python SDK branch currently has a diverse group of contributors.
>Regular contributors include Charles Chen, Chamikara Jayalath, María
>García Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>PMC), Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>contributions from Abdullah Bashir, Marco Buccini, Sergio Fernández,
>Seunghyun Lee, and Younghee Kwon.
>
>[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>[2] https://beam.apache.org/documentation/programming-guide/
>[3] https://issues.apache.org/jira/browse/BEAM-1265
>[4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>[5] https://issues.apache.org/jira/browse/BEAM-1218
>[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>[7] https://issues.apache.org/jira/browse/BEAM-1251
