Hi

On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
wrote:
>
> tl;dr: I would like to start a discussion about merging python-sdk branch
> to master branch. Python SDK is mature enough and merging it to master will
> accelerate its development and adoption.
>

Good point, Ahmet!

I've been closely following the development since it was imported in June.
For the prototypes I've implemented so far it works quite well; I guess we'd
just need to focus the next few months on adding support for more runners.

> With a great effort from a lot of contributors(*), Python SDK [1] is now a
> mostly complete, tested, performant Python implementation of the Beam
> model. Since June, when we first started with Python SDK in Apache Beam we
> have been continuously improving it.
>

I wouldn't merge during the preparation of the 0.5.0 release, but right
after that could be a good time to merge back into master.


> ** Python SDK currently supports:
>
> * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
> * IO: There are extensible APIs for writing new bounded sources and sinks.
> Implementations are provided for Text, Avro, BigQuery, and Datastore.
> * Runners: Python SDK has an extensible base runner module that allows
> building specific runners on top of it. The SDK comes with two pipeline
> runners: DirectRunner and DataflowRunner; and it is possible to add more.
> The existing runners are currently limited to bounded execution and
> otherwise equivalent to their Java SDK counterparts in functionality.
>

What would be the effort of porting and maintaining parallel versions of the
Java runners? I guess I'd need to dig deeper into the model, but this may
represent a major effort for the project, right?



> * Testing: Python SDK implements ValidatesRunner test framework for
> implementing integration tests for current and future runners. There is unit
> test coverage for all modules, and a number of integration tests for
> validating existing runners.
> * Documentation and examples: Documentation work has started on Python SDK.
> Beam Programming Guide page has been updated to include Python [2]. The
> code comes with many ready to use examples and we are in a good place to
> start documenting those on the website.
>
> ** We are not done yet, next on the roadmap we have:
>
> * Streaming: Both of the existing runners lack support for streaming
> execution, and currently there is work going on for adding streaming
> support to DirectRunner [3].
> * Documentation: Filling the rest of the Beam documentation with Python
> SDK specific information and examples.
> * SDK consistency: Making Python SDK consistent with the Java SDK. We have
> come a long way on this and have only a few items left [4].
> * Beamifying: We have been working on removing Dataflow-specific references
> both from the documentation and from the code. There is some work left, and
> we are currently working on those as well [5].
>
> ** Steps and implications of merging to master:
>
> * Master branch is merged to python-sdk branch at regular intervals and the
> last merge was on 12/22. All the past merges were uneventful because there
> is a minimal overlap in modified files between branches. Integrating
> python-sdk to master will similarly touch a small number of existing files.
>
> * Python SDK is using the same tools for building and testing. It is
> already integrated with Maven, Jenkins and Travis. Specifically the impact
> to the testing infrastructure would be:
> - There will be two additional test configurations in Travis. Since Travis
> runs all configurations in parallel there should not be a noticeable change
> in the Travis run time.
> - Jenkins pre-commit test will start running the Python SDK tests. It will
> add an additional 5 minutes to the completion time of pre-commit test.
> Historically Python SDK tests were not flaky and did not cause any random
> failures.
> - Jenkins Python post-commit test is already separated from the other
> post-commit tests and will continue to exist. It would not change the
> testing time for any other test.
>
> * The release process needs to be updated to accommodate releasing Python
> artifacts. Python SDK would fit in the existing release schedule and could
> be released along with the Java SDK. The additional steps would include:
> - Generating Python artifacts. This could be done with a single command
> using Maven today.
> - Publishing the artifacts to a central repository such as PyPI.
>

I'm more than happy to help on this. We purposely left some things open
when we added Maven support to the Python build.



> - Updating the release guide to reflect the changes above.
>
> * Users: There are existing users using the Python SDK. To give a rough
> estimate, a distribution of the Beam Python SDK had a total of 23K
> downloads in the past 6 months [6]. Some of those users are already engaged
> with the community (e.g. [7]). There might be an increased amount of
> engagement from the rest of them after the merge.
>

Python 3 support is something we definitely need to plan for. I'd try
to make the codebase compatible with both 2.7.x and 3.6.x, rather than
using other solutions like 2to3.
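
To illustrate what I mean by a single codebase, here is a minimal sketch. It
is not taken from the SDK: the helper names `to_text` and `scheme_of` are
hypothetical, and a real port would probably also lean on `__future__`
imports or a compatibility library. The idea is simply that the same file
runs unchanged on both interpreters, with no translation step.

```python
import sys

# Renamed stdlib modules: try the Python 3 location first, then fall back.
try:
    from urllib.parse import urlparse
except ImportError:  # Python 2.7
    from urlparse import urlparse

PY2 = sys.version_info[0] == 2

# `unicode` only exists on Python 2; the conditional expression never
# evaluates that name on Python 3.
text_type = unicode if PY2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return `value` as a text string on both Python 2.7 and 3.x.

    Hypothetical helper, for illustration only.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


def scheme_of(location):
    """Return the URL scheme of `location` as text (hypothetical helper)."""
    return to_text(urlparse(location).scheme)


if __name__ == "__main__":
    # print with a single parenthesized argument behaves the same on 2.7 and 3.x.
    print(scheme_of("gs://some-bucket/input.txt"))
```

The version checks stay in one small place, so the rest of the code can be
written once against `text_type` and `to_text`.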


> Looking forward to hearing your thoughts and comments on “graduating”
> python-sdk to the master.
>
> Thank you,
> Ahmet
>
> (*) Python SDK branch currently has a diverse group of contributors.
> Regular contributors include Charles Chen, Chamikara Jayalath, María García
> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> Younghee Kwon.
>
> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> [2] https://beam.apache.org/documentation/programming-guide/
> [3] https://issues.apache.org/jira/browse/BEAM-1265
> [4]
> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
> en%20AND%20labels%20%3D%20sdk-consistency
> [5] https://issues.apache.org/jira/browse/BEAM-1218
> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> [7] https://issues.apache.org/jira/browse/BEAM-1251
>


Great summary, Ahmet. Thanks.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co
