Re: [DISCUSS] Python SDK status and next steps

Kenneth Knowles Fri, 20 Jan 2017 10:06:50 -0800

To clarify the criterion implied by that last exchange: "An SDK should
have at least one runner that can execute the complete model (may be a
direct runner)"


I want to highlight this, because whether an _SDK_ supports unbounded data
is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph with
unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is no such
thing, and we don't want such a thing, so SDKs implementing the Fn API
automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is today,
to leverage idiomatic deserialized representations. The richness of this
shim will decrease so that it will need to "support" unbounded data but
that will be a ~one liner.
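
A toy sketch of the first point, under stated assumptions: the names below
(PCollection, read, apply_transform) are illustrative stand-ins and not the
SDK's actual API. It only shows that graph construction can carry boundedness
as a property of each node, so no separate code path is needed for unbounded
data.

```python
# Toy model only: PCollection, read, and apply_transform are hypothetical
# stand-ins for illustration, not the Beam Python SDK's real classes.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PCollection:
    name: str
    is_bounded: bool
    inputs: List["PCollection"] = field(default_factory=list)


def read(name, is_bounded):
    """Root node: boundedness is declared by the source."""
    return PCollection(name, is_bounded)


def apply_transform(name, *inputs):
    """Derived node: bounded only if every input is bounded."""
    bounded = all(p.is_bounded for p in inputs)
    return PCollection(name, bounded, list(inputs))


# The same construction calls are used for both kinds of input.
logs = read("ReadLogs", is_bounded=False)     # assumed streaming source
lookup = read("ReadLookup", is_bounded=True)  # assumed file-based source
parsed = apply_transform("Parse", logs)
joined = apply_transform("Join", parsed, lookup)
```

In this sketch the only thing that differs between bounded and unbounded
pipelines is a flag set at the source and propagated through the graph.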

Getting the Python SDK on master will accelerate our progress towards the
Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think the
criteria are written with the completed portability framework in mind. So
this exchange makes me actually more convinced we should merge python-sdk
to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
[email protected]> wrote:

> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> <[email protected]> wrote:
> > I do not think that Python SDK yet meets the bar [1] for implementing the
> > Beam model -- supporting Unbounded data is very important. That said,
> given
> > the committed and sustained set of contributors, it generally makes sense
> > to me to make an exception in anticipation of these features being
> fleshed
> > out soon; including potentially new users/contributors that would arrive
> > once in master.
> >
> > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > [email protected]
>
> That is a valid point. The Python SDK supports all the unbounded parts
> of the model except for unbounded sources, which was deferred while
> seeing how https://s.apache.org/splittable-do-fn played out. I've been
> working with the team and merging/reviewing most of their code, and
> have full confidence this will be coming (and on that note can vouch
> for a healthy community and support which are much harder to add
> later).
>
> In short, I think it has the required maturity, and I'm in favor of
> merging soonish.
>
> > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <[email protected]>
> > wrote:
> >
> >> Thank you all for the comments so far. I would follow the process as
> >> suggested by Davor and others in this thread.
> >>
> >> Ahmet
> >>
> >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <[email protected]>
> >> wrote:
> >>
> >> > Hi
> >> >
> >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <[email protected]
> >
> >> > wrote:
> >> > >
> >> > > tl;dr: I would like to start a discussion about merging python-sdk
> >> branch
> >> > > to master branch. Python SDK is mature enough and merging it to
> master
> >> > will
> >> > > accelerate its development and adoption.
> >> > >
> >> >
> >> > Good point, Ahmet!
> >> >
> >> > I've been closely following the development since it was imported in June.
> For
> >> > the prototypes I've implemented so far it works quite well; I guess
> we'd
> >> > just need to focus the next months on bringing support for more runners.
> >> >
> >> > With a great effort from a lot of contributors(*), Python SDK [1] is
> now
> >> a
> >> > > mostly complete, tested, performant Python implementation of the
> Beam
> >> > > model. Since June, when we first started with Python SDK in Apache
> Beam
> >> > we
> >> > > have been continuously improving it.
> >> > >
> >> >
> >> > I wouldn't merge during the preparation of 0.5.0 release, but after
> that
> >> > could be a good time to merge back into master.
> >> >
> >> >
> >> > ** Python SDK currently supports:
> >> > >
> >> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> >> > etc.).
> >> > > * IO: There are extensible APIs for writing new bounded sources and
> >> > sinks.
> >> > > Implementations are provided for Text, Avro, BigQuery, and
> Datastore.
> >> > > * Runners: Python SDK has an extensible base runner module that
> allows
> >> > > building specific runners on top of it. The SDK comes with two
> pipeline
> >> > > runners: DirectRunner and DataflowRunner; and it is possible to add
> >> more.
> >> > > The existing runners are currently limited to bounded execution and
> >> > > otherwise equivalent to their Java SDK counterparts in
> functionality.
> >> > >
> >> >
> >> > What would be the effort of porting, and maintaining, parallel versions
> of
> >> the
> >> > Java runners? I guess I'd need to dig deeper in the model, but this
> may
> >> > represent a major effort for the project, right?
> >> >
> >>
> >> It is somewhat higher for DirectRunner because DirectRunner also
> implements
> >> the code for execution. It is not that high for DataflowRunner because
> the
> >> base runner module has a lot of helpers with the right hooks for
> >> implementing a generic runner. I would _expect_ the experience in
> general
> >> would be similar to the latter.
> >>
> >>
> >> >
> >> >
> >> >
> >> > > * Testing: Python SDK implements ValidatesRunner test framework for
> >> > > implementing integration tests for current and future runners. There
> is
> >> > unit
> >> > > test coverage for all modules, and a number of integration tests for
> >> > > validating existing runners.
> >> > > * Documentation and examples: Documentation work has started on
> Python
> >> > SDK.
> >> > > Beam Programming Guide page has been updated to include Python [2].
> The
> >> > > code comes with many ready to use examples and we are in a good
> place
> >> to
> >> > > start documenting those on the website.
> >> > >
> >> > > ** We are not done yet, next on the roadmap we have:
> >> > >
> >> > > * Streaming: Both of the existing runners lack support for streaming
> >> > > execution, and currently there is work going on for adding streaming
> >> > > support to DirectRunner [3].
> >> > > * Documentation: Filling the rest of the Beam documentation with
> >> Python
> >> > > SDK specific information and examples.
> >> > > * SDK consistency: Making Python SDK consistent with the Java SDK.
> We
> >> > have
> >> > > come a long way on this and have only a few items left [4].
> >> > > * Beamifying: We have been working on removing Dataflow-specific
> >> > references
> >> > > both from the documentation and from the code. There is some work
> left,
> >> > and
> >> > > we are currently working on those as well [5].
> >> > >
> >> > > ** Steps and implications of merging to master:
> >> > >
> >> > > * Master branch is merged to python-sdk branch at regular intervals
> and
> >> > the
> >> > > last merge was on 12/22. All the past merges were uneventful because
> >> > there
> >> > > is a minimal overlap in modified files between branches. Integrating
> >> > > python-sdk to master will similarly touch a small number of existing
> >> > files.
> >> > >
> >> > > * Python SDK is using the same tools for building and testing. It is
> >> > > already integrated with Maven, Jenkins and Travis. Specifically the
> >> > impact
> >> > > to the testing infrastructure would be:
> >> > > - There will be two additional test configurations in Travis. Since
> >> > Travis
> >> > > runs all configurations in parallel there should not be a noticeable
> >> > change
> >> > > in the Travis run time.
> >> > > - Jenkins pre-commit test will start running the Python SDK tests.
> It
> >> > will
> >> > > add an additional 5 minutes to the completion time of pre-commit
> test.
> >> > > Historically Python SDK tests were not flaky and did not cause any
> >> random
> >> > > failures.
> >> > > - Jenkins Python post-commit test is already separated from the
> other
> >> > > post-commit tests and will continue to exist. It would not change
> the
> >> > > testing time for any other test.
> >> > >
> >> > > * The release process needs to be updated to accommodate releasing
> >> Python
> >> > > artifacts. Python SDK would fit in the existing release schedule and
> >> > could
> >> > > be released along with the Java SDK. The additional steps would
> >> include:
> >> > > - Generating Python artifacts. This could be done with a single
> command
> >> > > using Maven today.
> >> > > - Publishing the artifacts to a central repository such as PyPI.
> >> > >
> >> >
> >> > I'm more than happy to help on this. We left on purpose some things
> open
> >> > when we added Maven support to the Python build.
> >> >
> >>
> >> That would be awesome. We can coordinate on that post-merge.
> >>
> >>
> >> >
> >> >
> >> >
> >> > > - Updating the release guide to reflect the changes above.
> >> > >
> >> > > * Users: There are existing users using the Python SDK. To give a
> rough
> >> > > estimate, a distribution of the Beam Python SDK had a total of 23K
> >> > > downloads in the past 6 months [6]. Some of those users are already
> >> > engaged
> >> > > with the community (e.g. [7]). There might be an increased amount of
> >> > > engagement from the rest of them after the merge.
> >> > >
> >> >
> >> > Python 3 support is something we definitely need to look ahead to. I'd
> try
> >> > to make the codebase compatible with both 2.7.x and 3.6.x, rather than
> >> > using other solutions like 2to3.
> >> >
> >>
> >> I agree with you. I think it makes more sense to make the codebase
> compatible
> >> with both. As you mentioned Python 3 support is not a short-term goal in
> >> the roadmap, and we can discuss it more as we approach that.
> >>
> >>
> >> >
> >> >
> >> > Looking forward to hearing your thoughts and comments on “graduating”
> >> > > python-sdk to the master.
> >> > >
> >> > > Thank you,
> >> > > Ahmet
> >> > >
> >> > > (*) Python SDK branch currently has a diverse group of contributors.
> >> > > Regular contributors include Charles Chen, Chamikara Jayalath, María
> >> > García
> >> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> >> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
> >> from
> >> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> >> > > Younghee Kwon.
> >> > >
> >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> >> > > [2] https://beam.apache.org/documentation/programming-guide/
> >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> >> > > [4]
> >> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
> >> > > en%20AND%20labels%20%3D%20sdk-consistency
> >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> >> > >
> >> >
> >> >
> >> > Great summary, Ahmet. Thanks.
> >> >
> >> > Cheers,
> >> >
> >> > --
> >> > Sergio Fernández
> >> > Partner Technology Manager
> >> > Redlink GmbH
> >> > m: +43 6602747925
> >> > e: [email protected]
> >> > w: http://redlink.co
> >> >
> >>
>
