Re: [DISCUSS] Python SDK status and next steps

Prabeesh K. Tue, 31 Jan 2017 00:33:53 -0800

https://issues.apache.org/jira/browse/BEAM-1360


On 31 January 2017 at 12:12, Prabeesh K. <[email protected]> wrote:

> https://issues.apache.org/jira/browse/BAHIR-86
>
> On 31 January 2017 at 11:10, Ahmet Altay <[email protected]> wrote:
>
>> Hi all,
>>
>> This merge is completed. Python SDK is now officially part of the master
>> branch! Thank you all for the support. Please open an issue if you notice
>> a reference to the now obsolete python-sdk branch in the documentation.
>>
>> There will not be any more merges to the python-sdk branch. Going forward
>> please use the master branch for Python SDK development. There are a few
>> existing open PRs to the python-sdk [1]. If you are the author of one of
>> those PRs, please rebase them on top of master.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>
>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <[email protected]
>> >
>> wrote:
>>
>> > To clarify the implied criteria of that last exchange, it is "An SDK
>> should
>> > have at least one runner that can execute the complete model (may be a
>> > direct runner)"
>> >
>> > I want to highlight this, because whether an _SDK_ supports unbounded
>> data
>> > is not particularly well-defined, and will evolve:
>> >
>> >  - With the Runner API, an SDK will need to support building a graph
>> with
>> > unbounded constructs, as today with probably minimal changes.
>> >
>> >  - With the Fn API, if any part of the Fn API is specific to unbounded
>> > data, the SDK will need to implement it. I think right now there is no
>> such
>> > thing, and we don't want such a thing, so SDKs implementing the Fn API
>> > automatically support unbounded data.
>> >
>> >  - There will also likely be an SDK-specific shim just as there is
>> today,
>> > to leverage idiomatic deserialized representations. The richness of this
>> > shim will decrease so that it will need to "support" unbounded data but
>> > that will be a ~one liner.
>> >
>> > Getting the Python SDK on master will accelerate our progress towards
>> the
>> > Fn API - partly technical, partly community - which is the best path
>> > towards support for unbounded data across multiple runners. I think the
>> > criteria are written with the completed portability framework in mind.
>> So
>> > this exchange makes me actually more convinced we should merge
>> python-sdk
>> > to master.
>> >
>> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>> > [email protected]> wrote:
>> >
>> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>> > > <[email protected]> wrote:
>> > > > I do not think that Python SDK yet meets the bar [1] for
>> implementing
>> > the
>> > > > Beam model -- supporting Unbounded data is very important. That
>> said,
>> > > given
>> > > > the committed and sustained set of contributors, it generally makes
>> > sense
>> > > > to me to make an exception in anticipation of these features being
>> > > fleshed
>> > > > out soon; including potentially new users/contributors that would
>> > arrive
>> > > > once in master.
>> > > >
>> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>> > > > [email protected]
>> > >
>> > > That is a valid point. The Python SDK supports all the unbounded parts
>> > > of the model except for unbounded sources, which was deferred while
>> > > seeing how https://s.apache.org/splittable-do-fn played out. I've
>> been
>> > > working with the team and merging/reviewing most of their code, and
>> > > have full confidence this will be coming (and on that note can vouch
>> > > for a healthy community and support which are much harder to add
>> > > later).
>> > >
>> > > In short, I think it has the required maturity, and I'm in favor of
>> > > merging soonish.
>> > >
>> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>> <[email protected]
>> > >
>> > > > wrote:
>> > > >
>> > > >> Thank you all for the comments so far. I would follow the process
>> as
>> > > >> suggested by Davor and others in this thread.
>> > > >>
>> > > >> Ahmet
>> > > >>
>> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>> [email protected]
>> > >
>> > > >> wrote:
>> > > >>
>> > > >> > Hi
>> > > >> >
>> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>> > <[email protected]
>> > > >
>> > > >> > wrote:
>> > > >> > >
>> > > >> > > tl;dr: I would like to start a discussion about merging
>> python-sdk
>> > > >> branch
>> > > >> > > to master branch. Python SDK is mature enough and merging it to
>> > > master
>> > > >> > will
>> > > >> > > accelerate its development and adoption.
>> > > >> > >
>> > > >> >
>> > > >> > Good point, Ahmet!
>> > > >> >
>> > > >> > I've been closely following the development since it was imported in
>> June.
>> > > For
>> > > >> > the prototypes I've implemented so far it works quite well; I
>> guess
>> > > we'd
>> > > >> > just need to focus the next months on bringing support for more
>> runners.
>> > > >> >
>> > > >> > With a great effort from a lot of contributors(*), Python SDK
>> [1] is
>> > > now
>> > > >> a
>> > > >> > > mostly complete, tested, performant Python implementation of
>> the
>> > > Beam
>> > > >> > > model. Since June, when we first started with Python SDK in
>> Apache
>> > > Beam
>> > > >> > we
>> > > >> > > have been continuously improving it.
>> > > >> > >
>> > > >> >
>> > > >> > I wouldn't merge during the preparation of 0.5.0 release, but
>> after
>> > > that
>> > > >> > could be a good time to merge back into master.
>> > > >> >
>> > > >> >
>> > > >> > ** Python SDK currently supports:
>> > > >> > >
>> > > >> > > * Model: All main concepts are present (ParDo, GroupByKey,
>> > Windowing
>> > > >> > etc.).
>> > > >> > > * IO: There are extensible APIs for writing new bounded sources
>> > and
>> > > >> > sinks.
>> > > >> > > Implementations are provided for Text, Avro, BigQuery, and
>> > > Datastore.
>> > > >> > > * Runners: Python SDK has an extensible base runner module that
>> > > allows
>> > > >> > > building specific runners on top of it. The SDK comes with two
>> > > pipeline
>> > > >> > > runners: DirectRunner and DataflowRunner; and it is possible to
>> > add
>> > > >> more.
>> > > >> > > The existing runners are currently limited to bounded execution
>> > and
>> > > >> > > otherwise equivalent to their Java SDK counterparts in
>> > > functionality.
>> > > >> > >
>> > > >> >
>> > > >> > What would be the effort of porting, and maintaining, parallel
>> versions
>> > > of
>> > > >> the
>> > > >> > Java runners? I guess I'd need to dig deeper in the model, but
>> this
>> > > may
>> > > >> > represent a major effort for the project, right?
>> > > >> >
>> > > >>
>> > > >> It is somewhat higher for DirectRunner because DirectRunner also
>> > > implements
>> > > >> the code for execution. It is not that high for DataflowRunner
>> because
>> > > the
>> > > >> base runner module has a lot of helpers with the right hooks for
>> > > >> implementing a generic runner. I would _expect_ the experience in
>> > > general
>> > > >> would be similar to the latter.
>> > > >>
>> > > >>
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > > * Testing: Python SDK implements ValidatesRunner test framework
>> > for
>> > > >> > > implementing integration tests for current and future runners.
>> > There
>> > > is
>> > > >> > unit
>> > > >> > > test coverage for all modules, and a number of integration
>> tests
>> > for
>> > > >> > > validating existing runners.
>> > > >> > > * Documentation and examples: Documentation work has started on
>> > > Python
>> > > >> > SDK.
>> > > >> > > Beam Programming Guide page has been updated to include Python
>> > [2].
>> > > The
>> > > >> > > code comes with many ready to use examples and we are in a good
>> > > place
>> > > >> to
>> > > >> > > start documenting those on the website.
>> > > >> > >
>> > > >> > > ** We are not done yet, next on the roadmap we have:
>> > > >> > >
>> > > >> > > * Streaming: Both of the existing runners lack support for
>> > streaming
>> > > >> > > execution, and currently there is work going on for adding
>> > streaming
>> > > >> > > support to DirectRunner [3].
>> > > >> > > * Documentation: Filling the rest of the Beam documentation
>> with
>> > > >> Python
>> > > >> > > SDK specific information and examples.
>> > > >> > > * SDK consistency: Making Python SDK consistent with the Java
>> SDK.
>> > > We
>> > > >> > have
>> > > >> > > come a long way on this and have only a few items left [4].
>> > > >> > > * Beamifying: We have been working on removing
>> Dataflow-specific
>> > > >> > references
>> > > >> > > both from the documentation and from the code. There is some
>> work
>> > > left,
>> > > >> > and
>> > > >> > > we are currently working on those as well [5].
>> > > >> > >
>> > > >> > > ** Steps and implications of merging to master:
>> > > >> > >
>> > > >> > > * Master branch is merged to python-sdk branch at regular
>> > intervals
>> > > and
>> > > >> > the
>> > > >> > > last merge was on 12/22. All the past merges were uneventful
>> > because
>> > > >> > there
>> > > >> > > is a minimal overlap in modified files between branches.
>> > Integrating
>> > > >> > > python-sdk to master will similarly touch a small number of
>> > existing
>> > > >> > files.
>> > > >> > >
>> > > >> > > * Python SDK is using the same tools for building and testing.
>> It
>> > is
>> > > >> > > already integrated with Maven, Jenkins and Travis. Specifically
>> > the
>> > > >> > impact
>> > > >> > > to the testing infrastructure would be:
>> > > >> > > - There will be two additional test configurations in Travis.
>> > Since
>> > > >> > Travis
>> > > >> > > runs all configurations in parallel there should not be a
>> > noticeable
>> > > >> > change
>> > > >> > > in the Travis run time.
>> > > >> > > - Jenkins pre-commit test will start running the Python SDK
>> tests.
>> > > It
>> > > >> > will
>> > > >> > > add an additional 5 minutes to the completion time of
>> pre-commit
>> > > test.
>> > > >> > > Historically Python SDK tests were not flaky and did not cause
>> any
>> > > >> random
>> > > >> > > failures.
>> > > >> > > - Jenkins Python post-commit test is already separated from the
>> > > other
>> > > >> > > post-commit tests and will continue to exist. It would not
>> change
>> > > the
>> > > >> > > testing time for any other test.
>> > > >> > >
>> > > >> > > * The release process needs to be updated to accommodate
>> releasing
>> > > >> Python
>> > > >> > > artifacts. Python SDK would fit in the existing release
>> schedule
>> > and
>> > > >> > could
>> > > >> > > be released along with the Java SDK. The additional steps would
>> > > >> include:
>> > > >> > > - Generating Python artifacts. This could be done with a single
>> > > command
>> > > >> > > using Maven today.
>> > > >> > > - Publishing the artifacts to a central repository such as
>> PyPI.
>> > > >> > >
>> > > >> >
>> > > >> > I'm more than happy to help on this. We left on purpose some
>> things
>> > > open
>> > > >> > when we added Maven support to the Python build.
>> > > >> >
>> > > >>
>> > > >> That would be awesome. We can coordinate on that post-merge.
>> > > >>
>> > > >>
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > > - Updating the release guide to reflect the changes above.
>> > > >> > >
>> > > >> > > * Users: There are existing users using the Python SDK. To
>> give a
>> > > rough
>> > > >> > > estimate, a distribution of the Beam Python SDK had a total of
>> 23K
>> > > >> > > downloads in the past 6 months [6]. Some of those users are
>> > already
>> > > >> > engaged
>> > > >> > > with the community (e.g. [7]). There might be an increased
>> amount of
>> > > >> > > engagement from the rest of them after the merge.
>> > > >> > >
>> > > >> >
>> > > >> > Python 3 support is something we definitely need to plan for.
>> > I'd
>> > > try
>> > > >> > to make the codebase compatible with both 2.7.x and 3.6.x, rather
>> > than
>> > > >> > using other solutions like 2to3.
>> > > >> >
>> > > >>
>> > > >> I agree with you. I think it makes more sense to make the codebase
>> > > compatible
>> > > >> with both. As you mentioned Python 3 support is not a short-term
>> goal
>> > in
>> > > >> the roadmap, and we can discuss it more as we approach that.
>> > > >>
>> > > >>
>> > > >> >
>> > > >> >
>> > > >> > Looking forward to hearing your thoughts and comments on
>> > “graduating”
>> > > >> > > python-sdk to the master.
>> > > >> > >
>> > > >> > > Thank you,
>> > > >> > > Ahmet
>> > > >> > >
>> > > >> > > (*) Python SDK branch currently has a diverse group of
>> > contributors.
>> > > >> > > Regular contributors include Charles Chen, Chamikara Jayalath,
>> > María
>> > > >> > García
>> > > >> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>> > PMC),
>> > > >> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>> > contributions
>> > > >> from
>> > > >> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun
>> Lee,
>> > and
>> > > >> > > Younghee Kwon.
>> > > >> > >
>> > > >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>> > > >> > > [2] https://beam.apache.org/documentation/programming-guide/
>> > > >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
>> > > >> > > [4]
>> > > >> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
>> > > >> > > en%20AND%20labels%20%3D%20sdk-consistency
>> > > >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
>> > > >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>> > > >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
>> > > >> > >
>> > > >> >
>> > > >> >
>> > > >> > Great summary, Ahmet. Thanks.
>> > > >> >
>> > > >> > Cheers,
>> > > >> >
>> > > >> > --
>> > > >> > Sergio Fernández
>> > > >> > Partner Technology Manager
>> > > >> > Redlink GmbH
>> > > >> > m: +43 6602747925
>> > > >> > e: [email protected]
>> > > >> > w: http://redlink.co
>> > > >> >
>> > > >>
>> > >
>> >
>>
>
>
