https://issues.apache.org/jira/browse/BEAM-1360
On 31 January 2017 at 12:12, Prabeesh K. <prabsma...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/BAHIR-86
>
> On 31 January 2017 at 11:10, Ahmet Altay <al...@google.com.invalid> wrote:
>
>> Hi all,
>>
>> This merge is completed. Python SDK is now officially part of the master
>> branch! Thank you all for the support. Please open an issue, if you notice
>> a reference to the now obsolete python-sdk branch in the documentation.
>>
>> There will not be any more merges to the python-sdk branch. Going forward
>> please use the master branch for Python SDK development. There are a few
>> existing open PRs to the python-sdk [1]. If you are the author of one of
>> those PRs, please rebase them on top of master.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>
>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <k...@google.com.invalid>
>> wrote:
>>
>> > To clarify the implied criteria of that last exchange, it is "An SDK
>> > should have at least one runner that can execute the complete model
>> > (may be a direct runner)"
>> >
>> > I want to highlight this, because whether an _SDK_ supports unbounded
>> > data is not particularly well-defined, and will evolve:
>> >
>> > - With the Runner API, an SDK will need to support building a graph with
>> >   unbounded constructs, as today with probably minimal changes.
>> >
>> > - With the Fn API, if any part of the Fn API is specific to unbounded
>> >   data, the SDK will need to implement it. I think right now there is no
>> >   such thing, and we don't want such a thing, so SDKs implementing the
>> >   Fn API automatically support unbounded data.
>> >
>> > - There will also likely be an SDK-specific shim just as there is today,
>> >   to leverage idiomatic deserialized representations. The richness of
>> >   this shim will decrease so that it will need to "support" unbounded
>> >   data but that will be a ~one liner.
>> >
>> > Getting the Python SDK on master will accelerate our progress towards the
>> > Fn API - partly technical, partly community - which is the best path
>> > towards support for unbounded data across multiple runners. I think the
>> > criteria are written with the completed portability framework in mind. So
>> > this exchange makes me actually more convinced we should merge python-sdk
>> > to master.
>> >
>> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw
>> > <rober...@google.com.invalid> wrote:
>> >
>> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>> > > <dhalp...@google.com.invalid> wrote:
>> > > > I do not think that Python SDK yet meets the bar [1] for implementing
>> > > > the Beam model -- supporting Unbounded data is very important. That
>> > > > said, given the committed and sustained set of contributors, it
>> > > > generally makes sense to me to make an exception in anticipation of
>> > > > these features being fleshed out soon; including potentially new
>> > > > users/contributors that would arrive once in master.
>> > > >
>> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>> > >
>> > > That is a valid point. The Python SDK supports all the unbounded parts
>> > > of the model except for unbounded sources, which was deferred while
>> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
>> > > working with the team and merging/reviewing most of their code, and
>> > > have full confidence this will be coming (and on that note can vouch
>> > > for a healthy community and support which are much harder to add
>> > > later).
>> > >
>> > > In short, I think it has the required maturity, and I'm in favor of
>> > > merging soonish.
>> > >
>> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>> > > > <al...@google.com.invalid> wrote:
>> > > >
>> > > >> Thank you all for the comments so far. I would follow the process as
>> > > >> suggested by Davor and others in this thread.
>> > > >>
>> > > >> Ahmet
>> > > >>
>> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
>> > > >> wrote:
>> > > >>
>> > > >> > Hi
>> > > >> >
>> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>> > > >> > <al...@google.com.invalid> wrote:
>> > > >> >
>> > > >> > > tl;dr: I would like to start a discussion about merging python-sdk
>> > > >> > > branch to master branch. Python SDK is mature enough and merging it
>> > > >> > > to master will accelerate its development and adoption.
>> > > >> >
>> > > >> > Good point, Ahmet!
>> > > >> >
>> > > >> > I've been closely following the development since it was imported in
>> > > >> > June. For the prototypes I've implemented so far it works quite well;
>> > > >> > I guess we'd just need to focus the next months on bringing more
>> > > >> > runner support.
>> > > >> >
>> > > >> > > With a great effort from a lot of contributors(*), Python SDK [1]
>> > > >> > > is now a mostly complete, tested, performant Python implementation
>> > > >> > > of the Beam model. Since June, when we first started with Python
>> > > >> > > SDK in Apache Beam we have been continuously improving it.
>> > > >> >
>> > > >> > I wouldn't merge during the preparation of 0.5.0 release, but after
>> > > >> > that could be a good time to merge back into master.
>> > > >> >
>> > > >> > > ** Python SDK currently supports:
>> > > >> > >
>> > > >> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
>> > > >> > >   etc.).
>> > > >> > > * IO: There are extensible APIs for writing new bounded sources and
>> > > >> > >   sinks. Implementations are provided for Text, Avro, BigQuery, and
>> > > >> > >   Datastore.
>> > > >> > > * Runners: Python SDK has an extensible base runner module that
>> > > >> > >   allows building specific runners on top of it. The SDK comes with
>> > > >> > >   two pipeline runners: DirectRunner and DataflowRunner; and it is
>> > > >> > >   possible to add more. The existing runners are currently limited
>> > > >> > >   to bounded execution and otherwise equivalent to their Java SDK
>> > > >> > >   counterparts in functionality.
>> > > >> >
>> > > >> > What would be the effort of porting, and maintaining, parallel
>> > > >> > versions of the Java runners? I guess I'd need to dig deeper in the
>> > > >> > model, but this may represent a major effort for the project, right?
>> > > >>
>> > > >> It is somewhat higher for DirectRunner because DirectRunner also
>> > > >> implements the code for execution. It is not that high for
>> > > >> DataflowRunner because the base runner module has a lot of helpers with
>> > > >> the right hooks for implementing a generic runner. I would _expect_ the
>> > > >> experience in general would be similar to the latter.
>> > > >>
>> > > >> > > * Testing: Python SDK implements ValidatesRunner test framework for
>> > > >> > >   implementing integration tests for current and future runners.
>> > > >> > >   There is unit test coverage for all modules, and a number of
>> > > >> > >   integration tests for validating existing runners.
>> > > >> > > * Documentation and examples: Documentation work has started on
>> > > >> > >   Python SDK. Beam Programming Guide page has been updated to
>> > > >> > >   include Python [2]. The code comes with many ready to use examples
>> > > >> > >   and we are in a good place to start documenting those on the
>> > > >> > >   website.
>> > > >> > >
>> > > >> > > ** We are not done yet, next on the roadmap we have:
>> > > >> > >
>> > > >> > > * Streaming: Both of the existing runners lack support for streaming
>> > > >> > >   execution, and currently there is work going on for adding
>> > > >> > >   streaming support to DirectRunner [3].
>> > > >> > > * Documentation: Filling the rest of the Beam documentation with
>> > > >> > >   Python SDK specific information and examples.
>> > > >> > > * SDK consistency: Making Python SDK consistent with the Java SDK.
>> > > >> > >   We have come a long way on this and have only a few items left
>> > > >> > >   [4].
>> > > >> > > * Beamifying: We have been working on removing Dataflow-specific
>> > > >> > >   references both from the documentation and from the code. There is
>> > > >> > >   some work left, and we are currently working on those as well [5].
>> > > >> > >
>> > > >> > > ** Steps and implications of merging to master:
>> > > >> > >
>> > > >> > > * Master branch is merged to python-sdk branch at regular intervals
>> > > >> > >   and the last merge was on 12/22. All the past merges were
>> > > >> > >   uneventful because there is a minimal overlap in modified files
>> > > >> > >   between branches. Integrating python-sdk to master will similarly
>> > > >> > >   touch a small number of existing files.
>> > > >> > >
>> > > >> > > * Python SDK is using the same tools for building and testing. It is
>> > > >> > >   already integrated with Maven, Jenkins and Travis.
>> > > >> > >   Specifically the impact to the testing infrastructure would be:
>> > > >> > >   - There will be two additional test configurations in Travis.
>> > > >> > >     Since Travis runs all configurations in parallel there should
>> > > >> > >     not be a noticeable change in the Travis run time.
>> > > >> > >   - Jenkins pre-commit test will start running the Python SDK tests.
>> > > >> > >     It will add an additional 5 minutes to the completion time of
>> > > >> > >     pre-commit test. Historically Python SDK tests were not flaky
>> > > >> > >     and did not cause any random failures.
>> > > >> > >   - Jenkins Python post-commit test is already separated from the
>> > > >> > >     other post-commit tests and will continue to exist. It would not
>> > > >> > >     change the testing time for any other test.
>> > > >> > >
>> > > >> > > * The release process needs to be updated to accommodate releasing
>> > > >> > >   Python artifacts. Python SDK would fit in the existing release
>> > > >> > >   schedule and could be released along with the Java SDK. The
>> > > >> > >   additional steps would include:
>> > > >> > >   - Generating Python artifacts. This could be done with a single
>> > > >> > >     command using Maven today.
>> > > >> > >   - Publishing the artifacts to a central repository such as PyPI.
>> > > >> >
>> > > >> > I'm more than happy to help on this. We left some things open on
>> > > >> > purpose when we added Maven support to the Python build.
>> > > >>
>> > > >> That would be awesome. We can coordinate on that post-merge.
>> > > >>
>> > > >> > >   - Updating the release guide to reflect the changes above.
>> > > >> > >
>> > > >> > > * Users: There are existing users using the Python SDK.
>> > > >> > >   To give a rough estimate, a distribution of the Beam Python SDK
>> > > >> > >   had a total of 23K downloads in the past 6 months [6]. Some of
>> > > >> > >   those users are already engaged with the community (e.g. [7]).
>> > > >> > >   There might be an increased amount of engagement from the rest of
>> > > >> > >   them after the merge.
>> > > >> >
>> > > >> > Python 3 support is something we definitely need to look ahead to.
>> > > >> > I'd try to make the codebase compatible with both 2.7.x and 3.6.x,
>> > > >> > rather than using other solutions like 2to3.
>> > > >>
>> > > >> I agree with you. I think it makes more sense to make codebase
>> > > >> compatible with both. As you mentioned Python 3 support is not a
>> > > >> short-term goal in the roadmap, and we can discuss it more as we
>> > > >> approach that.
>> > > >>
>> > > >> > > Looking forward to hearing your thoughts and comments on
>> > > >> > > “graduating” python-sdk to the master.
>> > > >> > >
>> > > >> > > Thank you,
>> > > >> > > Ahmet
>> > > >> > >
>> > > >> > > (*) Python SDK branch currently has a diverse group of contributors.
>> > > >> > > Regular contributors include Charles Chen, Chamikara Jayalath, María
>> > > >> > > García Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache
>> > > >> > > Beam PMC), Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>> > > >> > > contributions from Abdullah Bashir, Marco Buccini, Sergio Fernández,
>> > > >> > > Seunghyun Lee, and Younghee Kwon.
>> > > >> > >
>> > > >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>> > > >> > > [2] https://beam.apache.org/documentation/programming-guide/
>> > > >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
>> > > >> > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>> > > >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
>> > > >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>> > > >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
>> > > >> >
>> > > >> > Great summary, Ahmet. Thanks.
>> > > >> >
>> > > >> > Cheers,
>> > > >> >
>> > > >> > --
>> > > >> > Sergio Fernández
>> > > >> > Partner Technology Manager
>> > > >> > Redlink GmbH
>> > > >> > m: +43 6602747925
>> > > >> > e: sergio.fernan...@redlink.co
>> > > >> > w: http://redlink.co