Originally we integrated the build into Maven under the default profile. Do you think it would be better to have it under a separate profile?
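As a rough sketch of what a separate profile could look like: the profile id `python-sdk` and the module name `python` below are assumptions for illustration only, not taken from the actual Beam POMs.

```xml
<!-- Sketch only: would live in the parent POM that currently lists the
     Python module. The id "python-sdk" and module "python" are hypothetical. -->
<profiles>
  <profile>
    <id>python-sdk</id>
    <modules>
      <!-- Built only when this profile is active -->
      <module>python</module>
    </modules>
  </profile>
</profiles>
```

With something like this, the Python module would be built only when the profile is requested explicitly, e.g. `mvn clean install -Ppython-sdk`, provided the module is also removed from the default `<modules>` list.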

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Just to be clear, the prerequisites for building the Python SDK are:
>
> apt-get install python-setuptools
> apt-get install python-pip
>
> They are also required by the default "regular" build.
>
> Regards
> JB
>
> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>
>> Just one thing I noticed (and it can be helpful for others): to build
>> Beam we now need python setuptools installed.
>>
>> For instance, on Ubuntu, you have to do:
>>
>> apt-get install python-setuptools
>>
>> Same for the pip distribution.
>>
>> I guess (if not already done), we have to update the README/Building
>> instructions.
>>
>> Correct?
>>
>> Regards
>> JB
>>
>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>
>>> Hi all,
>>>
>>> This merge is completed. Python SDK is now officially part of the
>>> master branch! Thank you all for the support. Please open an issue if
>>> you notice a reference to the now-obsolete python-sdk branch in the
>>> documentation.
>>>
>>> There will not be any more merges to the python-sdk branch. Going
>>> forward please use the master branch for Python SDK development. There
>>> are a few existing open PRs to the python-sdk [1]. If you are the
>>> author of one of those PRs, please rebase them on top of master.
>>>
>>> Thank you,
>>> Ahmet
>>>
>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>>
>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>> <k...@google.com.invalid> wrote:
>>>
>>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>> should have at least one runner that can execute the complete model
>>>> (may be a direct runner)"
>>>>
>>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>>> data is not particularly well-defined, and will evolve:
>>>>
>>>> - With the Runner API, an SDK will need to support building a graph
>>>>   with unbounded constructs, as today with probably minimal changes.
>>>>
>>>> - With the Fn API, if any part of the Fn API is specific to unbounded
>>>>   data, the SDK will need to implement it. I think right now there is
>>>>   no such thing, and we don't want such a thing, so SDKs implementing
>>>>   the Fn API automatically support unbounded data.
>>>>
>>>> - There will also likely be an SDK-specific shim just as there is
>>>>   today, to leverage idiomatic deserialized representations. The
>>>>   richness of this shim will decrease so that it will need to
>>>>   "support" unbounded data but that will be a ~one liner.
>>>>
>>>> Getting the Python SDK on master will accelerate our progress towards
>>>> the Fn API - partly technical, partly community - which is the best
>>>> path towards support for unbounded data across multiple runners. I
>>>> think the criteria are written with the completed portability
>>>> framework in mind. So this exchange makes me actually more convinced
>>>> we should merge python-sdk to master.
>>>>
>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw
>>>> <rober...@google.com.invalid> wrote:
>>>>
>>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>> <dhalp...@google.com.invalid> wrote:
>>>>>
>>>>>> I do not think that Python SDK yet meets the bar [1] for
>>>>>> implementing the Beam model -- supporting Unbounded data is very
>>>>>> important. That said, given the committed and sustained set of
>>>>>> contributors, it generally makes sense to me to make an exception
>>>>>> in anticipation of these features being fleshed out soon; including
>>>>>> potentially new users/contributors that would arrive once in
>>>>>> master.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>>>>>
>>>>> That is a valid point. The Python SDK supports all the unbounded
>>>>> parts of the model except for unbounded sources, which was deferred
>>>>> while seeing how https://s.apache.org/splittable-do-fn played out.
>>>>> I've been working with the team and merging/reviewing most of their
>>>>> code, and have full confidence this will be coming (and on that note
>>>>> can vouch for a healthy community and support which are much harder
>>>>> to add later).
>>>>>
>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>> merging soonish.
>>>>>
>>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>> <al...@google.com.invalid> wrote:
>>>>>>
>>>>>>> Thank you all for the comments so far. I would follow the process
>>>>>>> as suggested by Davor and others in this thread.
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández
>>>>>>> <wik...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>> <al...@google.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> tl;dr: I would like to start a discussion about merging
>>>>>>>>> python-sdk branch to master branch. Python SDK is mature enough
>>>>>>>>> and merging it to master will accelerate its development and
>>>>>>>>> adoption.
>>>>>>>>
>>>>>>>> Good point, Ahmet!
>>>>>>>>
>>>>>>>> I've been following the development closely since it was imported
>>>>>>>> in June. For the prototypes I've implemented so far it works
>>>>>>>> quite well; I guess we'd just need to focus the next months on
>>>>>>>> bringing support for more runners.
>>>>>>>>
>>>>>>>>> With a great effort from a lot of contributors(*), Python SDK
>>>>>>>>> [1] is now a mostly complete, tested, performant Python
>>>>>>>>> implementation of the Beam model. Since June, when we first
>>>>>>>>> started with Python SDK in Apache Beam we have been continuously
>>>>>>>>> improving it.
>>>>>>>>
>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but
>>>>>>>> after that could be a good time to merge back into master.
>>>>>>>>
>>>>>>>>> ** Python SDK currently supports:
>>>>>>>>>
>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>>>>>>>>   Windowing etc.).
>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>>>>>>>>   and sinks. Implementations are provided for Text, Avro,
>>>>>>>>>   BigQuery, and Datastore.
>>>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>>   allows building specific runners on top of it. The SDK comes
>>>>>>>>>   with two pipeline runners: DirectRunner and DataflowRunner;
>>>>>>>>>   and it is possible to add more. The existing runners are
>>>>>>>>>   currently limited to bounded execution and otherwise
>>>>>>>>>   equivalent to their Java SDK counterparts in functionality.
>>>>>>>>
>>>>>>>> What would be the effort of porting, and maintaining, parallel
>>>>>>>> versions of the Java runners? I guess I'd need to dig deeper in
>>>>>>>> the model, but this may represent a major effort for the project,
>>>>>>>> right?
>>>>>>>
>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>> implements the code for execution. It is not that high for
>>>>>>> DataflowRunner because the base runner module has a lot of helpers
>>>>>>> with the right hooks for implementing a generic runner. I would
>>>>>>> _expect_ the experience in general would be similar to the latter.
>>>>>>>
>>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>>>>>>>>   for implementing integration tests for current and future
>>>>>>>>>   runners. There is unit test coverage for all modules, and a
>>>>>>>>>   number of integration tests for validating existing runners.
>>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>>   Python SDK. Beam Programming Guide page has been updated to
>>>>>>>>>   include Python [2].
>>>>>>>>>   The code comes with many ready-to-use examples and we are in a
>>>>>>>>>   good place to start documenting those on the website.
>>>>>>>>>
>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>
>>>>>>>>> * Streaming: Both of the existing runners lack support for
>>>>>>>>>   streaming execution, and currently there is work going on for
>>>>>>>>>   adding streaming support to DirectRunner [3].
>>>>>>>>> * Documentation: Filling the rest of the Beam documentation with
>>>>>>>>>   Python SDK specific information and examples.
>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java
>>>>>>>>>   SDK. We have come a long way on this and have only a few items
>>>>>>>>>   left [4].
>>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>>>>   references both from the documentation and from the code.
>>>>>>>>>   There is some work left, and we are currently working on those
>>>>>>>>>   as well [5].
>>>>>>>>>
>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>
>>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>>>>>>>>   intervals and the last merge was on 12/22. All the past merges
>>>>>>>>>   were uneventful because there is a minimal overlap in modified
>>>>>>>>>   files between branches. Integrating python-sdk to master will
>>>>>>>>>   similarly touch a small number of existing files.
>>>>>>>>>
>>>>>>>>> * Python SDK is using the same tools for building and testing.
>>>>>>>>>   It is already integrated with Maven, Jenkins and Travis.
>>>>>>>>>   Specifically the impact to the testing infrastructure would
>>>>>>>>>   be:
>>>>>>>>>   - There will be two additional test configurations in Travis.
>>>>>>>>>     Since Travis runs all configurations in parallel there
>>>>>>>>>     should not be a noticeable change in the Travis run time.
>>>>>>>>>   - Jenkins pre-commit test will start running the Python SDK
>>>>>>>>>     tests. It will add an additional 5 minutes to the completion
>>>>>>>>>     time of pre-commit test. Historically Python SDK tests were
>>>>>>>>>     not flaky and did not cause any random failures.
>>>>>>>>>   - Jenkins Python post-commit test is already separated from
>>>>>>>>>     the other post-commit tests and will continue to exist. It
>>>>>>>>>     would not change the testing time for any other test.
>>>>>>>>>
>>>>>>>>> * The release process needs to be updated to accommodate
>>>>>>>>>   releasing Python artifacts. Python SDK would fit in the
>>>>>>>>>   existing release schedule and could be released along with the
>>>>>>>>>   Java SDK. The additional steps would include:
>>>>>>>>>   - Generating Python artifacts. This could be done with a
>>>>>>>>>     single command using Maven today.
>>>>>>>>>   - Publishing the artifacts to a central repository such as
>>>>>>>>>     PyPI.
>>>>>>>>
>>>>>>>> I'm more than happy to help on this. We left on purpose some
>>>>>>>> things open when we added Maven support to the Python build.
>>>>>>>
>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>
>>>>>>>>>   - Updating the release guide to reflect the changes above.
>>>>>>>>>
>>>>>>>>> * Users: There are existing users using the Python SDK. To give
>>>>>>>>>   a rough estimate, a distribution of the Beam Python SDK had a
>>>>>>>>>   total of 23K downloads in the past 6 months [6]. Some of those
>>>>>>>>>   users are already engaged with the community (e.g. [7]). There
>>>>>>>>>   might be an increased amount of engagement from the rest of
>>>>>>>>>   them after the merge.
>>>>>>>>
>>>>>>>> Python 3 support is something we definitely need to look ahead
>>>>>>>> to. I'd try to make the codebase compatible with both 2.7.x and
>>>>>>>> 3.6.x, rather than using other solutions like 2to3.
>>>>>>>
>>>>>>> I agree with you. I think it makes more sense to make the codebase
>>>>>>> compatible with both. As you mentioned Python 3 support is not a
>>>>>>> short-term goal in the roadmap, and we can discuss it more as we
>>>>>>> approach that.
>>>>>>>
>>>>>>>>> Looking forward to hearing your thoughts and comments on
>>>>>>>>> “graduating” python-sdk to the master.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Ahmet
>>>>>>>>>
>>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>>>>>>>> contributors. Regular contributors include Charles Chen,
>>>>>>>>> Chamikara Jayalath, María García Herrero, Mark Liu, Pablo
>>>>>>>>> Estrada, Robert Bradshaw (Apache Beam PMC), Sourabh Bajaj, and
>>>>>>>>> Vikas Kedigehalli. We have also had contributions from Abdullah
>>>>>>>>> Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
>>>>>>>>> Younghee Kwon.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>> [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>
>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sergio Fernández
>>>>>>>> Partner Technology Manager
>>>>>>>> Redlink GmbH
>>>>>>>> m: +43 6602747925
>>>>>>>> e: sergio.fernan...@redlink.co
>>>>>>>> w: http://redlink.co
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co