Originally we integrated the Python build into Maven under the default profile.
Do you think it would be better to have it under a separate profile?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Just to be clear, the prerequisites for building the Python SDK are:
>
> apt-get install python-setuptools
> apt-get install python-pip
>
> They are also required by the default "regular" build.
>
> Regards
> JB
>
>
> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>
>> Just one thing I noticed (and can be helpful for others): to build Beam
>> we now need python setuptools installed.
>>
>> For instance, on Ubuntu, you have to do:
>>
>> apt-get install python-setuptools
>>
>> Same for the pip distribution.
>>
>> I guess (if not already done), we have to update README/Building
>> instructions.
>>
>> Correct ?
>>
>> Regards
>> JB
>>
>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>
>>> Hi all,
>>>
>>> This merge is completed. Python SDK is now officially part of the master
>>> branch! Thank you all for the support. Please open an issue if you
>>> notice a reference to the now obsolete python-sdk branch in the
>>> documentation.
>>>
>>> There will not be any more merges to the python-sdk branch. Going forward
>>> please use the master branch for Python SDK development. There are a few
>>> existing open PRs to the python-sdk [1]. If you are the author of one of
>>> those PRs, please rebase them on top of master.
>>>
>>> Thank you,
>>> Ahmet
>>>
>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>>
>>>
>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>> <k...@google.com.invalid>
>>> wrote:
>>>
>>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>> should have at least one runner that can execute the complete model
>>>> (may be a direct runner)"
>>>>
>>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>>> data is not particularly well-defined, and will evolve:
>>>>
>>>>  - With the Runner API, an SDK will need to support building a graph
>>>> with unbounded constructs, as today with probably minimal changes.
>>>>
>>>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>>>> data, the SDK will need to implement it. I think right now there is no
>>>> such thing, and we don't want such a thing, so SDKs implementing the Fn
>>>> API automatically support unbounded data.
>>>>
>>>>  - There will also likely be an SDK-specific shim just as there is
>>>> today, to leverage idiomatic deserialized representations. The richness
>>>> of this shim will decrease so that it will need to "support" unbounded
>>>> data but that will be a ~one liner.
>>>>
>>>> Getting the Python SDK on master will accelerate our progress towards
>>>> the Fn API - partly technical, partly community - which is the best
>>>> path towards support for unbounded data across multiple runners. I
>>>> think the criteria are written with the completed portability framework
>>>> in mind. So this exchange makes me actually more convinced we should
>>>> merge python-sdk to master.
>>>>
>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>>> rober...@google.com.invalid> wrote:
>>>>
>>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>> <dhalp...@google.com.invalid> wrote:
>>>>>
>>>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>>>>> the Beam model -- supporting Unbounded data is very important. That
>>>>>> said, given the committed and sustained set of contributors, it
>>>>>> generally makes sense to me to make an exception in anticipation of
>>>>>> these features being fleshed out soon; including potentially new
>>>>>> users/contributors that would arrive once in master.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>>>> k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>>>>>
>>>>> That is a valid point. The Python SDK supports all the unbounded parts
>>>>> of the model except for unbounded sources, which was deferred while
>>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've been
>>>>> working with the team and merging/reviewing most of their code, and
>>>>> have full confidence this will be coming (and on that note can vouch
>>>>> for a healthy community and support which are much harder to add
>>>>> later).
>>>>>
>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>> merging soonish.
>>>>>
>>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>> <al...@google.com.invalid> wrote:
>>>>>>
>>>>>>> Thank you all for the comments so far. I would follow the process as
>>>>>>> suggested by Davor and others in this thread.
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández
>>>>>>> <wik...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>> <al...@google.com.invalid> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> tl;dr: I would like to start a discussion about merging python-sdk
>>>>>>>>> branch to master branch. Python SDK is mature enough and merging it
>>>>>>>>> to master will accelerate its development and adoption.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Good point, Ahmet!
>>>>>>>>
>>>>>>>> I've been following the development closely since it was imported in
>>>>>>>> June. For the prototypes I've implemented so far it works quite well;
>>>>>>>> I guess we'd just need to focus the next months on bringing support
>>>>>>>> for more runners.
>>>>>>>>
>>>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>>>>>>>> now a mostly complete, tested, performant Python implementation of
>>>>>>>>> the Beam model. Since June, when we first started with Python SDK in
>>>>>>>>> Apache Beam we have been continuously improving it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but after
>>>>>>>> that could be a good time to merge back into master.
>>>>>>>>
>>>>>>>>
>>>>>>>>> ** Python SDK currently supports:
>>>>>>>>>
>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey, Windowing
>>>>>>>>> etc.).
>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources and
>>>>>>>>> sinks. Implementations are provided for Text, Avro, BigQuery, and
>>>>>>>>> Datastore.
>>>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>> allows building specific runners on top of it. The SDK comes with
>>>>>>>>> two pipeline runners: DirectRunner and DataflowRunner; and it is
>>>>>>>>> possible to add more. The existing runners are currently limited to
>>>>>>>>> bounded execution and otherwise equivalent to their Java SDK
>>>>>>>>> counterparts in functionality.
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>> What would be the effort of porting, and maintaining, parallel
>>>>>>>> versions of the Java runners? I guess I'd need to dig deeper into the
>>>>>>>> model, but this may represent a major effort for the project, right?
>>>>>>>>
>>>>>>>>
>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>> implements the code for execution. It is not that high for
>>>>>>> DataflowRunner because the base runner module has a lot of helpers
>>>>>>> with the right hooks for implementing a generic runner. I would
>>>>>>> _expect_ the experience in general would be similar to the latter.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework for
>>>>>>>>> implementing integration tests for current and future runners. There
>>>>>>>>> is unit test coverage for all modules, and a number of integration
>>>>>>>>> tests for validating existing runners.
>>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>> Python SDK. Beam Programming Guide page has been updated to include
>>>>>>>>> Python [2]. The code comes with many ready to use examples and we
>>>>>>>>> are in a good place to start documenting those on the website.
>>>>>>>>>
>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>
>>>>>>>>> * Streaming: Both of the existing runners lack support for streaming
>>>>>>>>> execution, and currently there is work going on for adding streaming
>>>>>>>>> support to DirectRunner [3].
>>>>>>>>> * Documentation: Filling the rest of the Beam documentation with
>>>>>>>>> Python SDK specific information and examples.
>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java SDK. We
>>>>>>>>> have come a long way on this and have only a few items left [4].
>>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>>>> references both from the documentation and from the code. There is
>>>>>>>>> some work left, and we are currently working on those as well [5].
>>>>>>>>>
>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>
>>>>>>>>> * Master branch is merged to python-sdk branch at regular intervals
>>>>>>>>> and the last merge was on 12/22. All the past merges were uneventful
>>>>>>>>> because there is a minimal overlap in modified files between
>>>>>>>>> branches. Integrating python-sdk to master will similarly touch a
>>>>>>>>> small number of existing files.
>>>>>>>>>
>>>>>>>>> * Python SDK is using the same tools for building and testing. It is
>>>>>>>>> already integrated with Maven, Jenkins and Travis. Specifically the
>>>>>>>>> impact to the testing infrastructure would be:
>>>>>>>>> - There will be two additional test configurations in Travis. Since
>>>>>>>>> Travis runs all configurations in parallel there should not be a
>>>>>>>>> noticeable change in the Travis run time.
>>>>>>>>> - Jenkins pre-commit test will start running the Python SDK tests. It
>>>>>>>>> will add an additional 5 minutes to the completion time of
>>>>>>>>> pre-commit test. Historically Python SDK tests were not flaky and
>>>>>>>>> did not cause any random failures.
>>>>>>>>> - Jenkins Python post-commit test is already separated from the other
>>>>>>>>> post-commit tests and will continue to exist. It would not change
>>>>>>>>> the testing time for any other test.
>>>>>>>>>
>>>>>>>>> * The release process needs to be updated to accommodate releasing
>>>>>>>>> Python artifacts. Python SDK would fit in the existing release
>>>>>>>>> schedule and could be released along with the Java SDK. The
>>>>>>>>> additional steps would include:
>>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>>>>>>> command using Maven today.
>>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I'm more than happy to help on this. We left on purpose some things
>>>>>>>> open when we added Maven support to the Python build.
>>>>>>>>
>>>>>>>>
>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>>
>>>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>>>>>>>> rough estimate, a distribution of the Beam Python SDK had a total of
>>>>>>>>> 23K downloads in the past 6 months [6]. Some of those users are
>>>>>>>>> already engaged with the community (e.g. [7]). There might be an
>>>>>>>>> increased amount of engagement from the rest of them after the
>>>>>>>>> merge.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Python 3 support is something we definitely need to plan for. I'd try
>>>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather than
>>>>>>>> using other solutions like 2to3.
>>>>>>>>
>>>>>>>>
>>>>>>> I agree with you. I think it makes more sense to make the codebase
>>>>>>> compatible with both. As you mentioned Python 3 support is not a
>>>>>>> short-term goal in the roadmap, and we can discuss it more as we
>>>>>>> approach that.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Looking forward to hearing your thoughts and comments on “graduating”
>>>>>>>>> python-sdk to the master.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Ahmet
>>>>>>>>>
>>>>>>>>> (*) Python SDK branch currently has a diverse group of contributors.
>>>>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath, María
>>>>>>>>> García Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>>>>>>>> PMC), Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>>>>>>>> contributions from Abdullah Bashir, Marco Buccini, Sergio Fernández,
>>>>>>>>> Seunghyun Lee, and Younghee Kwon.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>> [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sergio Fernández
>>>>>>>> Partner Technology Manager
>>>>>>>> Redlink GmbH
>>>>>>>> m: +43 6602747925
>>>>>>>> e: sergio.fernan...@redlink.co
>>>>>>>> w: http://redlink.co
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co
