Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Ahmet Altay
Thank you Prabeesh and Sergio for fixing those!

On Tue, Jan 31, 2017 at 4:51 AM, Jean-Baptiste Onofré wrote:

> Awesome, thanks Sergio ! Much appreciated ;)
>
> Regards
> JB
>
>
> On 01/31/2017 01:42 PM, Sergio Fernández wrote:
>
>> PR #1879 provides the basics: https://github.com/apache/beam/pull/1879

Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Jean-Baptiste Onofré

Awesome, thanks Sergio ! Much appreciated ;)

Regards
JB

On 01/31/2017 01:42 PM, Sergio Fernández wrote:

> PR #1879 provides the basics: https://github.com/apache/beam/pull/1879


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Sergio Fernández
PR #1879 provides the basics: https://github.com/apache/beam/pull/1879

On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré wrote:

> No, that's fine as long as we clearly document the prerequisites for the
> build. IMHO, we should provide quick BUILDING instructions in the README.md.
>
> Regards
> JB

Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Jean-Baptiste Onofré
No, that's fine as long as we clearly document the prerequisites for the
build. IMHO, we should provide quick BUILDING instructions in the README.md.


Regards
JB

On 01/31/2017 01:24 PM, Sergio Fernández wrote:

> Originally we integrated the build into Maven under the default profile.
> Do you feel like it'd be better to have it under a separate profile or so?


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Sergio Fernández
Originally we integrated the build into Maven under the default profile.
Do you feel like it'd be better to have it under a separate profile or so?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré wrote:

> Just to be clear, the prerequisites to be able to build the Python SDK are:
>
> apt-get install python-setuptools
> apt-get install python-pip
>
> They are also required by the default "regular" build.
>
> Regards
> JB

Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Jean-Baptiste Onofré

Just to be clear, the prerequisites to be able to build the Python SDK are:

apt-get install python-setuptools
apt-get install python-pip

They are also required by the default "regular" build.

Regards
JB
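
A minimal sketch of how the prerequisites above could be checked before starting a build. The helper name `missing_commands` and the assumption that the build calls `python` and `pip` from the PATH are illustrative only and are not part of Beam's build; setuptools is a Python module rather than a command, so it is probed with an import instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of Beam): print every named command
# that cannot be found on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Assumption: the build invokes `python` and `pip` from PATH.
missing=$(missing_commands python pip)
if [ -n "$missing" ]; then
    echo "Missing build prerequisites:" $missing >&2
    echo "On Ubuntu: apt-get install python-setuptools python-pip" >&2
fi

# setuptools is a module, so check that it can be imported.
if command -v python >/dev/null 2>&1; then
    python -c 'import setuptools' 2>/dev/null ||
        echo "setuptools not importable: apt-get install python-setuptools" >&2
fi
```

The script only reports what it finds on standard error and never installs anything, so it should be safe to run on a machine where the packages are managed some other way.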

On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:

> Just one thing I noticed (and it may be helpful for others): to build Beam
> we now need Python setuptools installed.
>
> For instance, on Ubuntu, you have to do:
>
> apt-get install python-setuptools
>
> Same for the pip distribution.
>
> I guess (if not already done), we have to update README/Building
> instructions.
>
> Correct?
>
> Regards
> JB


Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Prabeesh K.
https://issues.apache.org/jira/browse/BEAM-1360

On 31 January 2017 at 12:12, Prabeesh K.  wrote:

> https://issues.apache.org/jira/browse/BAHIR-86
>
> On 31 January 2017 at 11:10, Ahmet Altay  wrote:
>
>> Hi all,
>>
>> This merge is completed. Python SDK is now officially part of the master
>> branch! Thank you all for the support. Please open an issue, if you notice
>> a reference to the now obsolete python-sdk branch in the documentation.
>>
>> There will not be any more merges to the python-sdk branch. Going forward
>> please use the master branch for Python SDK development. There are a few
>> existing open PRs to the python-sdk [1]. If you are the author of one of
>> those PRs, please rebase them on top of master.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/pulls?utf8=✓=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>
>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles wrote:
>>
>> > To clarify the implied criteria of that last exchange, it is "An SDK
>> > should have at least one runner that can execute the complete model
>> > (may be a direct runner)"
>> >
>> > I want to highlight this, because whether an _SDK_ supports unbounded
>> > data is not particularly well-defined, and will evolve:
>> >
>> >  - With the Runner API, an SDK will need to support building a graph
>> > with unbounded constructs, as today with probably minimal changes.
>> >
>> >  - With the Fn API, if any part of the Fn API is specific to unbounded
>> > data, the SDK will need to implement it. I think right now there is no
>> > such thing, and we don't want such a thing, so SDKs implementing the Fn
>> > API automatically support unbounded data.
>> >
>> >  - There will also likely be an SDK-specific shim just as there is
>> > today, to leverage idiomatic deserialized representations. The richness
>> > of this shim will decrease so that it will need to "support" unbounded
>> > data but that will be a ~one liner.
>> >
>> > Getting the Python SDK on master will accelerate our progress towards
>> > the Fn API - partly technical, partly community - which is the best path
>> > towards support for unbounded data across multiple runners. I think the
>> > criteria are written with the completed portability framework in mind.
>> > So this exchange makes me actually more convinced we should merge
>> > python-sdk to master.
>> >
>> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>> > rober...@google.com.invalid> wrote:
>> >
>> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin wrote:
>> > > > I do not think that Python SDK yet meets the bar [1] for
>> > > > implementing the Beam model -- supporting Unbounded data is very
>> > > > important. That said, given the committed and sustained set of
>> > > > contributors, it generally makes sense to me to make an exception
>> > > > in anticipation of these features being fleshed out soon; including
>> > > > potentially new users/contributors that would arrive once in master.
>> > > >
>> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>> > >
>> > > That is a valid point. The Python SDK supports all the unbounded parts
>> > > of the model except for unbounded sources, which was deferred while
>> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
>> > > working with the team and merging/reviewing most of their code, and
>> > > have full confidence this will be coming (and on that note can vouch
>> > > for a healthy community and support which are much harder to add
>> > > later).
>> > >
>> > > In short, I think it has the required maturity, and I'm in favor of
>> > > merging soonish.
>> > >
>> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay wrote:
>> > > >
>> > > >> Thank you all for the comments so far. I would follow the process
>> > > >> as suggested by Davor and others in this thread.
>> > > >>
>> > > >> Ahmet
>> > > >>
>> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
>> > > >> wrote:
>> > > >>
>> > > >> > Hi
>> > > >> >
>> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay wrote:
>> > > >> > >
>> > > >> > > tl;dr: I would like to start a discussion about merging
>> > > >> > > python-sdk branch to master branch. Python SDK is mature enough
>> > > >> > > and merging it to master will accelerate its development and
>> > > >> > > adoption.
>> > > >> > >
>> > > >> >
>> > > >> > Good point, Ahmet!
>> > > >> >
>> > > >> > I've following closed the development 

Re: [DISCUSS] Python SDK status and next steps

2017-01-31 Thread Prabeesh K.
https://issues.apache.org/jira/browse/BAHIR-86

On 31 January 2017 at 11:10, Ahmet Altay  wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=✓=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK
> > should have at least one runner that can execute the complete model
> > (may be a direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded
> > data is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no
> > such thing, and we don't want such a thing, so SDKs implementing the Fn
> > API automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > >  wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing
> > the
> > > > Beam model -- supporting Unbounded data is very important. That said,
> > > given
> > > > the committed and sustained set of contributors, it generally makes
> > sense
> > > > to me to make an exception in anticipation of these features being
> > > fleshed
> > > > out soon; including potentially new users/contributors that would
> > arrive
> > > > once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
> > > > >> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging
> python-sdk
> > > >> branch
> > > >> > > to master branch. Python SDK is mature enough and merging it to
> > > master
> > > >> > will
> > > >> > > accelerate its development and adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > > >> > I've been closely following the development since it was imported
> > > > >> > in June. For the prototypes I've implemented so far it works quite
> > > > >> > well; I guess we'd just need to focus the next months on bringing
> > > > >> > support for more runners.
> > > >> >
> > > >> > With a great effort from a lot of 

Re: [DISCUSS] Python SDK status and next steps

2017-01-30 Thread Davor Bonaci
Great -- congratulations to everyone who has contributed to the Python SDK!

On Mon, Jan 30, 2017 at 11:10 PM, Ahmet Altay 
wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles 
> wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK
> should
> > have at least one runner that can execute the complete model (may be a
> > direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded
> data
> > is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no
> such
> > thing, and we don't want such a thing, so SDKs implementing the Fn API
> > automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > >  wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing
> > the
> > > > Beam model -- supporting Unbounded data is very important. That said,
> > > given
> > > > the committed and sustained set of contributors, it generally makes
> > sense
> > > > to me to make an exception in anticipation of these features being
> > > fleshed
> > > > out soon; including potentially new users/contributors that would
> > arrive
> > > > once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
> > > > >> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging
> python-sdk
> > > >> branch
> > > >> > > to master branch. Python SDK is mature enough and merging it to
> > > master
> > > >> > will
> > > >> > > accelerate its development and adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > > >> > I've been closely following the development since it was imported
> > > > >> > in June. For the prototypes I've implemented so far it works quite
> > > > >> > well; I guess we'd just need to focus the next months on bringing
> > > > >> > support for more runners.
> > > >> >
> > > >> > 

Re: [DISCUSS] Python SDK status and next steps

2017-01-30 Thread Ahmet Altay
Hi all,

This merge is completed. Python SDK is now officially part of the master
branch! Thank you all for the support. Please open an issue, if you notice
a reference to the now obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going forward
please use the master branch for Python SDK development. There are a few
existing open PRs to the python-sdk [1]. If you are the author of one of
those PRs, please rebase them on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+


On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles 
wrote:

> To clarify the implied criteria of that last exchange, it is "An SDK should
> have at least one runner that can execute the complete model (may be a
> direct runner)"
>
> I want to highlight this, because whether an _SDK_ supports unbounded data
> is not particularly well-defined, and will evolve:
>
>  - With the Runner API, an SDK will need to support building a graph with
> unbounded constructs, as today with probably minimal changes.
>
>  - With the Fn API, if any part of the Fn API is specific to unbounded
> data, the SDK will need to implement it. I think right now there is no such
> thing, and we don't want such a thing, so SDKs implementing the Fn API
> automatically support unbounded data.
>
>  - There will also likely be an SDK-specific shim just as there is today,
> to leverage idiomatic deserialized representations. The richness of this
> shim will decrease so that it will need to "support" unbounded data but
> that will be a ~one liner.
>
> Getting the Python SDK on master will accelerate our progress towards the
> Fn API - partly technical, partly community - which is the best path
> towards support for unbounded data across multiple runners. I think the
> criteria are written with the completed portability framework in mind. So
> this exchange makes me actually more convinced we should merge python-sdk
> to master.
>
> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> >  wrote:
> > > I do not think that Python SDK yet meets the bar [1] for implementing
> the
> > > Beam model -- supporting Unbounded data is very important. That said,
> > given
> > > the committed and sustained set of contributors, it generally makes
> sense
> > > to me to make an exception in anticipation of these features being
> > fleshed
> > > out soon; including potentially new users/contributors that would
> arrive
> > > once in master.
> > >
> > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
> >
> > That is a valid point. The Python SDK supports all the unbounded parts
> > of the model except for unbounded sources, which was deferred while
> > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > working with the team and merging/reviewing most of their code, and
> > have full confidence this will be coming (and on that note can vouch
> > for a healthy community and support which are much harder to add
> > later).
> >
> > In short, I think it has the required maturity, and I'm in favor of
> > merging soonish.
> >
> > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay wrote:
> > >
> > >> Thank you all for the comments so far. I would follow the process as
> > >> suggested by Davor and others in this thread.
> > >>
> > >> Ahmet
> > >>
> > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández wrote:
> > >>
> > >> > Hi
> > >> >
> > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay wrote:
> > >> > >
> > >> > > tl;dr: I would like to start a discussion about merging python-sdk
> > >> branch
> > >> > > to master branch. Python SDK is mature enough and merging it to
> > master
> > >> > will
> > >> > > accelerate its development and adoption.
> > >> > >
> > >> >
> > >> > Good point, Ahmet!
> > >> >
> > >> > I've been closely following the development since it was imported in
> > >> > June. For the prototypes I've implemented so far it works quite well;
> > >> > I guess we'd just need to focus the next months on bringing support
> > >> > for more runners.
> > >> >
> > >> > With a great effort from a lot of contributors(*), Python SDK [1] is
> > now
> > >> a
> > >> > > mostly complete, tested, performant Python implementation of the
> > Beam
> > >> > > model. Since June, when we first started with Python SDK in Apache
> > Beam
> > >> > we
> > >> > > have been continuously improving it.
> > >> > >
> > >> >
> > >> > I wouldn't merge during the preparation of 0.5.0 release, but after
> > that
> > >> > could be a good time to merge back into 

Re: [DISCUSS] Python SDK status and next steps

2017-01-20 Thread Kenneth Knowles
To clarify the implied criteria of that last exchange, it is "An SDK should
have at least one runner that can execute the complete model (may be a
direct runner)"

I want to highlight this, because whether an _SDK_ supports unbounded data
is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph with
unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is no such
thing, and we don't want such a thing, so SDKs implementing the Fn API
automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is today,
to leverage idiomatic deserialized representations. The richness of this
shim will decrease so that it will need to "support" unbounded data but
that will be a ~one liner.

Getting the Python SDK on master will accelerate our progress towards the
Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think the
criteria are written with the completed portability framework in mind. So
this exchange makes me actually more convinced we should merge python-sdk
to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>  wrote:
> > I do not think that Python SDK yet meets the bar [1] for implementing the
> > Beam model -- supporting Unbounded data is very important. That said,
> given
> > the committed and sustained set of contributors, it generally makes sense
> > to me to make an exception in anticipation of these features being
> fleshed
> > out soon; including potentially new users/contributors that would arrive
> > once in master.
> >
> > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>
> That is a valid point. The Python SDK supports all the unbounded parts
> of the model except for unbounded sources, which was deferred while
> seeing how https://s.apache.org/splittable-do-fn played out. I've been
> working with the team and merging/reviewing most of their code, and
> have full confidence this will be coming (and on that note can vouch
> for a healthy community and support which are much harder to add
> later).
>
> In short, I think it has the required maturity, and I'm in favor of
> merging soonish.
>
> > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay 
> > wrote:
> >
> >> Thank you all for the comments so far. I would follow the process as
> >> suggested by Davor and others in this thread.
> >>
> >> Ahmet
> >>
> >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández 
> >> wrote:
> >>
> >> > Hi
> >> >
> >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay wrote:
> >> > >
> >> > > tl;dr: I would like to start a discussion about merging python-sdk
> >> branch
> >> > > to master branch. Python SDK is mature enough and merging it to
> master
> >> > will
> >> > > accelerate its development and adoption.
> >> > >
> >> >
> >> > Good point, Ahmet!
> >> >
> >> > I've been closely following the development since it was imported in
> >> > June. For the prototypes I've implemented so far it works quite well;
> >> > I guess we'd just need to focus the next months on bringing support
> >> > for more runners.
> >> >
> >> > With a great effort from a lot of contributors(*), Python SDK [1] is
> now
> >> a
> >> > > mostly complete, tested, performant Python implementation of the
> Beam
> >> > > model. Since June, when we first started with Python SDK in Apache
> Beam
> >> > we
> >> > > have been continuously improving it.
> >> > >
> >> >
> >> > I wouldn't merge during the preparation of 0.5.0 release, but after
> that
> >> > could be a good time to merge back into master.
> >> >
> >> >
> >> > ** Python SDK currently supports:
> >> > >
> >> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> >> > etc.).
> >> > > * IO: There are extensible APIs for writing new bounded sources and
> >> > sinks.
> >> > > Implementations are provided for Text, Avro, BigQuery, and
> Datastore.
> >> > > * Runners: Python SDK has an extensible base runner module that
> allows
> >> > > building specific runners on top of it. The SDK comes with two
> pipeline
> >> > > runners: DirectRunner and DataflowRunner; and it is possible to add
> >> more.
> >> > > The existing runners are currently limited to bounded execution and
> >> > > otherwise equivalent to their Java SDK counterparts in
> functionality.
> >> > >
> >> >
> >> > What would be the effort of porting and maintaining parallel versions
> >> > of the Java runners? I guess I'd need to dig deeper into the model, but
> >> > this may represent a major effort for the project, right?
> >> >
> >>
> >> It is 

Re: [DISCUSS] Python SDK status and next steps

2017-01-20 Thread Ahmet Altay
Thank you Dan. Adding support for unbounded data is on the roadmap and it
will be added to Python SDK soon.

Thank you all again, I will start the official voting thread.

Thank you,
Ahmet

On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin 
wrote:

> I do not think that Python SDK yet meets the bar [1] for implementing the
> Beam model -- supporting Unbounded data is very important. That said, given
> the committed and sustained set of contributors, it generally makes sense
> to me to make an exception in anticipation of these features being fleshed
> out soon; including potentially new users/contributors that would arrive
> once in master.
>
> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>
> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay 
> wrote:
>
> > Thank you all for the comments so far. I would follow the process as
> > suggested by Davor and others in this thread.
> >
> > Ahmet
> >
> > On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández 
> > wrote:
> >
> > > Hi
> > >
> > > > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay wrote:
> > > >
> > > > tl;dr: I would like to start a discussion about merging python-sdk
> > branch
> > > > to master branch. Python SDK is mature enough and merging it to
> master
> > > will
> > > > accelerate its development and adoption.
> > > >
> > >
> > > Good point, Ahmet!
> > >
> > > > I've been closely following the development since it was imported in
> > > > June. For the prototypes I've implemented so far it works quite well;
> > > > I guess we'd just need to focus the next months on bringing support
> > > > for more runners.
> > >
> > > With a great effort from a lot of contributors(*), Python SDK [1] is
> now
> > a
> > > > mostly complete, tested, performant Python implementation of the Beam
> > > > model. Since June, when we first started with Python SDK in Apache
> Beam
> > > we
> > > > have been continuously improving it.
> > > >
> > >
> > > I wouldn't merge during the preparation of 0.5.0 release, but after
> that
> > > could be a good time to merge back into master.
> > >
> > >
> > > ** Python SDK currently supports:
> > > >
> > > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> > > etc.).
> > > > * IO: There are extensible APIs for writing new bounded sources and
> > > sinks.
> > > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > > > * Runners: Python SDK has an extensible base runner module that
> allows
> > > > building specific runners on top of it. The SDK comes with two
> pipeline
> > > > runners: DirectRunner and DataflowRunner; and it is possible to add
> > more.
> > > > The existing runners are currently limited to bounded execution and
> > > > otherwise equivalent to their Java SDK counterparts in functionality.
> > > >
> > >
> > > > What would be the effort of porting and maintaining parallel versions
> > > > of the Java runners? I guess I'd need to dig deeper into the model, but
> > > > this may represent a major effort for the project, right?
> > >
> >
> > It is somewhat higher for DirectRunner because DirectRunner also
> implements
> > the code for execution. It is not that high for DataflowRunner because
> the
> > base runner module has a lot of helpers with the right hooks for
> > implementing a generic runner. I would _expect_ the experience in general
> > would be similar to the latter.
> >
> >
> > >
> > >
> > >
> > > > > * Testing: Python SDK implements the ValidatesRunner test framework
> > > > > for implementing integration tests for current and future runners.
> > > > > There is unit test coverage for all modules, and a number of
> > > > > integration tests for validating existing runners.
> > > > * Documentation and examples: Documentation work has started on
> Python
> > > SDK.
> > > > Beam Programming Guide page has been updated to include Python [2].
> The
> > > > code comes with many ready to use examples and we are in a good place
> > to
> > > > start documenting those on the website.
> > > >
> > > > ** We are not done yet, next on the roadmap we have:
> > > >
> > > > * Streaming: Both of the existing runners lack support for streaming
> > > > execution, and currently there is work going on for adding streaming
> > > > support to DirectRunner [3].
> > > > > * Documentation: Filling in the rest of the Beam documentation with
> > > > > Python-SDK-specific information and examples.
> > > > * SDK consistency: Making Python SDK consistent with the Java SDK. We
> > > have
> > > > come a long way on this and have only a few items left [4].
> > > > * Beamifying: We have been working on removing Dataflow-specific
> > > references
> > > > both from the documentation and from the code. There is some work
> left,
> > > and
> > > > we are currently working on those as well [5].
> > > >
> > > > ** Steps and implications of merging to master:
> > > >
> > > > * Master 

Re: [DISCUSS] Python SDK status and next steps

2017-01-18 Thread Ahmet Altay
Thank you all for the comments so far. I would follow the process as
suggested by Davor and others in this thread.

Ahmet

On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández 
wrote:

> Hi
>
> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay 
> wrote:
> >
> > tl;dr: I would like to start a discussion about merging python-sdk branch
> > to master branch. Python SDK is mature enough and merging it to master
> will
> > accelerate its development and adoption.
> >
>
> Good point, Ahmet!
>
> I've been closely following the development since it was imported in June. For
> the prototypes I've implemented so far it works quite well; I guess we'd
> just need to focus the next months on bringing support for more runners.
>
> > With a great effort from a lot of contributors(*), Python SDK [1] is now a
> > mostly complete, tested, performant Python implementation of the Beam
> > model. Since June, when we first started with Python SDK in Apache Beam
> we
> > have been continuously improving it.
> >
>
> I wouldn't merge during the preparation of 0.5.0 release, but after that
> could be a good time to merge back into master.
>
>
> > ** Python SDK currently supports:
> >
> > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> etc.).
> > * IO: There are extensible APIs for writing new bounded sources and
> sinks.
> > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > * Runners: Python SDK has an extensible base runner module that allows
> > building specific runners on top of it. The SDK comes with two pipeline
> > runners: DirectRunner and DataflowRunner; and it is possible to add more.
> > The existing runners are currently limited to bounded execution and
> > otherwise equivalent to their Java SDK counterparts in functionality.
> >
>
> What would be the effort of porting and maintaining parallel versions of the
> Java runners? I guess I'd need to dig deeper into the model, but this may
> represent a major effort for the project, right?
>

It is somewhat higher for DirectRunner because DirectRunner also implements
the code for execution. It is not that high for DataflowRunner because the
base runner module has a lot of helpers with the right hooks for
implementing a generic runner. I would _expect_ the experience in general
would be similar to the latter.


>
>
>
> > * Testing: Python SDK implements the ValidatesRunner test framework for
> > implementing integration tests for current and future runners. There is
> > unit test coverage for all modules, and a number of integration tests for
> > validating existing runners.
> > * Documentation and examples: Documentation work has started on Python
> SDK.
> > Beam Programming Guide page has been updated to include Python [2]. The
> > code comes with many ready to use examples and we are in a good place to
> > start documenting those on the website.
> >
> > ** We are not done yet, next on the roadmap we have:
> >
> > * Streaming: Both of the existing runners lack support for streaming
> > execution, and currently there is work going on for adding streaming
> > support to DirectRunner [3].
> > * Documentation: Filling in the rest of the Beam documentation with
> > Python-SDK-specific information and examples.
> > * SDK consistency: Making Python SDK consistent with the Java SDK. We
> have
> > come a long way on this and have only a few items left [4].
> > * Beamifying: We have been working on removing Dataflow-specific
> references
> > both from the documentation and from the code. There is some work left,
> and
> > we are currently working on those as well [5].
> >
> > ** Steps and implications of merging to master:
> >
> > * Master branch is merged to python-sdk branch at regular intervals and
> the
> > last merge was on 12/22. All the past merges were uneventful because
> there
> > is a minimal overlap in modified files between branches. Integrating
> > python-sdk to master will similarly touch a small number of existing
> files.
> >
> > * Python SDK is using the same tools for building and testing. It is
> > already integrated with Maven, Jenkins and Travis. Specifically the
> impact
> > to the testing infrastructure would be:
> > - There will be two additional test configurations in Travis. Since
> Travis
> > runs all configurations in parallel there should not be a noticeable
> change
> > in the Travis run time.
> > - Jenkins pre-commit test will start running the Python SDK tests. It
> will
> > add an additional 5 minutes to the completion time of pre-commit test.
> > Historically Python SDK tests were not flaky and did not cause any random
> > failures.
> > - Jenkins Python post-commit test is already separated from the other
> > post-commit tests and will continue to exist. It would not change the
> > testing time for any other test.
> >
> > * The release process needs to be updated to accommodate releasing Python
> > artifacts. Python SDK would fit in the existing release schedule and
> could
> > be 

Re: [DISCUSS] Python SDK status and next steps

2017-01-17 Thread Sergio Fernández
Hi

On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay 
wrote:
>
> tl;dr: I would like to start a discussion about merging python-sdk branch
> to master branch. Python SDK is mature enough and merging it to master will
> accelerate its development and adoption.
>

Good point, Ahmet!

I've been closely following the development since it was imported in June. For
the prototypes I've implemented so far it works quite well; I guess we'd
just need to focus the next months on bringing support for more runners.

> With a great effort from a lot of contributors(*), Python SDK [1] is now a
> mostly complete, tested, performant Python implementation of the Beam
> model. Since June, when we first started with Python SDK in Apache Beam we
> have been continuously improving it.
>

I wouldn't merge during the preparation of 0.5.0 release, but after that
could be a good time to merge back into master.


> ** Python SDK currently supports:
>
> * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
> * IO: There are extensible APIs for writing new bounded sources and sinks.
> Implementations are provided for Text, Avro, BigQuery, and Datastore.
> * Runners: Python SDK has an extensible base runner module that allows
> building specific runners on top of it. The SDK comes with two pipeline
> runners: DirectRunner and DataflowRunner; and it is possible to add more.
> The existing runners are currently limited to bounded execution and
> otherwise equivalent to their Java SDK counterparts in functionality.
>

What would be the effort of porting and maintaining parallel versions of the
Java runners? I guess I'd need to dig deeper into the model, but this may
represent a major effort for the project, right?



> * Testing: Python SDK implements the ValidatesRunner test framework for
> implementing integration tests for current and future runners. There is unit
> test coverage for all modules, and a number of integration tests for
> validating existing runners.
> * Documentation and examples: Documentation work has started on Python SDK.
> Beam Programming Guide page has been updated to include Python [2]. The
> code comes with many ready to use examples and we are in a good place to
> start documenting those on the website.
>
> ** We are not done yet, next on the roadmap we have:
>
> * Streaming: Both of the existing runners lack support for streaming
> execution, and currently there is work going on for adding streaming
> support to DirectRunner [3].
> * Documentation: Filling in the rest of the Beam documentation with
> Python-SDK-specific information and examples.
> * SDK consistency: Making Python SDK consistent with the Java SDK. We have
> come a long way on this and have only a few items left [4].
> * Beamifying: We have been working on removing Dataflow-specific references
> both from the documentation and from the code. There is some work left, and
> we are currently working on those as well [5].
>
> ** Steps and implications of merging to master:
>
> * Master branch is merged to python-sdk branch at regular intervals and the
> last merge was on 12/22. All the past merges were uneventful because there
> is a minimal overlap in modified files between branches. Integrating
> python-sdk to master will similarly touch a small number of existing files.
>
> * Python SDK is using the same tools for building and testing. It is
> already integrated with Maven, Jenkins and Travis. Specifically the impact
> to the testing infrastructure would be:
> - There will be two additional test configurations in Travis. Since Travis
> runs all configurations in parallel there should not be a noticeable change
> in the Travis run time.
> - Jenkins pre-commit test will start running the Python SDK tests. It will
> add an additional 5 minutes to the completion time of pre-commit test.
> Historically Python SDK tests were not flaky and did not cause any random
> failures.
> - Jenkins Python post-commit test is already separated from the other
> post-commit tests and will continue to exist. It would not change the
> testing time for any other test.
>
> * The release process needs to be updated to accommodate releasing Python
> artifacts. Python SDK would fit in the existing release schedule and could
> be released along with the Java SDK. The additional steps would include:
> - Generating Python artifacts. This could be done with a single command
> using Maven today.
> - Publishing the artifacts to a central repository such as PyPI.
>

I'm more than happy to help with this. We purposely left some things open
when we added Maven support to the Python build.



> - Updating the release guide to reflect the changes above.
>
> * Users: There are existing users using the Python SDK. To give a rough
> estimate, a distribution of the Beam Python SDK had a total of 23K
> downloads in the past 6 months [6]. Some of those users are already engaged
> with the community (e.g. [7]). There might be an increased amount of
> engagement from the rest

Re: [DISCUSS] Python SDK status and next steps

2017-01-17 Thread Frances Perry
+1 to merging after 0.5.

It's on a great trajectory in terms of development and community.

On Tue, Jan 17, 2017 at 5:48 PM, Kenneth Knowles 
wrote:

> Seems reasonable, and the timeline Davor suggests makes a lot of sense.
>
> On Tue, Jan 17, 2017 at 3:59 PM, Lukasz Cwik 
> wrote:
>
> > I'm also for merging to master.
> >
> > On Tue, Jan 17, 2017 at 3:39 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > It makes sense to merge after 0.5.0 release.
> > >
> > > Good point Davor: +1
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 01/17/2017 03:34 PM, Davor Bonaci wrote:
> > >
> > >> +1. I think merging to master would be an awesome next step for the
> > >> Python SDK.
> > >>
> > >> And, thanks for a great summary of the current state, roadmap, and
> > >> impact to the project as a whole -- awesome!
> > >>
> > >> Process-wise, I'd suggest starting a formal vote once this discussion
> > >> seems to be trending towards a conclusion, and complete the merge as
> > >> soon as the next release (0.5.0) is cut. This would enable additional
> > >> time before 0.6.0 to figure out compliance, release process impact, etc.
> > >>
> > >> Great work everyone!
> > >>
> > >> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > >> wrote:
> > >>
> > >>> Hi
> > >>>
> > >>> I haven't tried the Python SDK recently, but you provided a clear "state
> > >>> of the art". Anyway, I'm in favor of merging things as quickly as
> > >>> possible (assuming it's in good shape in terms of build, test, ...): it
> > >>> would potentially grow the "external" contributions.
> > >>>
> > >>> So +1 from my side.
> > >>>
> > >>> Regards
> > >>> JB

Re: [DISCUSS] Python SDK status and next steps

2017-01-17 Thread Kenneth Knowles
Seems reasonable, and the timeline Davor suggests makes a lot of sense.

On Tue, Jan 17, 2017 at 3:59 PM, Lukasz Cwik 
wrote:

> I'm also for merging to master.
>
> On Tue, Jan 17, 2017 at 3:39 PM, Jean-Baptiste Onofré 
> wrote:
>
> > It makes sense to merge after 0.5.0 release.
> >
> > Good point Davor: +1
> >
> > Regards
> > JB
> >
> >
> > On 01/17/2017 03:34 PM, Davor Bonaci wrote:
> >
> >> +1. I think merging to master would be an awesome next step for the
> >> Python SDK.
> >>
> >> And, thanks for a great summary of the current state, roadmap, and
> >> impact to the project as a whole -- awesome!
> >>
> >> Process-wise, I'd suggest starting a formal vote once this discussion
> >> seems to be trending towards a conclusion, and complete the merge as
> >> soon as the next release (0.5.0) is cut. This would enable additional
> >> time before 0.6.0 to figure out compliance, release process impact, etc.
> >>
> >> Great work everyone!
> >>
> >> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >>> Hi
> >>>
> >>> I haven't tried the Python SDK recently, but you provided a clear "state
> >>> of the art". Anyway, I'm in favor of merging things as quickly as
> >>> possible (assuming it's in good shape in terms of build, test, ...): it
> >>> would potentially grow the "external" contributions.
> >>>
> >>> So +1 from my side.
> >>>
> >>> Regards
> >>> JB

Re: [DISCUSS] Python SDK status and next steps

2017-01-17 Thread Lukasz Cwik
I'm also for merging to master.

On Tue, Jan 17, 2017 at 3:39 PM, Jean-Baptiste Onofré 
wrote:

> It makes sense to merge after 0.5.0 release.
>
> Good point Davor: +1
>
> Regards
> JB
>
>
> On 01/17/2017 03:34 PM, Davor Bonaci wrote:
>
>> +1. I think merging to master would be an awesome next step for the Python
>> SDK.
>>
>> And, thanks for a great summary of the current state, roadmap, and impact
>> to the project as a whole -- awesome!
>>
>> Process-wise, I'd suggest starting a formal vote once this discussion
>> seems
>> to be trending towards a conclusion, and complete the merge as soon as the
>> next release (0.5.0) is cut. This would enable additional time before
>> 0.6.0
>> to figure out compliance, release process impact, etc.
>>
>> Great work everyone!
>>
>> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi
>>>
>>> I haven't tried the Python SDK recently, but you provided a clear "state of
>>> the art". Anyway, I'm in favor of merging things as quickly as possible
>>> (assuming it's in good shape in terms of build, test, ...): it would
>>> potentially grow the "external" contributions.
>>>
>>> So +1 from my side.
>>>
>>> Regards
>>> JB

Re: [DISCUSS] Python SDK status and next steps

2017-01-17 Thread Jean-Baptiste Onofré

It makes sense to merge after 0.5.0 release.

Good point Davor: +1

Regards
JB

On 01/17/2017 03:34 PM, Davor Bonaci wrote:

+1. I think merging to master would be an awesome next step for the Python
SDK.

And, thanks for a great summary of the current state, roadmap, and impact
to the project as a whole -- awesome!

Process-wise, I'd suggest starting a formal vote once this discussion seems
to be trending towards a conclusion, and complete the merge as soon as the
next release (0.5.0) is cut. This would enable additional time before 0.6.0
to figure out compliance, release process impact, etc.

Great work everyone!

On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré 
wrote:


Hi

I haven't tried the Python SDK recently, but you provided a clear "state of
the art". Anyway, I'm in favor of merging things as quickly as possible
(assuming it's in good shape in terms of build, test, ...): it would
potentially grow the "external" contributions.

So +1 from my side.

Regards
JB

On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay 
wrote:

Hi all,

tl;dr: I would like to start a discussion about merging python-sdk
branch
to master branch. Python SDK is mature enough and merging it to master
will
accelerate its development and adoption.

With a great effort from a lot of contributors(*), Python SDK [1] is
now a
mostly complete, tested, performant Python implementation of the Beam
model. Since June, when we first started with Python SDK in Apache Beam
we
have been continuously improving it.

** Python SDK currently supports:

* Model: All main concepts are present (ParDo, GroupByKey, Windowing
etc.).
* IO: There are extensible APIs for writing new bounded sources and
sinks.
Implementations are provided for Text, Avro, BigQuery, and Datastore.
* Runners: Python SDK has an extensible base runner module that allows
building specific runners on top of it. The SDK comes with two pipeline
runners: DirectRunner and DataflowRunner; and it is possible to add
more.
The existing runners are currently limited to bounded execution and
otherwise equivalent to their Java SDK counterparts in functionality.
* Testing: Python SDK implements the ValidatesRunner test framework for
implementing integration tests for current and future runners. There is
unit test coverage for all modules, and a number of integration tests for
validating existing runners.
* Documentation and examples: Documentation work has started on Python
SDK.
Beam Programming Guide page has been updated to include Python [2]. The
code comes with many ready to use examples and we are in a good place
to
start documenting those on the website.

** We are not done yet, next on the roadmap we have:

* Streaming: Both of the existing runners lack support for streaming
execution, and currently there is work going on for adding streaming
support to DirectRunner [3].
* Documentation: Filling in the rest of the Beam documentation with
Python SDK-specific information and examples.
* SDK consistency: Making Python SDK consistent with the Java SDK. We
have
come a long way on this and have only a few items left [4].
* Beamifying: We have been working on removing Dataflow-specific
references
both from the documentation and from the code. There is some work left,
and
we are currently working on those as well [5].

** Steps and implications of merging to master:

* Master branch is merged to python-sdk branch at regular intervals and
the
last merge was on 12/22. All the past merges were uneventful because
there
is a minimal overlap in modified files between branches. Integrating
python-sdk to master will similarly touch a small number of existing
files.

* Python SDK is using the same tools for building and testing. It is
already integrated with Maven, Jenkins and Travis. Specifically the
impact
to the testing infrastructure would be:
- There will be two additional test configurations in Travis. Since
Travis
runs all configurations in parallel there should not be a noticeable
change
in the Travis run time.
- Jenkins pre-commit test will start running the Python SDK tests. It
will
add an additional 5 minutes to the completion time of pre-commit test.
Historically Python SDK tests were not flaky and did not cause any
random
failures.
- Jenkins Python post-commit test is already separated from the other
post-commit tests and will continue to exist. It would not change the
testing time for any other test.

* The release process needs to be updated to accommodate releasing
Python
artifacts. Python SDK would fit in the existing release schedule and
could
be released along with the Java SDK. The additional steps would
include:
- Generating Python artifacts. This could be done with a single command
using Maven today.
- Publishing the artifacts to a central repository such as PyPI.
- Updating the release guide to reflect the changes above.

* Users: There are existing users using the Python SDK. To give a rough
estimate, a distribution of the Beam Python SDK 

Re: [DISCUSS] Python SDK status and next steps

2017-01-17 Thread Davor Bonaci
+1. I think merging to master would be an awesome next step for the Python
SDK.

And, thanks for a great summary of the current state, roadmap, and impact
to the project as a whole -- awesome!

Process-wise, I'd suggest starting a formal vote once this discussion seems
to be trending towards a conclusion, and complete the merge as soon as the
next release (0.5.0) is cut. This would enable additional time before 0.6.0
to figure out compliance, release process impact, etc.

Great work everyone!

On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré 
wrote:

> Hi
>
> I haven't tried the Python SDK recently, but you provided a clear "state of
> the art". Anyway, I'm in favor of merging things as quickly as possible
> (assuming it's in good shape in terms of build, test, ...): it would
> potentially grow the "external" contributions.
>
> So +1 from my side.
>
> Regards
> JB
>
> On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay 
> wrote:
> >Hi all,
> >
> >tl;dr: I would like to start a discussion about merging python-sdk
> >branch
> >to master branch. Python SDK is mature enough and merging it to master
> >will
> >accelerate its development and adoption.
> >
> >With a great effort from a lot of contributors(*), Python SDK [1] is
> >now a
> >mostly complete, tested, performant Python implementation of the Beam
> >model. Since June, when we first started with Python SDK in Apache Beam
> >we
> >have been continuously improving it.
> >
> >** Python SDK currently supports:
> >
> >* Model: All main concepts are present (ParDo, GroupByKey, Windowing
> >etc.).
> >* IO: There are extensible APIs for writing new bounded sources and
> >sinks.
> >Implementations are provided for Text, Avro, BigQuery, and Datastore.
> >* Runners: Python SDK has an extensible base runner module that allows
> >building specific runners on top of it. The SDK comes with two pipeline
> >runners: DirectRunner and DataflowRunner; and it is possible to add
> >more.
> >The existing runners are currently limited to bounded execution and
> >otherwise equivalent to their Java SDK counterparts in functionality.
> >* Testing: Python SDK implements the ValidatesRunner test framework for
> >implementing integration tests for current and future runners. There is
> >unit test coverage for all modules, and a number of integration tests for
> >validating existing runners.
> >* Documentation and examples: Documentation work has started on Python
> >SDK.
> >Beam Programming Guide page has been updated to include Python [2]. The
> >code comes with many ready to use examples and we are in a good place
> >to
> >start documenting those on the website.
> >
> >** We are not done yet, next on the roadmap we have:
> >
> >* Streaming: Both of the existing runners lack support for streaming
> >execution, and currently there is work going on for adding streaming
> >support to DirectRunner [3].
> >* Documentation: Filling in the rest of the Beam documentation with
> >Python SDK-specific information and examples.
> >* SDK consistency: Making Python SDK consistent with the Java SDK. We
> >have
> >come a long way on this and have only a few items left [4].
> >* Beamifying: We have been working on removing Dataflow-specific
> >references
> >both from the documentation and from the code. There is some work left,
> >and
> >we are currently working on those as well [5].
> >
> >** Steps and implications of merging to master:
> >
> >* Master branch is merged to python-sdk branch at regular intervals and
> >the
> >last merge was on 12/22. All the past merges were uneventful because
> >there
> >is a minimal overlap in modified files between branches. Integrating
> >python-sdk to master will similarly touch a small number of existing
> >files.
> >
> >* Python SDK is using the same tools for building and testing. It is
> >already integrated with Maven, Jenkins and Travis. Specifically the
> >impact
> >to the testing infrastructure would be:
> >- There will be two additional test configurations in Travis. Since
> >Travis
> >runs all configurations in parallel there should not be a noticeable
> >change
> >in the Travis run time.
> >- Jenkins pre-commit test will start running the Python SDK tests. It
> >will
> >add an additional 5 minutes to the completion time of pre-commit test.
> >Historically Python SDK tests were not flaky and did not cause any
> >random
> >failures.
> >- Jenkins Python post-commit test is already separated from the other
> >post-commit tests and will continue to exist. It would not change the
> >testing time for any other test.
> >
> >* The release process needs to be updated to accommodate releasing
> >Python
> >artifacts. Python SDK would fit in the existing release schedule and
> >could
> >be released along with the Java SDK. The additional steps would
> >include:
> >- Generating Python artifacts. This could be done with a single command
> >using Maven today.
> >- Publishing the artifacts to a central repository such as