Re: Release 0.6.0

2017-03-01 Thread Jean-Baptiste Onofré

Thanks Ahmet!

Regards
JB

On 03/02/2017 07:42 AM, Ahmet Altay wrote:

Sure, I can wait. To be clear, Thursday night in which time zone?

Thank you,
Ahmet

On Wed, Mar 1, 2017 at 10:38 PM, Jean-Baptiste Onofré wrote:

Hi Ahmet,

Can you wait until Thursday night? I am trying to merge BEAM-649.

Thanks!
Regards
JB

On 03/01/2017 07:23 PM, Ahmet Altay wrote:

Thank you. I will start working on it.

Ahmet

On Wed, Mar 1, 2017 at 9:03 AM, Aljoscha Krettek wrote:

I just closed the last blocking issue, we should be good to go now.

Sorry again for the hold-up.

On Tue, 28 Feb 2017 at 18:38 Ahmet Altay wrote:

Thank you all. I will wait for release blocking issues to be closed.

Sergio, thank you for the information. I will document the friction points
during this release process. Following the release we can start a
discussion about how to fix those.

Ahmet

On Tue, Feb 28, 2017 at 9:22 AM, Aljoscha Krettek wrote:

That was my mistake, sorry for that. I should have tagged [1] as a blocker
because leaking state is probably a bad idea. At least then people would be
aware and we could have discussed whether it is a blocker.

There is already an open PR for this now.

[1] https://issues.apache.org/jira/browse/BEAM-1517

On Tue, 28 Feb 2017 at 18:21 Jean-Baptiste Onofré wrote:

Regarding BEAM-649, it's not a release blocker, it's a good to have.

As I'm pretty close to the end of the Pull Request (hopefully tonight or
tomorrow), it's a "Good To Have".

Regards
JB

On 02/28/2017 06:09 PM, Davor Bonaci wrote:

Can we please use JIRA to tag potentially release-blocking issues? Anyone
can just set the 'Fix Versions' field of an open issue to the next scheduled
release -- and it becomes easily visible to everyone in the project.

In general, I'm not a fan of blocking releases for new functionality.
Rushing new features and a lack of baking time usually translates to bugs.
However, I think this time it is totally justified -- on a separate thread
we plan for this to be the last release before the "first stable release";
and picking the new features now will provide additional coverage for it.

So, +1, but please tag in JIRA.

On Tue, Feb 28, 2017 at 2:09 AM, Aljoscha Krettek <aljos...@apache.org> wrote:

I would like to finish these two:

https://issues.apache.org/jira/browse/BEAM-1036: Support for new State API
in FlinkRunner

https://issues.apache.org/jira/browse/BEAM-1116: Support for new Timer API
in Flink runner

Both of them are finished for the streaming runner, for the batch runner
I'm merging the code for the first right now and the second will not take
long.

There is also this: https://issues.apache.org/jira/browse/BEAM-1517: User
state in the Flink Streaming Runner is not garbage collected. It's not a
regression from 0.5.0 where we simply didn't have this feature but I'm
still somewhat uneasy about this.

On Tue, 28 Feb 2017 at 09:44 Jean-Baptiste Onofré wrote:

Fair enough.

I am also trying to merge https://github.com/apache/beam/pull/1739 asap.

Regards
JB

On 02/28/2017 09:34 AM, Amit Sela wrote:

I'd prefer we wait to merge https://github.com/apache/beam/pull/2050
Shouldn't take long now..

On Tue, Feb 28, 2017 at 10:00 AM Sergio Fernández <wik...@apache.org> wrote:

Sounds good!

Ahmet, note that the ASF currently has no infrastructure to stage Python
Release Candidates. Anyway we left unmanaged the Maven deploy lifecycle for
the Python SDK, but it should be discussed at some point.

On Mon, Feb 27, 2017 at 11:01 PM, Ahmet Altay wrote:

Hi all,

It's been about a month since the last release. I would like to propose
starting the next release. There are no release-blocking bugs in JIRA
[1]. Are there any release blocking issues I am missing?

Unless there is an objection I will volunteer to manage this release. This
will be the first release with Python content. In case there are issues
with that it might be easier for me to resolve and document those as part
of the release process.

Thank you,
Ahmet

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%200.6.0%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
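
The JIRA filter linked as [1] in the thread above is a percent-encoded JQL query that lists unresolved issues whose fix version is set to the release. As a minimal sketch, the same link could be rebuilt for another release as shown below. The helper name `blocker_query_url` is hypothetical and only for illustration; the base URL and the query text are taken from the link in the thread.

```python
from urllib.parse import quote, unquote

# Base of the JIRA search URL used in the thread's link [1].
JIRA_SEARCH = "https://issues.apache.org/jira/issues/?jql="


def blocker_query_url(project: str, fix_version: str) -> str:
    """Build a search URL for unresolved issues tagged with a fix version.

    Hypothetical helper; the JQL mirrors the query linked in the thread.
    """
    jql = (
        f"project = {project} AND resolution = Unresolved "
        f"AND fixVersion = {fix_version} "
        "ORDER BY due ASC, priority DESC, created ASC"
    )
    # safe="" percent-encodes every reserved character, including "=" and ",".
    return JIRA_SEARCH + quote(jql, safe="")


url = blocker_query_url("BEAM", "0.6.0")
print(url)
# Decoding the query part gives back the readable JQL.
print(unquote(url.split("jql=", 1)[1]))
```

Setting the 'Fix Versions' field on an issue is what makes it show up in such a query, which is why tagging in JIRA makes potential blockers visible to everyone.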

Re: Release 0.6.0

2017-03-01 Thread Jean-Baptiste Onofré

Pacific time is fine.

Regards
JB


Re: Release 0.6.0

2017-03-01 Thread Ahmet Altay
Sure, I can wait. To be clear, Thursday night in which time zone?

Thank you,
Ahmet


Re: First stable release: version designation?

2017-03-01 Thread Jean-Baptiste Onofré

Hi Davor,


From a Beam community perspective, 1.0.0 would make more sense. We have a 
fair number of people starting with Beam (without knowing Dataflow).


However, as the Dataflow SDK (the origin of Beam) was already at 1.0.0, 
2.0.0 could help avoid confusion for users coming to Beam from Dataflow.


I have a preference for 1.0.0 anyway, but I would understand starting 
from 2.0.0.


Regards
JB

On 03/01/2017 07:56 PM, Davor Bonaci wrote:

The first stable release is our next major project-wide goal; see
discussion in [1]. I've been referring to it as "the first stable release"
for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
make sure we have an unbiased discussion and a consensus-based decision on
this matter.

I think that now is the time to consider the appropriate designation for
our first stable release, and formally make a decision on it. Reasonable
choices could be "1.0.0" or "2.0.0", perhaps there are others.

1.0.0:
* It logically comes after the current series, 0.x.y.
* Most people would expect it, I suppose.
* A possible confusion between Dataflow SDKs and Beam SDKs carrying the
same number.

2.0.0:
* Follows the pattern some other projects have taken -- continuing their
version numbering scheme from their previous origin.
* Better communicates project's roots, and degree of maturity.
* May be unexpected to some users.

I'd invite everyone to share their thoughts and preferences -- names are
important and well correlated with success. Thanks!

Davor

[1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c6299aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Release 0.6.0

2017-03-01 Thread Jean-Baptiste Onofré

Hi Ahmet,

Can you wait until Thursday night? I am trying to merge BEAM-649.

Thanks!
Regards
JB

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Pipeline termination in the unified Beam model

2017-03-01 Thread Thomas Groh
+1

I think it's a fair claim that a PCollection is "done" when its watermark
reaches positive infinity, and then it's easy to claim that a Pipeline is
"done" when all of its PCollections are done. Completion is an especially
reasonable claim if we consider positive infinity to be an actual infinity
- so long as allowed lateness is a finite value, elements that arrive
when a watermark is at positive infinity will be "infinitely" late, and
thus can be dropped by the runner.

As an aside, this is only about "finishing because the pipeline is
complete" - it's unrelated to "finished because of an unrecoverable error"
or similar reasons pipelines can stop running, yes?
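
The lateness argument in the first paragraph above can be illustrated with a small model. This is only a sketch under stated assumptions, not Beam SDK code: `Window`, `collection_is_done` and `is_droppable` are hypothetical names, timestamps are plain floats in seconds, and the drop rule used here (watermark past the window end plus allowed lateness) is a simplification of what a runner would do.

```python
import math
from dataclasses import dataclass

POSITIVE_INFINITY = math.inf  # stand-in for the model's "+infinity" watermark


@dataclass(frozen=True)
class Window:
    """Hypothetical window type: only its event-time end matters here."""
    end: float


def collection_is_done(watermark: float) -> bool:
    """A PCollection is treated as done once its watermark is +infinity."""
    return watermark == POSITIVE_INFINITY


def is_droppable(window: Window, watermark: float, allowed_lateness: float) -> bool:
    """An element for `window` may be dropped once the watermark has passed
    the end of the window plus the (finite) allowed lateness."""
    return watermark > window.end + allowed_lateness


window = Window(end=1_000.0)
one_day = 24 * 60 * 60.0

# While the watermark is still inside the lateness horizon, data is kept.
print(is_droppable(window, watermark=1_500.0, allowed_lateness=one_day))
# At +infinity every finite lateness bound has been exceeded.
print(is_droppable(window, watermark=POSITIVE_INFINITY, allowed_lateness=one_day))
```

Because any finite `allowed_lateness` is smaller than an actual infinity, nothing that arrives after the watermark reaches +infinity can still be on time for a finite window, which is what makes "done" a safe claim.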



Pipeline termination in the unified Beam model

2017-03-01 Thread Eugene Kirpichov
Raising this onto the mailing list from
https://issues.apache.org/jira/browse/BEAM-849

The issue came up: what does it mean for a pipeline to finish, in the Beam
model?

Note that I am deliberately not talking about "batch" and "streaming"
pipelines, because this distinction does not exist in the model. Several
runners have batch/streaming *modes*, which implement the same semantics
(potentially different subsets: in batch mode typically a runner will
reject pipelines that have at least one unbounded PCollection) but in an
operationally different way. However we should define pipeline termination
at the level of the unified model, and then make sure that all runners in
all modes implement that properly.

One natural way is to say "a pipeline terminates when the output watermarks
of all of its PCollections progress to +infinity". (Note: this can be
generalized, I guess, to having partial executions of a pipeline: if you're
interested in the full contents of only some collections, then you wait
until only the watermarks of those collections progress to infinity)

A typical "batch" runner mode does not implement watermarks - we can think
of it as assigning watermark -infinity to an output of a transform that
hasn't started executing yet, and +infinity to output of a transform that
has finished executing. This is consistent with how such runners implement
termination in practice.

The Dataflow streaming runner additionally implements such termination for
the pipeline drain operation, which has two parts: 1) stop consuming input
from the sources, and 2) wait until all watermarks progress to infinity.

Let us fill the gap by making this part of the Beam model and declaring
that all runners should implement this behavior. This will give nice
properties, e.g.:
- A pipeline that has only bounded collections can be run by any runner in
any mode, with the same results and termination behavior (this is actually
my motivating example for raising this issue: I was running the Splittable
DoFn tests (SplittableDoFnTest.java) with the streaming Dataflow runner -
these tests produce only bounded collections - and noticed that they
wouldn't terminate even though all data was processed)
- It will be possible to implement pipelines that stream data for a while
and then eventually successfully terminate based on some condition. E.g. a
pipeline that watches a continuously growing file until it is marked
read-only, or a pipeline that reads a Kafka topic partition until it
receives a "poison pill" message. This seems handy.
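
The proposed termination rule can be sketched as a toy model. The code below is not part of the Beam SDK or of any runner: `batch_watermark` and `pipeline_terminated` are hypothetical names, collections are identified by plain strings, and treating a transform that is still running as -infinity is an assumption made here (the message only describes the not-started and finished cases).

```python
import math
from typing import Dict, Iterable, Optional

NEG_INF = -math.inf
POS_INF = math.inf


def batch_watermark(finished: bool) -> float:
    """Watermark a batch-mode runner can be thought of as assigning to the
    output of a transform: -infinity until it has finished, then +infinity.
    (Assumption: a transform that is still running also counts as -infinity.)
    """
    return POS_INF if finished else NEG_INF


def pipeline_terminated(
    watermarks: Dict[str, float],
    of_interest: Optional[Iterable[str]] = None,
) -> bool:
    """True when every watermark of interest has progressed to +infinity.

    With `of_interest` left as None, all collections are checked, which is
    the full-pipeline rule; passing a subset models a partial execution.
    """
    names = watermarks.keys() if of_interest is None else of_interest
    return all(watermarks[name] == POS_INF for name in names)


watermarks = {
    "read": batch_watermark(finished=True),
    "parse": batch_watermark(finished=True),
    "write": batch_watermark(finished=False),
}

print(pipeline_terminated(watermarks))                     # whole pipeline
print(pipeline_terminated(watermarks, ["read", "parse"]))  # partial execution

watermarks["write"] = batch_watermark(finished=True)
print(pipeline_terminated(watermarks))
```

In this model a drain corresponds to no longer feeding the sources and then polling `pipeline_terminated` until it returns True.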


Re: Beam File System in the Python SDK

2017-03-01 Thread Chamikara Jayalath
Great! Thanks Sourabh.

- Cham



Re: First stable release: version designation?

2017-03-01 Thread Ted Yu
The following explanation for adopting the 2.0 version should be put in the
release notes for the stable release.

Cheers

On Wed, Mar 1, 2017 at 2:03 PM, Dan Halperin wrote:

> A large set of Beam users will be coming from the pre-Apache technologies
> (aka Google Cloud Dataflow, Scio). Because Dataflow was 1.0 before Beam
> started, there is a lot of pre-existing documentation, Stack Overflow, etc.
> that refers to version 1.0 to mean what is now a year-and-a-half old
> release.
>
> I think starting Beam from "2.0.0" will be best for that set of users and
> frankly also new ones -- this will make it unambiguous whether referring to
> pre-Beam or Beam releases.
>
> I understand the 1.0 motivation -- it's cleaner in isolation -- but I think
> it would lead to long-term confusion in the user community.
>
> On Wed, Mar 1, 2017 at 1:11 PM, Ted Yu wrote:
>
> > +1 to what Jesse and Amit said.
> >
> > On Wed, Mar 1, 2017 at 12:32 PM, Amit Sela wrote:
> >
> > > I think 1.0.0 for a couple of reasons:
> > >
> > > * It makes sense coming after 0.X (+1 Jesse).
> > > * It is the FIRST stable release as a project, regardless of its roots.
> > > * while the SDK is definitely a 2.0.0, Beam is not made only of the SDK,
> > > and I hope we'll have more mileage with users running all sorts of runners
> > > in production before our 2.0.0 release.
> > >
> > > Amit.
> > >
> > > On Wed, Mar 1, 2017 at 10:25 PM Jesse Anderson wrote:
> > >
> > > I think 1.0 makes the most sense.
> > >
> > > On Wed, Mar 1, 2017, 10:57 AM Davor Bonaci wrote:
> > >
> > > > The first stable release is our next major project-wide goal; see
> > > > discussion in [1]. I've been referring to it as "the first stable release"
> > > > for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> > > > make sure we have an unbiased discussion and a consensus-based decision on
> > > > this matter.
> > > >
> > > > I think that now is the time to consider the appropriate designation for
> > > > our first stable release, and formally make a decision on it. Reasonable
> > > > choices could be "1.0.0" or "2.0.0", perhaps there are others.
> > > >
> > > > 1.0.0:
> > > > * It logically comes after the current series, 0.x.y.
> > > > * Most people would expect it, I suppose.
> > > > * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> > > > same number.
> > > >
> > > > 2.0.0:
> > > > * Follows the pattern some other projects have taken -- continuing their
> > > > version numbering scheme from their previous origin.
> > > > * Better communicates project's roots, and degree of maturity.
> > > > * May be unexpected to some users.
> > > >
> > > > I'd invite everyone to share their thoughts and preferences -- names are
> > > > important and well correlated with success. Thanks!
> > > >
> > > > Davor
> > > >
> > > > [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c6299aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E


Re: Beam File System in the Python SDK

2017-03-01 Thread Robert Bradshaw
Much needed! Added a couple of comments.



Beam File System in the Python SDK

2017-03-01 Thread Sourabh Bajaj
Hi,

BEAM-1441 is a ticket for implementing the Beam File System in the Python
SDK, similar to the one introduced in BEAM-59. I tried to take a pass on the
implementation in #2136 and followed the Java API as closely as possible.
Please feel free to give your comments here or on the pull request directly.

Reference: Original design doc



Thanks
Sourabh


Re: First stable release: version designation?

2017-03-01 Thread Dan Halperin
A large set of Beam users will be coming from the pre-Apache technologies
(aka Google Cloud Dataflow, Scio). Because Dataflow was 1.0 before Beam
started, there is a lot of pre-existing documentation, Stack Overflow, etc.
that refers to version 1.0 to mean what is now a year-and-a-half old
release.

I think starting Beam from "2.0.0" will be best for that set of users and
frankly also new ones -- this will make it unambiguous whether referring to
pre-Beam or Beam releases.

I understand the 1.0 motivation -- it's cleaner in isolation -- but I think
it would lead to long-term confusion in the user community.

On Wed, Mar 1, 2017 at 1:11 PM, Ted Yu  wrote:

> +1 to what Jesse and Amit said.
>
> On Wed, Mar 1, 2017 at 12:32 PM, Amit Sela  wrote:
>
> > I think 1.0.0 for a couple of reasons:
> >
> > * It makes sense coming after 0.X (+1 Jesse).
> > * It is the FIRST stable release as a project, regardless of its roots.
> > * While the SDK is definitely a 2.0.0, Beam is not made only of the SDK,
> > and I hope we'll have more mileage with users running all sorts of runners
> > in production before our 2.0.0 release.
> >
> > Amit.
> >
> > On Wed, Mar 1, 2017 at 10:25 PM Jesse Anderson 
> > wrote:
> >
> > I think 1.0 makes the most sense.
> >
> > On Wed, Mar 1, 2017, 10:57 AM Davor Bonaci  wrote:
> >
> > > The first stable release is our next major project-wide goal; see
> > > discussion in [1]. I've been referring to it as "the first stable
> > release"
> > > for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> > > make sure we have an unbiased discussion and a consensus-based decision
> > on
> > > this matter.
> > >
> > > I think that now is the time to consider the appropriate designation
> for
> > > our first stable release, and formally make a decision on it. Reasonable
> > > choices could be "1.0.0" or "2.0.0"; perhaps there are others.
> > >
> > > 1.0.0:
> > > * It logically comes after the current series, 0.x.y.
> > > * Most people would expect it, I suppose.
> > > * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> > > same number.
> > >
> > > 2.0.0:
> > > * Follows the pattern some other projects have taken -- continuing
> their
> > > version numbering scheme from their previous origin.
> > > * Better communicates project's roots, and degree of maturity.
> > > * May be unexpected to some users.
> > >
> > > I'd invite everyone to share their thoughts and preferences -- names
> are
> > > important and well correlated with success. Thanks!
> > >
> > > Davor
> > >
> > > [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae
> 973c629
> > > 9aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E
> > >
> >
>


Re: First stable release: version designation?

2017-03-01 Thread Ted Yu
+1 to what Jesse and Amit said.

On Wed, Mar 1, 2017 at 12:32 PM, Amit Sela  wrote:

> I think 1.0.0 for a couple of reasons:
>
> * It makes sense coming after 0.X (+1 Jesse).
> * It is the FIRST stable release as a project, regardless of its roots.
> * While the SDK is definitely a 2.0.0, Beam is not made only of the SDK,
> and I hope we'll have more mileage with users running all sorts of runners
> in production before our 2.0.0 release.
>
> Amit.
>
> On Wed, Mar 1, 2017 at 10:25 PM Jesse Anderson 
> wrote:
>
> I think 1.0 makes the most sense.
>
> On Wed, Mar 1, 2017, 10:57 AM Davor Bonaci  wrote:
>
> > The first stable release is our next major project-wide goal; see
> > discussion in [1]. I've been referring to it as "the first stable
> release"
> > for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> > make sure we have an unbiased discussion and a consensus-based decision
> on
> > this matter.
> >
> > I think that now is the time to consider the appropriate designation for
> > our first stable release, and formally make a decision on it. Reasonable
> > choices could be "1.0.0" or "2.0.0"; perhaps there are others.
> >
> > 1.0.0:
> > * It logically comes after the current series, 0.x.y.
> > * Most people would expect it, I suppose.
> > * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> > same number.
> >
> > 2.0.0:
> > * Follows the pattern some other projects have taken -- continuing their
> > version numbering scheme from their previous origin.
> > * Better communicates project's roots, and degree of maturity.
> > * May be unexpected to some users.
> >
> > I'd invite everyone to share their thoughts and preferences -- names are
> > important and well correlated with success. Thanks!
> >
> > Davor
> >
> > [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c629
> > 9aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E
> >
>


Apache Beam (virtual) contributor meeting @ Tue Mar 7, 2017

2017-03-01 Thread Davor Bonaci
Hi everyone,
Based on the high demand [1], let's try to organize a virtual contributor
meeting on Tuesday, March 7, 2017 at 15:00 UTC. For convenience, calendar
link [2] and an .ics file are attached.

I tried to accommodate as many time zones as possible, but I know it might
be hard for some of us at 7 AM on the US west coast or 11 PM in China.
Sorry about that.
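
The local times quoted above can be double-checked with a few lines of
Python. This is a small sketch that assumes fixed UTC offsets for that date:
US Pacific was still on standard time (UTC-8) on 7 March 2017, and China
uses UTC+8 year-round. The helper name `local_time` is made up for the
example.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for 7 March 2017 (see the note above).
PACIFIC = timezone(timedelta(hours=-8), "PST")
CHINA = timezone(timedelta(hours=8), "CST")

meeting_utc = datetime(2017, 3, 7, 15, 0, tzinfo=timezone.utc)


def local_time(moment, zone):
    """Format an aware datetime as HH:MM in the given zone."""
    return moment.astimezone(zone).strftime("%H:%M")


print("US west coast:", local_time(meeting_utc, PACIFIC))
print("China:", local_time(meeting_utc, CHINA))
```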

Let's use Google Hangouts as the video conferencing technology. I think we
may be limited to something like 30 participants, so I'd encourage any
co-located contributors to consider joining together (if appropriate).
Joining the meeting should be straightforward -- please find the link
within. No special requirements that I'm aware of.

Just to re-state the expectations:
* This is totally optional and informal.
* It is simply a chance for everyone to meet others and see the faces of
people we share a common passion with.
* No specific agenda.
* An open discussion on any topic of interest to the contributor community
is welcome -- please feel free to bring up any topics you care about.
* No formal discussion or decisions should be made.
* We'll keep notes and share them on the mailing list shortly after the
meeting.

Thanks -- and hope to see all of you there!

Davor

[1]
https://lists.apache.org/thread.html/baf057b81c5f6d4127abadac165d923a224d34438fe67b71d73743ad@%3Cdev.beam.apache.org%3E
[2]
https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=a3A2MzdhaWdhdjByNWRibzZrN2ZnOG1kMTAgZGF2b3JAZ29vZ2xlLmNvbQ&tmsrc=davor%40google.com
BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
CALSCALE:GREGORIAN
METHOD:REQUEST
BEGIN:VEVENT
DTSTART:20170307T15Z
DTEND:20170307T16Z
DTSTAMP:20170301T203852Z
ORGANIZER;CN=Davor Bonaci:mailto:da...@google.com
UID:kp637aigav0r5dbo6k7fg8m...@google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=OPT-PARTICIPANT;PARTSTAT=NEEDS-ACTION;RSVP=
 TRUE;CN=dev@beam.apache.org;X-NUM-GUESTS=0:mailto:dev@beam.apache.org
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;RSVP=TRUE
 ;CN=Davor Bonaci;X-NUM-GUESTS=0:mailto:da...@google.com
CREATED:20170301T203851Z
DESCRIPTION:Hi everyone\,\nBased on the high demand [1]\, let's try to orga
 nize a virtual contributor meeting on Tuesday\, March 7\, 2017 at 15:00 UTC
 .\n\nI tried to accommodate as many time zones as possible\, but I know it 
 might be hard for some of us at 7 AM on the US west coast or 11 PM in China
 . Sorry about that.\n\nLet's use Google Hangouts as the video conferencing 
 technology. I think we may be limited to something like 30 participants\, s
 o I'd encourage any co-located contributors to consider joining together (i
 f appropriate). Joining the meeting should be straightforward -- please fin
 d the link within. No special requirements that I'm aware of.\n\nJust to re
 -state the expectations:\n* This is totally optional and informal.\n* It is
  simply a chance for everyone to meet others and see the faces of people we
  share a common passion with.\n* No specific agenda.\n* An open discussion 
 on any topic of interest to the contributor community is\nwelcome -- please
  feel free to bring up any topics you care about.\n* No formal discussion o
 r decisions should to be made.\n* We'll keep notes and share them on the ma
 iling list shortly after the meeting.\n\nIf you are planning to attend\, pl
 ease RSVP on this invitation.\n\nThanks -- and hope to see all of you there
 !\n\nDavor\n\n[1] https://lists.apache.org/thread.html/baf057b81c5f6d4127ab
 adac165d923a224d34438fe67b71d73743ad@%3Cdev.beam.apache.org%3E\n\nThis even
 t has a Google Hangouts video call.\nJoin: https://plus.google.com/hangouts
 /_/google.com/beam-dev-mtg?hceid=ZGF2b3JAZ29vZ2xlLmNvbQ.kp637aigav0r5dbo6k7
 fg8md10&hs=121\n\nView your event at https://www.google.com/calendar/event?
 action=VIEW&eid=a3A2MzdhaWdhdjByNWRibzZrN2ZnOG1kMTAgZGV2QGJlYW0uYXBhY2hlLm9
 yZw&tok=MTYjZGF2b3JAZ29vZ2xlLmNvbTljMWI0YTllOWVjNDZhNTExM2M0YTdjZGZkYmVmMTE
 4ODAxY2IwOGM&ctz=America/Los_Angeles&hl=en.
LAST-MODIFIED:20170301T203851Z
LOCATION:Google Hangouts\; just join the video call specified within
SEQUENCE:0
STATUS:CONFIRMED
SUMMARY:Apache Beam (virtual) contributor meeting
TRANSP:OPAQUE
END:VEVENT
END:VCALENDAR


Re: First stable release: version designation?

2017-03-01 Thread Amit Sela
I think 1.0.0 for a couple of reasons:

* It makes sense coming after 0.X (+1 Jesse).
* It is the FIRST stable release as a project, regardless of its roots.
* While the SDK is definitely a 2.0.0, Beam is not made only of the SDK,
and I hope we'll have more mileage with users running all sorts of runners
in production before our 2.0.0 release.

Amit.

On Wed, Mar 1, 2017 at 10:25 PM Jesse Anderson 
wrote:

I think 1.0 makes the most sense.

On Wed, Mar 1, 2017, 10:57 AM Davor Bonaci  wrote:

> The first stable release is our next major project-wide goal; see
> discussion in [1]. I've been referring to it as "the first stable release"
> for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> make sure we have an unbiased discussion and a consensus-based decision on
> this matter.
>
> I think that now is the time to consider the appropriate designation for
> our first stable release, and formally make a decision on it. Reasonable
> choices could be "1.0.0" or "2.0.0"; perhaps there are others.
>
> 1.0.0:
> * It logically comes after the current series, 0.x.y.
> * Most people would expect it, I suppose.
> * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> same number.
>
> 2.0.0:
> * Follows the pattern some other projects have taken -- continuing their
> version numbering scheme from their previous origin.
> * Better communicates project's roots, and degree of maturity.
> * May be unexpected to some users.
>
> I'd invite everyone to share their thoughts and preferences -- names are
> important and well correlated with success. Thanks!
>
> Davor
>
> [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c629
> 9aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E
>


Re: First stable release: version designation?

2017-03-01 Thread Jesse Anderson
I think 1.0 makes the most sense.

On Wed, Mar 1, 2017, 10:57 AM Davor Bonaci  wrote:

> The first stable release is our next major project-wide goal; see
> discussion in [1]. I've been referring to it as "the first stable release"
> for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> make sure we have an unbiased discussion and a consensus-based decision on
> this matter.
>
> I think that now is the time to consider the appropriate designation for
> our first stable release, and formally make a decision on it. Reasonable
> choices could be "1.0.0" or "2.0.0"; perhaps there are others.
>
> 1.0.0:
> * It logically comes after the current series, 0.x.y.
> * Most people would expect it, I suppose.
> * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> same number.
>
> 2.0.0:
> * Follows the pattern some other projects have taken -- continuing their
> version numbering scheme from their previous origin.
> * Better communicates project's roots, and degree of maturity.
> * May be unexpected to some users.
>
> I'd invite everyone to share their thoughts and preferences -- names are
> important and well correlated with success. Thanks!
>
> Davor
>
> [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c629
> 9aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E
>


Re: Let's make Beam transforms comply with PTransform Style Guide

2017-03-01 Thread Eugene Kirpichov
Hey all,

First couple rounds of fixes are in. Thanks Aviem Zur for contributing
TextIO fixes and Dan Halperin for reviewing! One more fix by Reuven in
progress (https://github.com/apache/beam/pull/1927).

Follow https://issues.apache.org/jira/browse/BEAM-1353 and sub-issues for
the status.
Many of the changes are backwards-incompatible (though with only minor
changes in pipelines required). I'll make a separate announcement about
that on the users list now.

Call for help with the other sub-issues still stands! They are pretty
simple work items, but important, since this is a blocker for
declaring a stable Beam API.

On Wed, Feb 8, 2017 at 3:51 AM Jean-Baptiste Onofré  wrote:

Thanks Eugene.

I will tackle some Jira when back next week.

Regards
JB

On Feb 7, 2017, at 18:16, Eugene Kirpichov wrote:
>Hey all,
>
>I bit the bullet and audited all PTransform classes in Beam Java SDK
>and
>filed JIRA issues for all violations I could find.
>I linked all of them to the master JIRA issue
>https://issues.apache.org/jira/browse/BEAM-1353
>
>In general, all of these should be fixed before declaring Beam stable
>API.
>Would appreciate if some senior folks looked at the issues and
>confirmed
>that my suggested changes make sense.
>
>PRs very welcome :) (though I'll be gone for the next few weeks so I
>can't
>review right now)
>Many of these are very easy to fix (a few lines of code); some require
>a
>little more code, but as far as I can tell all of them are mechanical.
>
>
>On Tue, Jan 31, 2017 at 4:10 PM Eugene Kirpichov 
>wrote:
>
>> On Mon, Jan 30, 2017 at 7:56 PM Dan Halperin
>
>> wrote:
>>
>> On Mon, Jan 30, 2017 at 5:42 PM, Eugene Kirpichov <
>> kirpic...@google.com.invalid> wrote:
>>
>> > Hello,
>> >
>> > The PTransform Style Guide is live
>> > https://beam.apache.org/contribute/ptransform-style-guide/ - a
>natural
>> > next
>> > step is to audit Beam libraries for compliance and file JIRAs for
>places
>> > that need to be fixed. It'd be great to finish these cleanups
>before
>> > declaring Beam stable API.
>> >
>> > Please take a look and file JIRAs / post suggestions on this
>thread!
>> >
>> > I think it'll also make a great source of easy and useful work for
>new
>> > contributors.
>> >
>> > Some things I remember off the top of my head:
>> > - TextIO, KafkaIO use coders improperly - coders should not be used
>as a
>> > general-purpose byte parsing mechanism.
>> >
>>
>> Can you say more about Kafka? Kafka actually exports byte[] by
>default,
>> whereas Text files are String by default. So it does not seem nearly
>as
>> egregious for Kafka as it is for Text.
>>
>> Agreed that KafkaIO is less egregious, but it still has methods
>> withKeyCoder and withValueCoder - these should be replaced with
>something
>> that doesn't take Coder.
>>
>>
>>
>> - HadoopFileSource is not packaged as a PTransform
>> > - Some connectors, e.g. KafkaIO, should use AutoValue for their
>parameter
>> > builders, but don't
>> >
>>
>> Isn't AutoValue entirely an internal implementation detail that is
>not
>> exposed(*) to users? I think this is irrelevant to a stable API.
>>
>> Agreed - doesn't block stable API, but still a good thing to do
>because it
>> makes the code cleaner (for KafkaIO there's a long-standing PR that
>was
>> blocked on ratifying the style guide
>> https://github.com/apache/beam/pull/1048)
>>
>>
>>
>> (*) except that it makes transforms not able to be final, which is a
>> regression.
>>
>> I think AutoValue use should generally be considered *very* optional.
>In
>> transforms I author, I prefer not to use AutoValue because it makes
>the
>> code more complex and less readable.
>>
>> Yeah, guidance on when to use / not use AutoValue could be improved.
>I
>> think it makes a lot of sense when the transform has more than one or
>two
>> parameters or when the set of parameters can grow.
>>
>>
>>
>>
>> > - A few connectors improperly use
>> > - Some transforms expose their transform type as "Something.Bound"
>and
>> > "Something.Unbound", e.g. TextIO.Read.Bound - such names are banned
>> >
>>
>> "banned" is a strong word to use here. All of these are just
>> recommendations.
>>
>> In general yes; the goal of the style guide is to be the default,
>where if
>> you deviate from it, you should have a good reason. I don't think
>there
>> ever exists a good reason to name a transform Something.Bound/Unbound
>> though.
>>
>>
>>
>>
>> >
>> > I filed an umbrella JIRA
>https://issues.apache.org/jira/browse/BEAM-1353
>> > about
>> > making existing Beam transforms comply with the guide - let's
>crowdsource
>> > this!
>> >
>> > Thanks.
>> >
>>
>>
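
The two style points debated in this thread (immutable parameter holders in
the spirit of AutoValue, and plain parsing functions instead of Coders for
byte parsing) can be illustrated with a rough Python analogue. This is a
sketch only: `ReadConfig` and its `with_*` methods are hypothetical names
invented for the example and do not correspond to KafkaIO, TextIO or any
real Beam class.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ReadConfig:
    """Hypothetical immutable parameter holder (not a real Beam class)."""

    bootstrap_servers: str = ""
    topics: Tuple[str, ...] = ()
    # Parsing is an ordinary bytes -> value function, not a Coder.
    value_parser: Optional[Callable[[bytes], object]] = None

    def with_bootstrap_servers(self, servers):
        # Each with_* method returns a modified copy; self is untouched.
        return replace(self, bootstrap_servers=servers)

    def with_topics(self, *topics):
        return replace(self, topics=tuple(topics))

    def with_value_parser(self, parser):
        return replace(self, value_parser=parser)


base = ReadConfig().with_bootstrap_servers("localhost:9092")
events = base.with_topics("events").with_value_parser(
    lambda raw: raw.decode("utf-8"))
```

Because every `with_*` call returns a new object, `base` can be shared and
specialised several times without one use affecting another, and adding a
parameter later only means adding one field and one method.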


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-01 Thread Stephen Sisk
I wanted to follow up on this thread since I see some potential blocking
questions arising, and I'm trying to help Dipti along with her PR.

Dipti's PR[1] is currently written to put files into:
io/hadoop/inputformat

The recent changes to create hadoop-common created:
io/hadoop-common

This means that the overall structure if we take the HIFIO PR as-is would
be:
io/hadoop/inputformat - the HIFIO (copies of some code in hadoop-common and
hdfs, but no dependency on hadoop-common)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms
io/hdfs - FileInputFormat IO transforms - much shared code with
hadoop/inputformat.

Which I don't think is great b/c there's a common dir, but only some
directories use it, and there's lots of similar-but-slightly different code
in hadoop/inputformat and hdfsio. I don't believe anyone intends this to be
the final result.

After looking at the comments in this thread, I'd like to recommend the
following end-result:  (#1)
io/hadoop -  the HIFIO (dependency on  hadoop-common) - contains both
HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so contents of
hdfs and hadoop/inputformat)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms

To get there I propose the following steps:
1. finish current PR [1] with only renaming the containing module from
hadoop/inputformat -> hadoop, and taking dependency on hadoop-common
2. someone does cleanup to reconcile hdfs and hadoop directories, including
renaming the files so they make sense

I would also be fine with: (#2)
io/hadoop - container dir only
io/hadoop/common
io/hadoop/hbase
io/hadoop/inputformat

I think the downside of #2 is that it hides hbase, which I think deserves
to be top level.

Other comments:
It should be noted that when we have all modules use hadoop-common, we'll
be forcing all hadoop modules to have the same dependencies on hadoop. I
think this makes sense, but it's worth noting that it gives up the one
advantage of the "every hadoop IO transform has its own hadoop dependency"
approach.

On the naming discussion: I personally prefer "inputformat" as the name of
the directory, but I defer to the folks who know the hadoop community more.

S

[1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
[2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087


On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
dipti_dkulka...@persistent.com> wrote:

> Thank you  all for your inputs!
>
>
> -Original Message-
> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
> Sent: Friday, February 17, 2017 12:17 PM
> To: dev@beam.apache.org
> Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module
>
> Raghu, Amit -- +1 to your expertise :)
>
> On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela  wrote:
>
> > I agree with Dan on everything regarding HdfsFileSystem - it's super
> > convenient for users to use TextIO with HdfsFileSystem rather than
> > replacing the IO and also specifying the InputFormat type.
> >
> > I disagree on "HadoopIO" - I think that people who work with Hadoop
> > would find this name intuitive, and that's what's important.
> > Even more, and joining Raghu's comment, it is also recognized as
> > "compatible with Hadoop", so for example someone running a Beam
> > pipeline using the Spark runner on Amazon's S3 and wants to read/write
> > Hadoop sequence files would simply use HadoopIO and provide the
> > appropriate runtime dependencies (actually true for GS as well).
> >
> > On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi
> > 
> > wrote:
> >
> > > FileInputFormat is extremely widely used, pretty much all the file
> > > based input formats extend it. All of them call into to list the
> > > input files, split (with some tweaks on top of that). The special
> > > API ( *FileInputFormat.setMinInputSplitSize(job,
> > > desiredBundleSizeBytes)* ) is how the split size is normally
> > communicated.
> > > New IO can use the api directly.
> > >
> > > HdfsIO as implemented in Beam is not HDFS specific at all. There are
> > > no hdfs imports and HDFS name does not appear anywhere other than in
> > HdfsIO's
> > > own class and method names. AvroHdfsFileSource etc would work just
> > > as
> > well
> > > with new IO.
> > >
> > > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin
> >  > > >
> > > wrote:
> > >
> > > > (And I think renaming to HadoopIO doesn't make sense.
> > > > "InputFormat" is
> > > the
> > > > key component of the name -- it reads things that implement the
> > > InputFormat
> > > > interface. "Hadoop" means a lot more than that.)
> > > >
> > >
> > > Often 'IO' in Beam implies both sources and sinks. It might not be
> > > long before we might be supporting Hadoop OutputFormat as well. In
> > > addition HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can
> > > mean a lot of things depending on the context. In 'IO' context it
> > > might not be too
> > broad.
> > > Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
> > >

First stable release: version designation?

2017-03-01 Thread Davor Bonaci
The first stable release is our next major project-wide goal; see
discussion in [1]. I've been referring to it as "the first stable release"
for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
make sure we have an unbiased discussion and a consensus-based decision on
this matter.

I think that now is the time to consider the appropriate designation for
our first stable release, and formally make a decision on it. Reasonable
choices could be "1.0.0" or "2.0.0"; perhaps there are others.

1.0.0:
* It logically comes after the current series, 0.x.y.
* Most people would expect it, I suppose.
* A possible confusion between Dataflow SDKs and Beam SDKs carrying the
same number.

2.0.0:
* Follows the pattern some other projects have taken -- continuing their
version numbering scheme from their previous origin.
* Better communicates project's roots, and degree of maturity.
* May be unexpected to some users.

I'd invite everyone to share their thoughts and preferences -- names are
important and well correlated with success. Thanks!

Davor

[1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c629
9aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E


Re: Release 0.6.0

2017-03-01 Thread Ahmet Altay
Thank you. I will start working on it.

Ahmet

On Wed, Mar 1, 2017 at 9:03 AM, Aljoscha Krettek 
wrote:

> I just closed the last blocking issue, we should be good to go now.
>
> Sorry again for the hold-up.
>
> On Tue, 28 Feb 2017 at 18:38 Ahmet Altay  wrote:
>
> Thank you all. I will wait for release blocking issues to be closed.
>
> Sergio, thank you for the information. I will document the friction points
> during this release process. Following the release we can start a
> discussion about how to fix those.
>
> Ahmet
>
> On Tue, Feb 28, 2017 at 9:22 AM, Aljoscha Krettek 
> wrote:
>
> > That was my mistake, sorry for that. I should have tagged [1] as a
> blocker
> > because leaking state is probably a bad idea. At least then people would
> be
> > aware and we could have discussed whether it is a blocker.
> >
> > There is already an open PR for this now.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-1517
> >
> > On Tue, 28 Feb 2017 at 18:21 Jean-Baptiste Onofré 
> wrote:
> >
> > > Regarding BEAM-649, it's not a release blocker, it's a good to have.
> > >
> > > As I'm pretty close to the end of the Pull Request (hopefully tonight
> or
> > > tomorrow), it's a "Good To Have".
> > >
> > > Regards
> > > JB
> > >
> > > On 02/28/2017 06:09 PM, Davor Bonaci wrote:
> > > > Can we please use JIRA to tag potentially release-blocking issues?
> > Anyone
> > > > can just add a 'Fix Versions' field of an open issue to the next
> > > scheduled
> > > > release -- and it becomes easily visible to everyone in the project.
> > > >
> > > > In general, I'm not a fan of blocking releases for new functionality.
> > > > Rushing new features and a lack of baking time usually translates to
> > > bugs.
> > > > However, I think this time it is totally justified -- on a separate
> > > thread
> > > > we plan for this to be the last release before the "first stable
> > > release";
> > > > and picking the new features now will provide additional coverage for
> > it.
> > > >
> > > > So, +1, but please tag in JIRA.
> > > >
> > > > On Tue, Feb 28, 2017 at 2:09 AM, Aljoscha Krettek <
> aljos...@apache.org
> > >
> > > > wrote:
> > > >
> > > >> I would like to finish these two:
> > > >> https://issues.apache.org/jira/browse/BEAM-1036: Support for new
> > State
> > > API
> > > >> in FlinkRunner
> > > >> https://issues.apache.org/jira/browse/BEAM-1116: Support for new
> > Timer
> > > API
> > > >> in Flink runner
> > > >>
> > > >> Both of them are finished for the streaming runner, for the batch
> > runner
> > > >> I'm merging the code for the first right now and the second will not
> > > take
> > > >> long.
> > > >>
> > > >> There is also this: https://issues.apache.org/jira/browse/BEAM-1517
> :
> > > User
> > > >> state in the Flink Streaming Runner is not garbage collected. It's
> > not a
> > > >> regression from 0.5.0 where we simply didn't have this feature but
> I'm
> > > >> still somewhat uneasy about this.
> > > >>
> > > >>
> > > >> On Tue, 28 Feb 2017 at 09:44 Jean-Baptiste Onofré 
> > > wrote:
> > > >>
> > > >>> Fair enough.
> > > >>>
> > > >>> I also try to merge https://github.com/apache/beam/pull/1739 asap.
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 02/28/2017 09:34 AM, Amit Sela wrote:
> > >  I'd prefer we wait to merge https://github.com/apache/
> > beam/pull/2050
> > >  Shouldn't take long now..
> > > 
> > >  On Tue, Feb 28, 2017 at 10:00 AM Sergio Fernández <
> > wik...@apache.org>
> > > >>> wrote:
> > > 
> > > > Sounds good!
> > > >
> > > > Ahmet, notice ASF has not current infrastructure to stage Python
> > > >> Release
> > > > Candidates. Anyway we left unmanaged the Maven deploy lifecycle
> for
> > > >> the
> > > > Python SDK, but it should be discussed at some point.
> > > >
> > > >
> > > >
> > > > On Mon, Feb 27, 2017 at 11:01 PM, Ahmet Altay
> > > >>  > > 
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> It's been about a month since the last release. I would like
> > propose
> > > >> starting the next release. There are no releasing blocking bugs
> in
> > > >> JIRA
> > > >> [1]. Are there any release blocking issues I am missing?
> > > >>
> > > >> Unless there is an objection I will volunteer to manage this
> > > release.
> > > > This
> > > >> will be the first release with Python content. In case there are
> > > >> issues
> > > >> with that it might be easier for me to resolve and document
> those
> > as
> > > >>> part
> > > >> of the release process.
> > > >>
> > > >> Thank you,
> > > >> Ahmet
> > > >>
> > > >> [1]
> > > >> https://issues.apache.org/jira/issues/?jql=project%20%
> > > >> 3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%
> > > >> 20fixVersion%20%3D%200.6.0%20ORDER%20BY%20due%20ASC%2C%
> > > >> 20priority%20DESC%2C%20created%20ASC
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Sergio Fer

Re: Enforcer Rule- JDK1.7 for Beam

2017-03-01 Thread Stephen Sisk
I would rather not turn off "all compiler warnings are errors" in a module
unless we absolutely have to, since I think compiler warnings give us
useful information.

I wanted to talk about a structure that might work for what Dan is
suggesting and anticipate some questions that Dipti might have for the beam
community.

Assumptions I'm working under: (I haven't worked with enforcer before, so
please correct me if I'm wrong)
*  java enforcer runs on all the code currently being used (ie,
compile/test-compile goals would produce different sets of classes to be
examined, and if the phase requested by the user is "verify" then
it'll look at both) - the enforcer plugin's
* java enforcer can be turned off on a per module level (by changing the
settings for the enforcer in that module's pom )
* java enforcer can be turned on or off for a given module via a profile
(by putting the settings change for enforcer in the profile in that module)

1. One naive approach would be to have the enforcer off by default and only
enable the enforcer rule in the HIFIO module when a particular profile
(jdk-version-enforcer?) is enabled. We would then only run that profile
when specifying only the compile goal on the command line. This would imply
that the  community would need to add to the (travis or jenkins?) test runs
a configuration that runs that separate profile. Are we okay with adding
another test run like that/is that a pattern we like others to follow? This
would mean that if there's an enforcer break in the HIFIO module, we likely
wouldn't catch it until those test runs, since it's unlikely most
developers will run the tests on their own.

2. Move HIFIO's cassandra tests into a separate child module of the HIFIO
which has enforcer set to JDK1.8

3. Configure enforcer in the HIFIO  module to just use JDK1.8

My assumption is that if #2 (module based) would work, that the beam
community would prefer it over #1 (profile based) since it's fewer test
runs and less surprising, and prefer it over #3 (HIFIO is only 1.8) since
we don't want HIFIO to become 1.8 only (under #3 it wouldn't be restricted
and should still work fine on 1.7, but could break without us knowing.)

It seems to me like it would be preferable to change enforcer's rules in
the cassandra module instead of disabling all compiler warnings (again,
this is on the assumption that that is possible.)

S
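
To make "compliant with JDK 1.7" concrete: every compiled .class file
records the class-file version it targets, where major version 51
corresponds to Java 7 and 52 to Java 8. The sketch below shows how a check
of that kind could look. It is an assumption about the general mechanism,
not a description of how the enforcer rule in Beam's parent POM is
actually implemented, and the function names are made up.

```python
import struct

# Class-file major versions: 51 is Java 7, 52 is Java 8.
JAVA_7_MAJOR = 51


def class_major_version(data):
    """Return the major version from the first 8 bytes of a .class file."""
    # Header layout: u4 magic, u2 minor_version, u2 major_version (big-endian).
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major


def violates_java7(data):
    """True if the class was compiled for something newer than Java 7."""
    return class_major_version(data) > JAVA_7_MAJOR
```

Applied to every class inside a dependency's jar, a check like this would
flag the Java 8 builds of Cassandra and Elasticsearch mentioned in this
thread, regardless of whether they are used in test scope.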

On Mon, Feb 27, 2017 at 9:18 PM, Dipti Kulkarni <
dipti_dkulka...@persistent.com> wrote:

> Sure , I am going to investigate this. Will update the thread shortly.
>
> -Original Message-
> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
> Sent: Monday, February 27, 2017 11:32 PM
> To: dev@beam.apache.org
> Subject: Re: Enforcer Rule- JDK1.7 for Beam
>
> I think there are a few separable questions:
>
> 1. Can the module itself be used with Java7 only, and can we enforce this?
> 2. When testing, can we use Java 8 dependencies and how?
>
> I think that the automated enforcements are useful to ensure property #1,
> assuming we want the module to work in Java7. Not enforcing it means we're
> at risk of losing it. So I'm not a fan of simply disabling these checks.
>
> However, it should be possible to use java8 code as well. Maybe it goes in
> a profile? Maybe it goes into a separate module, maybe just a separate
> profile?
>
> Can you investigate?
>
> On Mon, Feb 27, 2017 at 9:23 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Dipti,
> >
> > I had the same "issue" on CassandraIO.
> >
> > If you take a look on:
> >
> > https://github.com/apache/beam/pull/592/files
> >
> > you will see that I use  to
> > remove the default -Werror causing issue with Java8 dependencies.
> >
> > Regards
> > JB
> >
> >
> > On 02/27/2017 06:16 PM, Dipti Kulkarni wrote:
> >
> >> Hi all,
> >>
> >> While working on HadoopInputFormatIO, when I rebased my code to fetch
> >> the latest from Beam repo, I see an Enforcer rule added to the parent
> > >> POM to ensure that all dependencies used are compliant with JDK 1.7. I
> > >> have written tests for Cassandra and Elasticsearch to test if HIFIO
> > >> works ok to read data from these sources using their respective
> > >> inputformat classes. For these tests, I am using the latest versions
> > >> of ES and Cassandra, which need dependencies working on JDK 1.8.
> > >> These dependencies are marked as test scope, and my builds continue
> > >> to fail due to the Banned Dependency error.  Is it mandatory for the
> >> tests also to comply with the enforcer rule of JDK 1.7? Or is there
> >> any way we can override the JDK compliance for Tests? Thoughts?
> >>
> >> -Dipti
> >>
> >>

Re: Next major milestone: first stable release

2017-03-01 Thread Ismaël Mejía
Just added the two I mentioned in my previous message. Thanks Davor.

On Wed, Mar 1, 2017 at 6:27 PM, Aljoscha Krettek 
wrote:

> On it!
>
> On Wed, 1 Mar 2017 at 18:17 Davor Bonaci  wrote:
>
> > We've now moved the discussion into the content of the first stable
> > release.
> >
> > I've created a version in JIRA called "First stable release". I'd like to
> > invite everyone to triage JIRA issues you care about, and assign "Fix
> > Versions" field to "First stable release" to mark the issue blocking for
> > the first stable release. This creates a project-wide burndown list and
> we
> > can track our progress towards the goal.
> >
> > I'll try make a pass over as many JIRA issues as possible over the next
> day
> > or two, but it would be great if everyone, particularly component leads
> in
> > JIRA, take a pass too!
> >
> > On Wed, Mar 1, 2017 at 2:51 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Yes, fully agree.
> > >
> > > As far as I understood/know, BEAM-59 is targeted for Beam 1.0 (it's
> what
> > > we discussed with Pei and Davor).
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 03/01/2017 11:39 AM, Ismaël Mejía wrote:
> > >
> > >> Also joining a bit late, I agree with Amit, HDFS improvements are a
> > really
> > >> good thing to have before the stable release. I will also add the
> > >> IOChannelFactory refactorings to support things like
> > Read.from(“hdfs://”)
> > >> aka BEAM-59.
> > >>
> > >> In the worse case particular IOs can still be marked as experimental
> to
> > >> show users that they can still evolve, even after the first ‘stable’
> > >> release, the part that we have to pay more attention is not to break
> the
> > >> core SDK. And the question about Data Locality (BEAM-673) is where I
> am
> > >> afraid that we can have some breaking changes because there is not a
> way
> > >> from the IOs (Source/Sink) to send ‘a hint’ to the runner about Data
> > >> Locality (please correct me if I am wrong). And this even if not
> > supported
> > >> in the first stable release by any runner, would be a really great
> thing
> > >> to
> > >> have and I think this is a good moment to do it, to avoid breaking any
> > >> IO/runner signature because of new methods.
> > >>
> > >> What do the others think ?
> > >> Ismaël
> > >>
> > >>
> > >>
> > >> On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela 
> > wrote:
> > >>
> > >> Joining in just a bit late, I'll be quick and say that IMHO the SDK is
> > >>> mature enough and so my only point to add is *HDFS support*.
> > >>> I think that in terms of adoption we have to support HDFS as a
> > >>> "first-class
> > >>> citizen" via the FileSystem API, and provide data locality (batch) on
> > top
> > >>> of it - it serves not only HDFS, but other eco-system IOs such as
> > HBase.
> > >>> From my experience with talking to people and companies, most are
> > running
> > >>> batch in production with some streaming POC or even production use,
> but
> > >>> batch still takes most of production work. If we give them the same
> > >>> production results, with the Beam API, we can on-board them faster
> and
> > >>> make
> > >>> it easier for them to adopt streaming as well.
> > >>>
> > >>> Thanks,
> > >>> Amit
> > >>>
> > >>> On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci 
> wrote:
> > >>>
> > >>> Alright -- sounds like we have a consensus to proceed with the first
> > 
> > >>> stable
> > >>>
> >  release after 0.6.0, targeting end of March / early April. I'll kick
> > off
> >  separate threads for specific decisions we need to make.
> > 
> >  On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek <
> > aljos...@apache.org>
> >  wrote:
> > 
> >  I think we're ready for this! The public APIs are in very good
> shape,
> > > especially now that we have the new DoFn, user facing state and
> > timers
> > >
> >  and
> > 
> > > splittable DoFn. Not all Runners support the more advanced features
> > but
> > >
> >  we
> > 
> > > can work on this after a stable release and there are enough
> runners
> > >
> >  that
> > >>>
> >  support a large part of the features.
> > >
> > > Best,
> > > Aljoscha
> > >
> > > On Thu, 23 Feb 2017 at 06:15 Kenneth Knowles
>  > >
> > > wrote:
> > >
> > > On Wed, Feb 22, 2017 at 5:35 PM, Chamikara Jayalath <
> > >>
> > > chamik...@apache.org>
> > >
> > >> wrote:
> > >>
> > >>>
> > >>> I think, this point applies to Python SDK as well (though as you
> > >>>
> > >> mentioned,
> > >>
> > >>> API hiding in Python is a mere convention (prefix with
> underscore)
> > >>>
> > >> not
> > 
> > > enforced. We already have mechanism for marking APIs as deprecated
> > >>>
> > >> which
> > >
> > >> might be useful here:
> > >>> https://github.com/apache/beam/blob/master/sdks/python/
> > >>> apache_beam/utils/annotations.py
> > >>>
> > >>> - Cham
> > >>>
> > >>>
> > >> 

Re: Next major milestone: first stable release

2017-03-01 Thread Aljoscha Krettek
On it!

On Wed, 1 Mar 2017 at 18:17 Davor Bonaci  wrote:

> We've now moved the discussion into the content of the first stable
> release.
>
> I've created a version in JIRA called "First stable release". I'd like to
> invite everyone to triage JIRA issues you care about, and assign "Fix
> Versions" field to "First stable release" to mark the issue blocking for
> the first stable release. This creates a project-wide burndown list and we
> can track our progress towards the goal.
>
> I'll try make a pass over as many JIRA issues as possible over the next day
> or two, but it would be great if everyone, particularly component leads in
> JIRA, take a pass too!
>
> On Wed, Mar 1, 2017 at 2:51 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Yes, fully agree.
> >
> > As far as I understood/know, BEAM-59 is targeted for Beam 1.0 (it's what
> > we discussed with Pei and Davor).
> >
> > Regards
> > JB
> >
> >
> > On 03/01/2017 11:39 AM, Ismaël Mejía wrote:
> >
> >> Also joining a bit late, I agree with Amit, HDFS improvements are a
> really
> >> good thing to have before the stable release. I will also add the
> >> IOChannelFactory refactorings to support things like
> Read.from(“hdfs://”)
> >> aka BEAM-59.
> >>
> >> In the worse case particular IOs can still be marked as experimental to
> >> show users that they can still evolve, even after the first ‘stable’
> >> release, the part that we have to pay more attention is not to break the
> >> core SDK. And the question about Data Locality (BEAM-673) is where I am
> >> afraid that we can have some breaking changes because there is not a way
> >> from the IOs (Source/Sink) to send ‘a hint’ to the runner about Data
> >> Locality (please correct me if I am wrong). And this even if not
> supported
> >> in the first stable release by any runner, would be a really great thing
> >> to
> >> have and I think this is a good moment to do it, to avoid breaking any
> >> IO/runner signature because of new methods.
> >>
> >> What do the others think ?
> >> Ismaël
> >>
> >>
> >>
> >> On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela 
> wrote:
> >>
> >> Joining in just a bit late, I'll be quick and say that IMHO the SDK is
> >>> mature enough and so my only point to add is *HDFS support*.
> >>> I think that in terms of adoption we have to support HDFS as a
> >>> "first-class
> >>> citizen" via the FileSystem API, and provide data locality (batch) on
> top
> >>> of it - it serves not only HDFS, but other eco-system IOs such as
> HBase.
> >>> From my experience with talking to people and companies, most are
> running
> >>> batch in production with some streaming POC or even production use, but
> >>> batch still takes most of production work. If we give them the same
> >>> production results, with the Beam API, we can on-board them faster and
> >>> make
> >>> it easier for them to adopt streaming as well.
> >>>
> >>> Thanks,
> >>> Amit
> >>>
> >>> On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci  wrote:
> >>>
> >>> Alright -- sounds like we have a consensus to proceed with the first
> 
> >>> stable
> >>>
>  release after 0.6.0, targeting end of March / early April. I'll kick
> off
>  separate threads for specific decisions we need to make.
> 
>  On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek <
> aljos...@apache.org>
>  wrote:
> 
>  I think we're ready for this! The public APIs are in very good shape,
> > especially now that we have the new DoFn, user facing state and
> timers
> >
>  and
> 
> > splittable DoFn. Not all Runners support the more advanced features
> but
> >
>  we
> 
> > can work on this after a stable release and there are enough runners
> >
>  that
> >>>
>  support a large part of the features.
> >
> > Best,
> > Aljoscha
> >
> > On Thu, 23 Feb 2017 at 06:15 Kenneth Knowles  >
> > wrote:
> >
> > On Wed, Feb 22, 2017 at 5:35 PM, Chamikara Jayalath <
> >>
> > chamik...@apache.org>
> >
> >> wrote:
> >>
> >>>
> >>> I think, this point applies to Python SDK as well (though as you
> >>>
> >> mentioned,
> >>
> >>> API hiding in Python is a mere convention (prefix with underscore)
> >>>
> >> not
> 
> > enforced. We already have mechanism for marking APIs as deprecated
> >>>
> >> which
> >
> >> might be useful here:
> >>> https://github.com/apache/beam/blob/master/sdks/python/
> >>> apache_beam/utils/annotations.py
> >>>
> >>> - Cham
> >>>
> >>>
> >> Perhaps an explicit @public annotation would fit. I could imagine
> >>
> > easily
> 
> > generating a spec to check against from such annotations, though
> >>
> > tooling
> 
> > is
> >
> >> secondary to documentation.
> >>
> >> Kenn
> >>
> >>
> >
> 
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.co

Re: Next major milestone: first stable release

2017-03-01 Thread Davor Bonaci
We've now moved the discussion into the content of the first stable release.

I've created a version in JIRA called "First stable release". I'd like to
invite everyone to triage JIRA issues you care about, and set the "Fix
Versions" field to "First stable release" to mark the issue as blocking for
the first stable release. This creates a project-wide burndown list and we
can track our progress towards the goal.

I'll try to make a pass over as many JIRA issues as possible over the next day
or two, but it would be great if everyone, particularly component leads in
JIRA, took a pass too!

On Wed, Mar 1, 2017 at 2:51 AM, Jean-Baptiste Onofré 
wrote:

> Yes, fully agree.
>
> As far as I understood/know, BEAM-59 is targeted for Beam 1.0 (it's what
> we discussed with Pei and Davor).
>
> Regards
> JB
>
>
> On 03/01/2017 11:39 AM, Ismaël Mejía wrote:
>
>> Also joining a bit late, I agree with Amit, HDFS improvements are a really
>> good thing to have before the stable release. I will also add the
>> IOChannelFactory refactorings to support things like Read.from(“hdfs://”)
>> aka BEAM-59.
>>
>> In the worse case particular IOs can still be marked as experimental to
>> show users that they can still evolve, even after the first ‘stable’
>> release, the part that we have to pay more attention is not to break the
>> core SDK. And the question about Data Locality (BEAM-673) is where I am
>> afraid that we can have some breaking changes because there is not a way
>> from the IOs (Source/Sink) to send ‘a hint’ to the runner about Data
>> Locality (please correct me if I am wrong). And this even if not supported
>> in the first stable release by any runner, would be a really great thing
>> to
>> have and I think this is a good moment to do it, to avoid breaking any
>> IO/runner signature because of new methods.
>>
>> What do the others think ?
>> Ismaël
>>
>>
>>
>> On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela  wrote:
>>
>> Joining in just a bit late, I'll be quick and say that IMHO the SDK is
>>> mature enough and so my only point to add is *HDFS support*.
>>> I think that in terms of adoption we have to support HDFS as a
>>> "first-class
>>> citizen" via the FileSystem API, and provide data locality (batch) on top
>>> of it - it serves not only HDFS, but other eco-system IOs such as HBase.
>>> From my experience with talking to people and companies, most are running
>>> batch in production with some streaming POC or even production use, but
>>> batch still takes most of production work. If we give them the same
>>> production results, with the Beam API, we can on-board them faster and
>>> make
>>> it easier for them to adopt streaming as well.
>>>
>>> Thanks,
>>> Amit
>>>
>>> On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci  wrote:
>>>
>>> Alright -- sounds like we have a consensus to proceed with the first

>>> stable
>>>
 release after 0.6.0, targeting end of March / early April. I'll kick off
 separate threads for specific decisions we need to make.

 On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek 
 wrote:

 I think we're ready for this! The public APIs are in very good shape,
> especially now that we have the new DoFn, user facing state and timers
>
 and

> splittable DoFn. Not all Runners support the more advanced features but
>
 we

> can work on this after a stable release and there are enough runners
>
 that
>>>
 support a large part of the features.
>
> Best,
> Aljoscha
>
> On Thu, 23 Feb 2017 at 06:15 Kenneth Knowles 
> wrote:
>
> On Wed, Feb 22, 2017 at 5:35 PM, Chamikara Jayalath <
>>
> chamik...@apache.org>
>
>> wrote:
>>
>>>
>>> I think, this point applies to Python SDK as well (though as you
>>>
>> mentioned,
>>
>>> API hiding in Python is a mere convention (prefix with underscore)
>>>
>> not

> enforced. We already have mechanism for marking APIs as deprecated
>>>
>> which
>
>> might be useful here:
>>> https://github.com/apache/beam/blob/master/sdks/python/
>>> apache_beam/utils/annotations.py
>>>
>>> - Cham
>>>
>>>
>> Perhaps an explicit @public annotation would fit. I could imagine
>>
> easily

> generating a spec to check against from such annotations, though
>>
> tooling

> is
>
>> secondary to documentation.
>>
>> Kenn
>>
>>
>

>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Travis retest-this-please magic

2017-03-01 Thread Davor Bonaci
Use your best judgement. Travis right now provides multi-JDK,
multi-platform coverage not available in Jenkins. If the change is not
sensitive to that, it is probably reasonable to proceed.

On Wed, Mar 1, 2017 at 9:01 AM, Amit Sela  wrote:

> +1
> Can we merge PRs without waiting for Travis as long as it's not working ?
>
> On Wed, Mar 1, 2017 at 6:52 PM Davor Bonaci  wrote:
>
> > It cannot be done at this time.
> >
> > We should really move all Travis coverage into Jenkins and completely
> > deprecate Travis. I know Jason is looking into that ;-)
> >
> > On Wed, Mar 1, 2017 at 3:51 AM, Amit Sela  wrote:
> >
> > > Hi all,
> > >
> > > Recently I've encountered PRs where everything was green in Jenkins but
> > > Travis was stuck and didn't execute.
> > > > I couldn't (as the committer/reviewer) do the same "retest this please"
> > > > magic we apply to Jenkins, and I don't know of a way to do this
> > > > in Travis.
> > > > I know that on "my" Travis I can "Restart Build", but I'm not sure
> > > > contributors can do so on theirs, and I couldn't (on someone else's PR).
> > > >
> > > > Does anyone know how we can make this easier?
> > >
> > > Appreciate the help.
> > >
> > > Amit
> > >
> >
>


Re: Release 0.6.0

2017-03-01 Thread Aljoscha Krettek
I just closed the last blocking issue, we should be good to go now.

Sorry again for the hold-up.

On Tue, 28 Feb 2017 at 18:38 Ahmet Altay  wrote:

Thank you all. I will wait for release blocking issues to be closed.

Sergio, thank you for the information. I will document the friction points
during this release process. Following the release we can start a
discussion about how to fix those.

Ahmet

On Tue, Feb 28, 2017 at 9:22 AM, Aljoscha Krettek 
wrote:

> That was my mistake, sorry for that. I should have tagged [1] as a blocker
> because leaking state is probably a bad idea. At least then people would
be
> aware and we could have discussed whether it is a blocker.
>
> There is already an open PR for this now.
>
> [1] https://issues.apache.org/jira/browse/BEAM-1517
>
> On Tue, 28 Feb 2017 at 18:21 Jean-Baptiste Onofré  wrote:
>
> > Regarding BEAM-649, it's not a release blocker, it's a good to have.
> >
> > As I'm pretty close to the end of the Pull Request (hopefully tonight or
> > tomorrow), it's a "Good To Have".
> >
> > Regards
> > JB
> >
> > On 02/28/2017 06:09 PM, Davor Bonaci wrote:
> > > Can we please use JIRA to tag potentially release-blocking issues?
> Anyone
> > > can just add a 'Fix Versions' field of an open issue to the next
> > scheduled
> > > release -- and it becomes easily visible to everyone in the project.
> > >
> > > In general, I'm not a fan of blocking releases for new functionality.
> > > Rushing new features and a lack of baking time usually translates to
> > bugs.
> > > However, I think this time it is totally justified -- on a separate
> > thread
> > > we plan for this to be the last release before the "first stable
> > release";
> > > and picking the new features now will provide additional coverage for
> it.
> > >
> > > So, +1, but please tag in JIRA.
> > >
> > > On Tue, Feb 28, 2017 at 2:09 AM, Aljoscha Krettek  >
> > > wrote:
> > >
> > >> I would like to finish these two:
> > >> https://issues.apache.org/jira/browse/BEAM-1036: Support for new
> State
> > API
> > >> in FlinkRunner
> > >> https://issues.apache.org/jira/browse/BEAM-1116: Support for new
> Timer
> > API
> > >> in Flink runner
> > >>
> > >> Both of them are finished for the streaming runner, for the batch
> runner
> > >> I'm merging the code for the first right now and the second will not
> > take
> > >> long.
> > >>
> > >> There is also this: https://issues.apache.org/jira/browse/BEAM-1517:
> > User
> > >> state in the Flink Streaming Runner is not garbage collected. It's
> not a
> > >> regression from 0.5.0 where we simply didn't have this feature but
I'm
> > >> still somewhat uneasy about this.
> > >>
> > >>
> > >> On Tue, 28 Feb 2017 at 09:44 Jean-Baptiste Onofré 
> > wrote:
> > >>
> > >>> Fair enough.
> > >>>
> > >>> I also try to merge https://github.com/apache/beam/pull/1739 asap.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 02/28/2017 09:34 AM, Amit Sela wrote:
> >  I'd prefer we wait to merge https://github.com/apache/
> beam/pull/2050
> >  Shouldn't take long now..
> > 
> >  On Tue, Feb 28, 2017 at 10:00 AM Sergio Fernández <
> wik...@apache.org>
> > >>> wrote:
> > 
> > > Sounds good!
> > >
> > > > Ahmet, note that the ASF currently has no infrastructure to stage Python
> > > > Release Candidates. Anyway, we left the Maven deploy lifecycle unmanaged for
> > > > the Python SDK, but it should be discussed at some point.
> > >
> > >
> > >
> > > On Mon, Feb 27, 2017 at 11:01 PM, Ahmet Altay
> > >>  > 
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > > >> It's been about a month since the last release. I would like to propose
> > > >> starting the next release. There are no release-blocking bugs in JIRA
> > > >> [1]. Are there any release-blocking issues I am missing?
> > >>
> > >> Unless there is an objection I will volunteer to manage this
> > release.
> > > This
> > >> will be the first release with Python content. In case there are
> > >> issues
> > >> with that it might be easier for me to resolve and document those
> as
> > >>> part
> > >> of the release process.
> > >>
> > >> Thank you,
> > >> Ahmet
> > >>
> > >> [1]
> > >> https://issues.apache.org/jira/issues/?jql=project%20%
> > >> 3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%
> > >> 20fixVersion%20%3D%200.6.0%20ORDER%20BY%20due%20ASC%2C%
> > >> 20priority%20DESC%2C%20created%20ASC
> > >>
> > >
> > >
> > >
> > > --
> > > Sergio Fernández
> > > Partner Technology Manager
> > > Redlink GmbH
> > > > m: +43 6602747925
> > > e: sergio.fernan...@redlink.co
> > > w: http://redlink.co
> > >
> > 
> > >>>
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> jbono...@apache.org
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > 

Re: Travis retest-this-please magic

2017-03-01 Thread Amit Sela
+1
Can we merge PRs without waiting for Travis as long as it's not working ?

On Wed, Mar 1, 2017 at 6:52 PM Davor Bonaci  wrote:

> It cannot be done at this time.
>
> We should really move all Travis coverage into Jenkins and completely
> deprecate Travis. I know Jason is looking into that ;-)
>
> On Wed, Mar 1, 2017 at 3:51 AM, Amit Sela  wrote:
>
> > Hi all,
> >
> > Recently I've encountered PRs where everything was green in Jenkins but
> > Travis was stuck and didn't execute.
> > > I couldn't (as the committer/reviewer) do the same "retest this please"
> > > magic we apply to Jenkins, and I don't know of a way to do this
> > > in Travis.
> > > I know that on "my" Travis I can "Restart Build", but I'm not sure
> > > contributors can do so on theirs, and I couldn't (on someone else's PR).
> > >
> > > Does anyone know how we can make this easier?
> >
> > Appreciate the help.
> >
> > Amit
> >
>


Re: Travis retest-this-please magic

2017-03-01 Thread Davor Bonaci
It cannot be done at this time.

We should really move all Travis coverage into Jenkins and completely
deprecate Travis. I know Jason is looking into that ;-)

On Wed, Mar 1, 2017 at 3:51 AM, Amit Sela  wrote:

> Hi all,
>
> Recently I've encountered PRs where everything was green in Jenkins but
> Travis was stuck and didn't execute.
> I couldn't (as the committer/reviewer) do the same "retest this please"
> magic we apply to Jenkins, and I don't know of a way to do this
> in Travis.
> I know that on "my" Travis I can "Restart Build", but I'm not sure
> contributors can do so on theirs, and I couldn't (on someone else's PR).
>
> Does anyone know how we can make this easier?
>
> Appreciate the help.
>
> Amit
>


Re: Performance Testing Next Steps

2017-03-01 Thread Aljoscha Krettek
Thanks for writing this and taking care of this, Jason!

I'm afraid I also cannot add anything except that I'm excited to see some
results from this.

On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles  wrote:

Just got a chance to look this over. I don't have anything to add, but I'm
pretty excited to follow this project. Have the JIRAs been filed since you
shared the doc?

On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
jasonkus...@google.com.invalid> wrote:

> Hey all, just wanted to pop this up again for people -- if anyone has
> thoughts on performance testing please feel welcome to chime in. :)
>
> On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster 
> wrote:
>
> > Hi all,
> >
> > I've written up a doc on next steps for getting performance testing up
> and
> > running for Beam. I'd love to hear from people -- there's a fair amount
> of
> > work encapsulated in here, but the end result is that we have a
> performance
> > testing system which we can use for benchmarking all aspects of Beam,
> which
> > would be really exciting. Looking forward to your thoughts.
> >
> > https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOz
> > ph5FnL2DhaRDz0/edit?ts=58a78e73
> >
> > Best,
> >
> > Jason
> >
> > --
> > ---
> > Jason Kuster
> > Apache Beam / Google Cloud Dataflow
> >
>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Travis retest-this-please magic

2017-03-01 Thread Amit Sela
Hi all,

Recently I've encountered PRs where everything was green in Jenkins but
Travis was stuck and didn't execute.
I couldn't (as the committer/reviewer) do the same "retest this please"
magic we apply to Jenkins, and I don't know of a way to do this
in Travis.
I know that on "my" Travis I can "Restart Build", but I'm not sure
contributors can do so on theirs, and I couldn't (on someone else's PR).

Does anyone know how we can make this easier?

Appreciate the help.

Amit


Re: Next major milestone: first stable release

2017-03-01 Thread Jean-Baptiste Onofré

Yes, fully agree.

As far as I understood/know, BEAM-59 is targeted for Beam 1.0 (it's what 
we discussed with Pei and Davor).


Regards
JB

On 03/01/2017 11:39 AM, Ismaël Mejía wrote:

Also joining a bit late, I agree with Amit, HDFS improvements are a really
good thing to have before the stable release. I will also add the
IOChannelFactory refactorings to support things like Read.from(“hdfs://”)
aka BEAM-59.

In the worse case particular IOs can still be marked as experimental to
show users that they can still evolve, even after the first ‘stable’
release, the part that we have to pay more attention is not to break the
core SDK. And the question about Data Locality (BEAM-673) is where I am
afraid that we can have some breaking changes because there is not a way
from the IOs (Source/Sink) to send ‘a hint’ to the runner about Data
Locality (please correct me if I am wrong). And this even if not supported
in the first stable release by any runner, would be a really great thing to
have and I think this is a good moment to do it, to avoid breaking any
IO/runner signature because of new methods.

What do the others think ?
Ismaël



On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela  wrote:


Joining in just a bit late, I'll be quick and say that IMHO the SDK is
mature enough and so my only point to add is *HDFS support*.
I think that in terms of adoption we have to support HDFS as a "first-class
citizen" via the FileSystem API, and provide data locality (batch) on top
of it - it serves not only HDFS, but other eco-system IOs such as HBase.
From my experience with talking to people and companies, most are running
batch in production with some streaming POC or even production use, but
batch still takes most of production work. If we give them the same
production results, with the Beam API, we can on-board them faster and make
it easier for them to adopt streaming as well.

Thanks,
Amit

On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci  wrote:


Alright -- sounds like we have a consensus to proceed with the first stable
release after 0.6.0, targeting end of March / early April. I'll kick off
separate threads for specific decisions we need to make.

On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek 
wrote:


I think we're ready for this! The public APIs are in very good shape,
especially now that we have the new DoFn, user facing state and timers and
splittable DoFn. Not all Runners support the more advanced features but we
can work on this after a stable release and there are enough runners that
support a large part of the features.

Best,
Aljoscha

On Thu, 23 Feb 2017 at 06:15 Kenneth Knowles 
wrote:


On Wed, Feb 22, 2017 at 5:35 PM, Chamikara Jayalath <

chamik...@apache.org>

wrote:


I think, this point applies to Python SDK as well (though as you mentioned,
API hiding in Python is a mere convention (prefix with underscore) not
enforced. We already have mechanism for marking APIs as deprecated which
might be useful here:
https://github.com/apache/beam/blob/master/sdks/python/
apache_beam/utils/annotations.py

- Cham



Perhaps an explicit @public annotation would fit. I could imagine easily
generating a spec to check against from such annotations, though tooling is
secondary to documentation.

Kenn











--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Next major milestone: first stable release

2017-03-01 Thread Ismaël Mejía
Also joining a bit late, I agree with Amit, HDFS improvements are a really
good thing to have before the stable release. I will also add the
IOChannelFactory refactorings to support things like Read.from(“hdfs://”)
aka BEAM-59.

In the worst case, particular IOs can still be marked as experimental to
show users that they can still evolve, even after the first ‘stable’
release; the part that we have to pay more attention to is not breaking the
core SDK. The question about Data Locality (BEAM-673) is where I am
afraid that we can have some breaking changes, because there is no way
for the IOs (Source/Sink) to send ‘a hint’ to the runner about Data
Locality (please correct me if I am wrong). Even if this is not supported
in the first stable release by any runner, it would be a really great thing to
have, and I think this is a good moment to do it, to avoid breaking any
IO/runner signature because of new methods.

What do the others think?
Ismaël



On Tue, Feb 28, 2017 at 6:29 PM, Amit Sela  wrote:

> Joining in just a bit late, I'll be quick and say that IMHO the SDK is
> mature enough and so my only point to add is *HDFS support*.
> I think that in terms of adoption we have to support HDFS as a "first-class
> citizen" via the FileSystem API, and provide data locality (batch) on top
> of it - it serves not only HDFS, but other eco-system IOs such as HBase.
> From my experience with talking to people and companies, most are running
> batch in production with some streaming POC or even production use, but
> batch still takes most of production work. If we give them the same
> production results, with the Beam API, we can on-board them faster and make
> it easier for them to adopt streaming as well.
>
> Thanks,
> Amit
>
> On Tue, Feb 28, 2017 at 7:12 PM Davor Bonaci  wrote:
>
> > Alright -- sounds like we have a consensus to proceed with the first
> stable
> > release after 0.6.0, targeting end of March / early April. I'll kick off
> > separate threads for specific decisions we need to make.
> >
> > On Thu, Feb 23, 2017 at 6:07 AM, Aljoscha Krettek 
> > wrote:
> >
> > > I think we're ready for this! The public APIs are in very good shape,
> > > especially now that we have the new DoFn, user facing state and timers
> > and
> > > splittable DoFn. Not all Runners support the more advanced features but
> > we
> > > can work on this after a stable release and there are enough runners
> that
> > > support a large part of the features.
> > >
> > > Best,
> > > Aljoscha
> > >
> > > On Thu, 23 Feb 2017 at 06:15 Kenneth Knowles 
> > > wrote:
> > >
> > > > On Wed, Feb 22, 2017 at 5:35 PM, Chamikara Jayalath <
> > > chamik...@apache.org>
> > > > wrote:
> > > > >
> > > > > I think, this point applies to Python SDK as well (though as you
> > > > mentioned,
> > > > > API hiding in Python is a mere convention (prefix with underscore)
> > not
> > > > > enforced. We already have mechanism for marking APIs as deprecated
> > > which
> > > > > might be useful here:
> > > > > https://github.com/apache/beam/blob/master/sdks/python/
> > > > > apache_beam/utils/annotations.py
> > > > >
> > > > > - Cham
> > > > >
> > > >
> > > > Perhaps an explicit @public annotation would fit. I could imagine
> > easily
> > > > generating a spec to check against from such annotations, though
> > tooling
> > > is
> > > > secondary to documentation.
> > > >
> > > > Kenn
> > > >
> > >
> >
>
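
The explicit @public annotation idea quoted above, together with the underscore
convention Cham mentions, can be sketched roughly as follows. This is a
hypothetical illustration and not code from apache_beam/utils/annotations.py:
the `public` decorator, the `_PUBLIC_API` registry, `api_spec`, and the example
functions are all made-up names.

```python
import inspect

# Hypothetical registry mapping qualified names to their signatures.
_PUBLIC_API = {}


def public(obj):
    """Mark a function as public API and record its signature."""
    name = f"{obj.__module__}.{obj.__qualname__}"
    _PUBLIC_API[name] = str(inspect.signature(obj))
    return obj


@public
def read_from(path, validate=True):
    """Example public function (illustrative only)."""
    return path


def _internal_helper():
    """Private by convention only: the leading underscore is not enforced."""
    return None


def api_spec():
    """Return a sorted, checkable listing of everything marked @public."""
    return sorted(f"{name}{sig}" for name, sig in _PUBLIC_API.items())
```

A spec generated this way could be stored and compared between releases to
detect accidental changes to the public surface, which is the kind of tooling
Kenn describes as secondary to documentation.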