Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Stas Levin
+1!

I think it's a very cool way to abstract away the batch vs. streaming
dissonance from the Beam model.

It does require that practitioners be *educated* to think this way as well.
I believe that nowadays the terms "batch" and "streaming" are so deeply
rooted that they play a key role in the users' mental model. For example,
these terms are often employed to reason about whether
"pipeline.waitUntilFinish(...)" is expected to ever return (batch - yes,
streaming - not so much).

The approach Eugene advocates turns the question of whether an execution of
a pipeline ever finishes into a property of the pipeline (i.e., its
sources), instead of a property of the runner (i.e., whether it runs as
batch or streaming).

This sounds like something that makes the world a better place, and may
help address some of the major points discussed in
https://issues.apache.org/jira/browse/BEAM-849.
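
For concreteness, a minimal sketch of the orchestration pattern in question
(the pipeline contents below are made up; Pipeline, PipelineResult and
waitUntilFinish() are the Beam Java SDK APIs):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.PipelineResult;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.transforms.Create;

  public class TerminationSketch {
    public static void main(String[] args) {
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
      Pipeline p = Pipeline.create(options);

      // A bounded source: under the proposed model, whether the pipeline can
      // ever finish is a property of its sources, not of the runner's mode.
      p.apply(Create.of("a", "b", "c"));

      // Returns on any runner, in any mode, once every output watermark has
      // advanced to +infinity (or the job fails / is cancelled).
      PipelineResult result = p.run();
      PipelineResult.State state = result.waitUntilFinish();
      System.out.println("Pipeline terminated in state: " + state);
    }
  }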


On Thu, Mar 2, 2017 at 4:03 AM Thomas Groh  wrote:

> +1
>
> I think it's a fair claim that a PCollection is "done" when its watermark
> reaches positive infinity, and then it's easy to claim that a Pipeline is
> "done" when all of its PCollections are done. Completion is an especially
> reasonable claim if we consider positive infinity to be an actual infinity
> - so long as allowed lateness is a finite value, elements that arrive
> whenever a watermark is at positive infinity will be "infinitely" late, and
> thus can be dropped by the runner.
>
> As an aside, this is only about "finishing because the pipeline is
> complete" - it's unrelated to "finished because of an unrecoverable error"
> or similar reasons pipelines can stop running, yes?
>
> On Wed, Mar 1, 2017 at 5:54 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Raising this onto the mailing list from
> > https://issues.apache.org/jira/browse/BEAM-849
> >
> > The issue came up: what does it mean for a pipeline to finish, in the
> Beam
> > model?
> >
> > Note that I am deliberately not talking about "batch" and "streaming"
> > pipelines, because this distinction does not exist in the model. Several
> > runners have batch/streaming *modes*, which implement the same semantics
> > (potentially different subsets: in batch mode typically a runner will
> > reject pipelines that have at least one unbounded PCollection) but in an
> > operationally different way. However we should define pipeline
> termination
> > at the level of the unified model, and then make sure that all runners in
> > all modes implement that properly.
> >
> > One natural way is to say "a pipeline terminates when the output
> watermarks
> > of all of its PCollections progress to +infinity". (Note: this can be
> > generalized, I guess, to having partial executions of a pipeline: if
> you're
> > interested in the full contents of only some collections, then you wait
> > until only the watermarks of those collections progress to infinity)
> >
> > A typical "batch" runner mode does not implement watermarks - we can
> think
> > of it as assigning watermark -infinity to an output of a transform that
> > hasn't started executing yet, and +infinity to output of a transform that
> > has finished executing. This is consistent with how such runners
> implement
> > termination in practice.
> >
> > Dataflow streaming runner additionally implements such termination for
> > pipeline drain operation: it has 2 parts: 1) stop consuming input from
> the
> > sources, and 2) wait until all watermarks progress to infinity.
> >
> > Let us fill the gap by making this part of the Beam model and declaring
> > that all runners should implement this behavior. This will give nice
> > properties, e.g.:
> > - A pipeline that has only bounded collections can be run by any runner
> in
> > any mode, with the same results and termination behavior (this is
> actually
> my motivating example for raising this issue: I was running Splittable
> > DoFn tests
> >  >
> core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
> > with the streaming Dataflow runner - these tests produce only bounded
> > collections - and noticed that they wouldn't terminate even though all
> data
> > was processed)
> > - It will be possible to implement pipelines that stream data for a while
> > and then eventually successfully terminate based on some condition. E.g.
> a
> > pipeline that watches a continuously growing file until it is marked
> > read-only, or a pipeline that reads a Kafka topic partition until it
> > receives a "poison pill" message. This seems handy.
> >
>


Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Jean-Baptiste Onofré

+1

Good idea !!

Regards
JB

On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:

Raising this onto the mailing list from
https://issues.apache.org/jira/browse/BEAM-849

The issue came up: what does it mean for a pipeline to finish, in the Beam
model?

Note that I am deliberately not talking about "batch" and "streaming"
pipelines, because this distinction does not exist in the model. Several
runners have batch/streaming *modes*, which implement the same semantics
(potentially different subsets: in batch mode typically a runner will
reject pipelines that have at least one unbounded PCollection) but in an
operationally different way. However we should define pipeline termination
at the level of the unified model, and then make sure that all runners in
all modes implement that properly.

One natural way is to say "a pipeline terminates when the output watermarks
of all of its PCollections progress to +infinity". (Note: this can be
generalized, I guess, to having partial executions of a pipeline: if you're
interested in the full contents of only some collections, then you wait
until the watermarks of only those collections progress to infinity)

A typical "batch" runner mode does not implement watermarks - we can think
of it as assigning watermark -infinity to an output of a transform that
hasn't started executing yet, and +infinity to output of a transform that
has finished executing. This is consistent with how such runners implement
termination in practice.

Dataflow streaming runner additionally implements such termination for
pipeline drain operation: it has 2 parts: 1) stop consuming input from the
sources, and 2) wait until all watermarks progress to infinity.
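
For concreteness, that condition can be sketched as a runner-side check; the
watermark map and the helper below are hypothetical, and only
BoundedWindow.TIMESTAMP_MIN_VALUE / TIMESTAMP_MAX_VALUE are the actual Beam
constants standing in for -infinity / +infinity:

  import java.util.Map;
  import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
  import org.joda.time.Instant;

  class TerminationCheck {
    // A pipeline is "done" once the output watermark of every PCollection has
    // advanced to +infinity; a batch-style runner can be modeled as jumping
    // each watermark from -infinity straight to +infinity when the producing
    // transform finishes.
    static boolean isTerminated(Map<String, Instant> outputWatermarks) {
      for (Instant watermark : outputWatermarks.values()) {
        if (watermark.isBefore(BoundedWindow.TIMESTAMP_MAX_VALUE)) {
          return false; // this collection may still receive non-late data
        }
      }
      return true;
    }
  }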

Let us fill the gap by making this part of the Beam model and declaring
that all runners should implement this behavior. This will give nice
properties, e.g.:
- A pipeline that has only bounded collections can be run by any runner in
any mode, with the same results and termination behavior (this is actually
my motivating example for raising this issue: I was running Splittable
DoFn tests

with the streaming Dataflow runner - these tests produce only bounded
collections - and noticed that they wouldn't terminate even though all data
was processed)
- It will be possible to implement pipelines that stream data for a while
and then eventually successfully terminate based on some condition. E.g. a
pipeline that watches a continuously growing file until it is marked
read-only, or a pipeline that reads a Kafka topic partition until it
receives a "poison pill" message. This seems handy.



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: First stable release: version designation?

2017-03-02 Thread Aljoscha Krettek
I prefer 2.0.0 for the first stable release. It totally makes sense for
people coming from Dataflow 1.x and I can already envision the confusion
between Beam 1.5 and Dataflow 1.5.

On Thu, 2 Mar 2017 at 07:42 Jean-Baptiste Onofré  wrote:

> Hi Davor,
>
>
> From a Beam community perspective, 1.0.0 would make more sense. We have a
> fair number of people starting with Beam (without knowing Dataflow).
>
> However, as the Dataflow SDK (the origin of Beam) was at 1.0.0, 2.0.0 could
> help avoid confusion for users coming to Beam from Dataflow.
>
> I have a preference for 1.0.0 anyway, but I would understand starting
> from 2.0.0.
>
> Regards
> JB
>
> On 03/01/2017 07:56 PM, Davor Bonaci wrote:
> > The first stable release is our next major project-wide goal; see
> > discussion in [1]. I've been referring to it as "the first stable
> release"
> > for a long time, not "1.0.0" or "2.0.0" or "2017" or something else, to
> > make sure we have an unbiased discussion and a consensus-based decision
> on
> > this matter.
> >
> > I think that now is the time to consider the appropriate designation for
> > our first stable release, and formally make a decision on it. Reasonable
> > choices could be "1.0.0" or "2.0.0"; perhaps there are others.
> >
> > 1.0.0:
> > * It logically comes after the current series, 0.x.y.
> > * Most people would expect it, I suppose.
> > * A possible confusion between Dataflow SDKs and Beam SDKs carrying the
> > same number.
> >
> > 2.0.0:
> > * Follows the pattern some other projects have taken -- continuing their
> > version numbering scheme from their previous origin.
> > * Better communicates the project's roots and degree of maturity.
> > * May be unexpected to some users.
> >
> > I'd invite everyone to share their thoughts and preferences -- names are
> > important and well correlated with success. Thanks!
> >
> > Davor
> >
> > [1] https://lists.apache.org/thread.html/c35067071aec9029d9100ae973c629
> > 9aa919c31d0de623ac367128e2@%3Cdev.beam.apache.org%3E
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Apache Beam (virtual) contributor meeting @ Tue Mar 7, 2017

2017-03-02 Thread Aljoscha Krettek
Shoot, I can't because I already have another meeting scheduled. Don't mind
me, though. Will you also maybe produce a video of the meeting?

On Wed, 1 Mar 2017 at 21:50 Davor Bonaci  wrote:

> Hi everyone,
> Based on the high demand [1], let's try to organize a virtual contributor
> meeting on Tuesday, March 7, 2017 at 15:00 UTC. For convenience, calendar
> link [2] and an .ics file are attached.
>
> I tried to accommodate as many time zones as possible, but I know it might
> be hard for some of us at 7 AM on the US west coast or 11 PM in China.
> Sorry about that.
>
> Let's use Google Hangouts as the video conferencing technology. I think we
> may be limited to something like 30 participants, so I'd encourage any
> co-located contributors to consider joining together (if appropriate).
> Joining the meeting should be straightforward -- please find the link
> within. No special requirements that I'm aware of.
>
> Just to re-state the expectations:
> * This is totally optional and informal.
> * It is simply a chance for everyone to meet others and see the faces of
> people we share a common passion with.
> * No specific agenda.
> * An open discussion on any topic of interest to the contributor community
> is welcome -- please feel free to bring up any topics you care about.
> * No formal discussion or decisions should be made.
> * We'll keep notes and share them on the mailing list shortly after the
> meeting.
>
> Thanks -- and hope to see all of you there!
>
> Davor
>
> [1]
> https://lists.apache.org/thread.html/baf057b81c5f6d4127abadac165d923a224d34438fe67b71d73743ad@%3Cdev.beam.apache.org%3E
> [2]
> https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=a3A2MzdhaWdhdjByNWRibzZrN2ZnOG1kMTAgZGF2b3JAZ29vZ2xlLmNvbQ&tmsrc=davor%40google.com
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Jean-Baptiste Onofré

Hi Stephen,

I agree to use the following structure (and it's basically what I 
proposed in a comment of the PR):


io/hadoop
io/hadoop-common
io/hbase

I would be more than happy to help on the "merge" of HdfsIO and 
HadoopFormat.


Regards
JB

On 03/01/2017 08:00 PM, Stephen Sisk wrote:

I wanted to follow up on this thread since I see some potential blocking
questions arising, and I'm trying to help Dipti along with her PR.

Dipti's PR[1] is currently written to put files into:
io/hadoop/inputformat

The recent changes to create hadoop-common created:
io/hadoop-common

This means that the overall structure if we take the HIFIO PR as-is would
be:
io/hadoop/inputformat - the HIFIO (copies of some code in hadoop-common and
hdfs, but no dependency on hadoop-common)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms
io/hdfs - FileInputFormat IO transforms - much shared code with
hadoop/inputformat.

Which I don't think is great b/c there's a common dir, but only some
directories use it, and there's lots of similar-but-slightly different code
in hadoop/inputformat and hdfsio. I don't believe anyone intends this to be
the final result.

After looking at the comments in this thread, I'd like to recommend the
following end-result:  (#1)
io/hadoop -  the HIFIO (dependency on  hadoop-common) - contains both
HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so contents of
hdfs and hadoop/inputformat)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms

To get there I propose the following steps:
1. finish current PR [1] with only renaming the containing module from
hadoop/inputformat -> hadoop, and taking dependency on hadoop-common
2. someone does cleanup to reconcile hdfs and hadoop directories, including
renaming the files so they make sense

I would also be fine with: (#2)
io/hadoop - container dir only
io/hadoop/common
io/hadoop/hbase
io/hadoop/inputformat

I think the downside of #2 is that it hides hbase, which I think deserves
to be top level.

Other comments:
It should be noted that when we have all modules use hadoop-common, we'll
be forcing all hadoop modules to have the same dependencies on hadoop - I
think this makes sense, but it's worth noting that independent dependencies
were the one advantage of the "every hadoop IO transform has its own hadoop
dependency" approach.

On the naming discussion: I personally prefer "inputformat" as the name of
the directory, but I defer to the folks who know the hadoop community more.

S

[1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
[2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087


On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
dipti_dkulka...@persistent.com> wrote:


Thank you  all for your inputs!


-Original Message-
From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
Sent: Friday, February 17, 2017 12:17 PM
To: dev@beam.apache.org
Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module

Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela  wrote:


I agree with Dan on everything regarding HdfsFileSystem - it's super
convenient for users to use TextIO with HdfsFileSystem rather than
replacing the IO and also specifying the InputFormat type.

I disagree on "HadoopIO" - I think that people who work with Hadoop
would find this name intuitive, and that's what's important.
Even more, and joining Raghu's comment, it is also recognized as
"compatible with Hadoop", so for example someone running a Beam
pipeline using the Spark runner on Amazon's S3 and wants to read/write
Hadoop sequence files would simply use HadoopIO and provide the
appropriate runtime dependencies (actually true for GS as well).
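
For illustration, a rough sketch of what such a read could look like, assuming
the HadoopInputFormatIO API shape proposed in the PR (read().withConfiguration())
and its convention of passing the format and key/value classes through the
Configuration; the S3 path, the types and the existing pipeline variable are
made up, and the HadoopInputFormatIO import is omitted since its final package
is exactly what this thread is deciding:

  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

  // Hypothetical: read a Hadoop sequence file from S3 via the proposed IO.
  Configuration conf = new Configuration();
  conf.set("mapreduce.job.inputformat.class", SequenceFileInputFormat.class.getName());
  conf.set("key.class", LongWritable.class.getName());
  conf.set("value.class", Text.class.getName());
  conf.set("mapreduce.input.fileinputformat.inputdir", "s3a://some-bucket/events/");

  PCollection<KV<LongWritable, Text>> records =
      pipeline.apply(HadoopInputFormatIO.<LongWritable, Text>read().withConfiguration(conf));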

On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi

wrote:


FileInputFormat is extremely widely used, pretty much all the file
based input formats extend it. All of them call into it to list the
input files, split (with some tweaks on top of that). The special
API ( *FileInputFormat.setMinInputSplitSize(job,
desiredBundleSizeBytes)* ) is how the split size is normally

communicated.

New IO can use the api directly.
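
For illustration, a hedged sketch of that usage: hint the desired bundle size
to any FileInputFormat subclass and let the format compute the splits. The
class and method names below are made up; the calls themselves are the
standard org.apache.hadoop.mapreduce APIs:

  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  class SplitSketch {
    // Ask a FileInputFormat for splits of (roughly) the desired bundle size.
    static List<InputSplit> split(String inputDir, long desiredBundleSizeBytes)
        throws IOException, InterruptedException {
      Job job = Job.getInstance(new Configuration());
      FileInputFormat.addInputPath(job, new Path(inputDir));
      FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);
      return new TextInputFormat().getSplits(job);
    }
  }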

HdfsIO as implemented in Beam is not HDFS specific at all. There are
no hdfs imports and HDFS name does not appear anywhere other than in

HdfsIO's

own class and method names. AvroHdfsFileSource etc would work just
as

well

with new IO.

On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin




wrote:


(And I think renaming to HadoopIO doesn't make sense.
"InputFormat" is

the

key component of the name -- it reads things that implement the

InputFormat

interface. "Hadoop" means a lot more than that.)



Often 'IO' in Beam implies both sources and sinks. It might not be
long before we support Hadoop OutputFormat as well. In
addition HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can
mean a lot of things depending on the context. In 'IO' context it
might not be too

broad.

Normally it implies 'any FileSystem supported in Hadoo

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Jean-Baptiste Onofré
By the way Stephen, when BEAM-59 is done, the hadoop IO will only
contain the hadoop format support (no HdfsFileSource or HdfsSink
required, as it will use the "regular" FileIO).


Agree ?

Regards
JB

On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:

Hi Stephen,

I agree to use the following structure (and it's basically what I
proposed in a comment of the PR):

io/hadoop
io/hadoop-common
io/hbase

I would be more than happy to help on the "merge" of HdfsIO and
HadoopFormat.

Regards
JB

On 03/01/2017 08:00 PM, Stephen Sisk wrote:

I wanted to follow up on this thread since I see some potential blocking
questions arising, and I'm trying to help dipti along with her PR.

Dipti's PR[1] is currently written to put files into:
io/hadoop/inputformat

The recent changes to create hadoop-common created:
io/hadoop-common

This means that the overall structure if we take the HIFIO PR as-is would
be:
io/hadoop/inputformat - the HIFIO (copies of some code in
hadoop-common and
hdfs, but no dependency on hadoop-common)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms
io/hdfs - FileInputFormat IO transforms - much shared code with
hadoop/inputformat.

Which I don't think is great b/c there's a common dir, but only some
directories use it, and there's lots of similar-but-slightly different
code
in hadoop/inputformat and hdfsio. I don't believe anyone intends this
to be
the final result.

After looking at the comments in this thread, I'd like to recommend the
following end-result:  (#1)
io/hadoop -  the HIFIO (dependency on  hadoop-common) - contains both
HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so contents of
hdfs and hadoop/inputformat)
io/hadoop-common - module with some shared code
io/hbase - hbase IO transforms

To get there I propose the following steps:
1. finish current PR [1] with only renaming the containing module from
hadoop/inputformat -> hadoop, and taking dependency on hadoop-common
2. someone does cleanup to reconcile hdfs and hadoop directories,
including
renaming the files so they make sense

I would also be fine with: (#2)
io/hadoop - container dir only
io/hadoop/common
io/hadoop/hbase
io/hadoop/inputformat

I think the downside of #2 is that it hides hbase, which I think deserves
to be top level.

Other comments:
It should be noted that when we have all modules use hadoop-common, we'll
be forcing all hadoop modules to have the same dependencies on hadoop - I
think this makes sense, but worth noting that as the one advantage of the
"every hadoop IO transform has its own hadoop dependency"

On the naming discussion: I personally prefer "inputformat" as the
name of
the directory, but I defer to the folks who know the hadoop community
more.

S

[1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
[2] HdfsIO dependency change PR -
https://github.com/apache/beam/pull/2087


On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
dipti_dkulka...@persistent.com> wrote:


Thank you  all for your inputs!


-Original Message-
From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
Sent: Friday, February 17, 2017 12:17 PM
To: dev@beam.apache.org
Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module

Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela  wrote:


I agree with Dan on everything regarding HdfsFileSystem - it's super
convenient for users to use TextIO with HdfsFileSystem rather then
replacing the IO and also specifying the InputFormat type.

I disagree on "HadoopIO" - I think that people who work with Hadoop
would find this name intuitive, and that's whats important.
Even more, and joining Raghu's comment, it is also recognized as
"compatible with Hadoop", so for example someone running a Beam
pipeline using the Spark runner on Amazon's S3 and wants to read/write
Hadoop sequence files would simply use HadoopIO and provide the
appropriate runtime dependencies (actually true for GS as well).

On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi

wrote:


FileInputFormat is extremely widely used, pretty much all the file
based input formats extend it. All of them call into to list the
input files, split (with some tweaks on top of that). The special
API ( *FileInputFormat.setMinInputSplitSize(job,
desiredBundleSizeBytes)* ) is how the split size is normally

communicated.

New IO can use the api directly.

HdfsIO as implemented in Beam is not HDFS specific at all. There are
no hdfs imports and HDFS name does not appear anywhere other than in

HdfsIO's

own class and method names. AvroHdfsFileSource etc would work just
as

well

with new IO.

On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin




wrote:


(And I think renaming to HadoopIO doesn't make sense.
"InputFormat" is

the

key component of the name -- it reads things that implement the

InputFormat

interface. "Hadoop" means a lot more than that.)



Often 'IO' in Beam implies both sources and sinks. It might not be
long before we might be suppo

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Ismaël Mejía
Hello,

I answer since I have been leading the refactor to hadoop-common. My
criterion for moving a class into hadoop-common is that it is used by
more than one other module or IO; this is the reason it is not big, but it can
grow if needed.

+1 for option #1 because of the visibility reasons you mention.
For the concrete PR I have the following remarks:

From looking at the PR I think today you can already do the basic refactors
that depend on hadoop-common (to avoid adding repeated code):
- Remove NullWritableCoder from hadoop/inputformat and refactor to use
WritableCoder from hadoop-common.
- Remove WritableCoder from hadoop/inputformat and refactor to use
WritableCoder from hadoop-common.

I have other comments but since those are not directly related to the
refactoring I will address those in the PR.
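
As a concrete illustration of the WritableCoder bullets above, a small hedged
sketch of what using the shared coder from hadoop-common looks like. The
pipeline and data are made up, and WritableCoder.of(...) with the package
org.apache.beam.sdk.io.hadoop is an assumption based on the module layout:

  import org.apache.beam.sdk.io.hadoop.WritableCoder;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.hadoop.io.Text;

  // Hypothetical: Hadoop Writables encoded with the shared hadoop-common
  // coder rather than a copy living inside hadoop/inputformat.
  PCollection<Text> tags =
      pipeline.apply(
          Create.of(new Text("a"), new Text("b"))
              .withCoder(WritableCoder.of(Text.class)));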

Thanks for bringing this issue back to the mailing-list Stephen.
Ismaël


On Thu, Mar 2, 2017 at 3:32 PM, Jean-Baptiste Onofré 
wrote:

> By the way Stephen, when BEAM-59 will be done, hadoop IO will only
> contains the hadoop format support (no HdfsFileSource or HdfsSink required
> as it will use the "regular" FileIO).
>
> Agree ?
>
> Regards
> JB
>
>
> On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:
>
>> Hi Stephen,
>>
>> I agree to use the following structure (and it's basically what I
>> proposed in a comment of the PR):
>>
>> io/hadoop
>> io/hadoop-common
>> io/hbase
>>
>> I would be more than happy to help on the "merge" of HdfsIO and
>> HadoopFormat.
>>
>> Regards
>> JB
>>
>> On 03/01/2017 08:00 PM, Stephen Sisk wrote:
>>
>>> I wanted to follow up on this thread since I see some potential blocking
>>> questions arising, and I'm trying to help dipti along with her PR.
>>>
>>> Dipti's PR[1] is currently written to put files into:
>>> io/hadoop/inputformat
>>>
>>> The recent changes to create hadoop-common created:
>>> io/hadoop-common
>>>
>>> This means that the overall structure if we take the HIFIO PR as-is would
>>> be:
>>> io/hadoop/inputformat - the HIFIO (copies of some code in
>>> hadoop-common and
>>> hdfs, but no dependency on hadoop-common)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>> io/hdfs - FileInputFormat IO transforms - much shared code with
>>> hadoop/inputformat.
>>>
>>> Which I don't think is great b/c there's a common dir, but only some
>>> directories use it, and there's lots of similar-but-slightly different
>>> code
>>> in hadoop/inputformat and hdfsio. I don't believe anyone intends this
>>> to be
>>> the final result.
>>>
>>> After looking at the comments in this thread, I'd like to recommend the
>>> following end-result:  (#1)
>>> io/hadoop -  the HIFIO (dependency on  hadoop-common) - contains both
>>> HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so contents of
>>> hdfs and hadoop/inputformat)
>>> io/hadoop-common - module with some shared code
>>> io/hbase - hbase IO transforms
>>>
>>> To get there I propose the following steps:
>>> 1. finish current PR [1] with only renaming the containing module from
>>> hadoop/inputformat -> hadoop, and taking dependency on hadoop-common
>>> 2. someone does cleanup to reconcile hdfs and hadoop directories,
>>> including
>>> renaming the files so they make sense
>>>
>>> I would also be fine with: (#2)
>>> io/hadoop - container dir only
>>> io/hadoop/common
>>> io/hadoop/hbase
>>> io/hadoop/inputformat
>>>
>>> I think the downside of #2 is that it hides hbase, which I think deserves
>>> to be top level.
>>>
>>> Other comments:
>>> It should be noted that when we have all modules use hadoop-common, we'll
>>> be forcing all hadoop modules to have the same dependencies on hadoop - I
>>> think this makes sense, but worth noting that as the one advantage of the
>>> "every hadoop IO transform has its own hadoop dependency"
>>>
>>> On the naming discussion: I personally prefer "inputformat" as the
>>> name of
>>> the directory, but I defer to the folks who know the hadoop community
>>> more.
>>>
>>> S
>>>
>>> [1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
>>> [2] HdfsIO dependency change PR -
>>> https://github.com/apache/beam/pull/2087
>>>
>>>
>>> On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <
>>> dipti_dkulka...@persistent.com> wrote:
>>>
>>> Thank you  all for your inputs!


 -Original Message-
 From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
 Sent: Friday, February 17, 2017 12:17 PM
 To: dev@beam.apache.org
 Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module

 Raghu, Amit -- +1 to your expertise :)

 On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela 
 wrote:

 I agree with Dan on everything regarding HdfsFileSystem - it's super
> convenient for users to use TextIO with HdfsFileSystem rather then
> replacing the IO and also specifying the InputFormat type.
>
> I disagree on "HadoopIO" - I think that people who work with Hadoop
> would find this na

Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Dan Halperin
Note that even "unbounded pipeline in a streaming runner".waitUntilFinish()
can return, e.g., if you cancel it or terminate it. It's totally reasonable
for users to want to understand and handle these cases.

+1

Dan

On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Good idea !!
>
> Regards
> JB
>
>
> On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
>
>> Raising this onto the mailing list from
>> https://issues.apache.org/jira/browse/BEAM-849
>>
>> The issue came up: what does it mean for a pipeline to finish, in the Beam
>> model?
>>
>> Note that I am deliberately not talking about "batch" and "streaming"
>> pipelines, because this distinction does not exist in the model. Several
>> runners have batch/streaming *modes*, which implement the same semantics
>> (potentially different subsets: in batch mode typically a runner will
>> reject pipelines that have at least one unbounded PCollection) but in an
>> operationally different way. However we should define pipeline termination
>> at the level of the unified model, and then make sure that all runners in
>> all modes implement that properly.
>>
>> One natural way is to say "a pipeline terminates when the output
>> watermarks
>> of all of its PCollection's progress to +infinity". (Note: this can be
>> generalized, I guess, to having partial executions of a pipeline: if
>> you're
>> interested in the full contents of only some collections, then you wait
>> until only the watermarks of those collections progress to infinity)
>>
>> A typical "batch" runner mode does not implement watermarks - we can think
>> of it as assigning watermark -infinity to an output of a transform that
>> hasn't started executing yet, and +infinity to output of a transform that
>> has finished executing. This is consistent with how such runners implement
>> termination in practice.
>>
>> Dataflow streaming runner additionally implements such termination for
>> pipeline drain operation: it has 2 parts: 1) stop consuming input from the
>> sources, and 2) wait until all watermarks progress to infinity.
>>
>> Let us fill the gap by making this part of the Beam model and declaring
>> that all runners should implement this behavior. This will give nice
>> properties, e.g.:
>> - A pipeline that has only bounded collections can be run by any runner in
>> any mode, with the same results and termination behavior (this is actually
>> my motivating example for raising this issue is: I was running Splittable
>> DoFn tests
>> > src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
>> with the streaming Dataflow runner - these tests produce only bounded
>> collections - and noticed that they wouldn't terminate even though all
>> data
>> was processed)
>> - It will be possible to implement pipelines that stream data for a while
>> and then eventually successfully terminate based on some condition. E.g. a
>> pipeline that watches a continuously growing file until it is marked
>> read-only, or a pipeline that reads a Kafka topic partition until it
>> receives a "poison pill" message. This seems handy.
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Kenneth Knowles
Isn't this already the case? I think semantically it is an unavoidable
conclusion, so certainly +1 to that.

The DirectRunner and TestDataflowRunner both have this behavior already.
I've always considered that a streaming job running forever is just [very]
suboptimal shutdown latency :-)

Some bits of the discussion on the ticket seem to surround whether or how
to communicate this property in a generic way. Since a runner owns its
PipelineResult it doesn't seem necessary.

So is the bottom line just that you want to more strongly insist that
runners really terminate in a timely manner? I'm +1 to that, too, for
basically the reason Stas gives: In order to easily programmatically
orchestrate Beam pipelines in a portable way, you do need to know whether
the pipeline will finish without thinking about the specific runner and its
options (as with our RunnableOnService tests).

Kenn

On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin 
wrote:

> Note that even "unbounded pipeline in a streaming runner".waitUntilFinish()
> can return, e.g., if you cancel it or terminate it. It's totally reasonable
> for users to want to understand and handle these cases.
>
> +1
>
> Dan
>
> On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
> wrote:
>
> > +1
> >
> > Good idea !!
> >
> > Regards
> > JB
> >
> >
> > On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
> >
> >> Raising this onto the mailing list from
> >> https://issues.apache.org/jira/browse/BEAM-849
> >>
> >> The issue came up: what does it mean for a pipeline to finish, in the
> Beam
> >> model?
> >>
> >> Note that I am deliberately not talking about "batch" and "streaming"
> >> pipelines, because this distinction does not exist in the model. Several
> >> runners have batch/streaming *modes*, which implement the same semantics
> >> (potentially different subsets: in batch mode typically a runner will
> >> reject pipelines that have at least one unbounded PCollection) but in an
> >> operationally different way. However we should define pipeline
> termination
> >> at the level of the unified model, and then make sure that all runners
> in
> >> all modes implement that properly.
> >>
> >> One natural way is to say "a pipeline terminates when the output
> >> watermarks
> >> of all of its PCollection's progress to +infinity". (Note: this can be
> >> generalized, I guess, to having partial executions of a pipeline: if
> >> you're
> >> interested in the full contents of only some collections, then you wait
> >> until only the watermarks of those collections progress to infinity)
> >>
> >> A typical "batch" runner mode does not implement watermarks - we can
> think
> >> of it as assigning watermark -infinity to an output of a transform that
> >> hasn't started executing yet, and +infinity to output of a transform
> that
> >> has finished executing. This is consistent with how such runners
> implement
> >> termination in practice.
> >>
> >> Dataflow streaming runner additionally implements such termination for
> >> pipeline drain operation: it has 2 parts: 1) stop consuming input from
> the
> >> sources, and 2) wait until all watermarks progress to infinity.
> >>
> >> Let us fill the gap by making this part of the Beam model and declaring
> >> that all runners should implement this behavior. This will give nice
> >> properties, e.g.:
> >> - A pipeline that has only bounded collections can be run by any runner
> in
> >> any mode, with the same results and termination behavior (this is
> actually
> >> my motivating example for raising this issue is: I was running
> Splittable
> >> DoFn tests
> >>  >> src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
> >> with the streaming Dataflow runner - these tests produce only bounded
> >> collections - and noticed that they wouldn't terminate even though all
> >> data
> >> was processed)
> >> - It will be possible to implement pipelines that stream data for a
> while
> >> and then eventually successfully terminate based on some condition.
> E.g. a
> >> pipeline that watches a continuously growing file until it is marked
> >> read-only, or a pipeline that reads a Kafka topic partition until it
> >> receives a "poison pill" message. This seems handy.
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Stephen Sisk
Thanks for your time thinking about this!

Sounds like we all like #1. That's great. I see that Ismael commented on
the PR to suggest the specific changes, so I think we should be good to go.

To answer JB's question about the later merging:
> when BEAM-59 is done, the hadoop IO will only contain the hadoop format
support (no HdfsFileSource or HdfsSink required as it will use the
"regular" FileIO).
I think there are still some improvements to be had by having at least some
code unique to HdfsFileSource/Sink. Dan called out above "the
FileInputFormat reader gets to call some special APIs that the
generic InputFormat reader cannot -- so they are not completely
redundant. Specifically,
FileInputFormat reader can do size-based splitting."

Dan recommended: "See if we can "inline" the FileInputFormat specific parts
of HdfsIO inside of HadoopInputFormatIO via reflection. If so, we can get
the best of both worlds with shared code." This seems reasonable to me.
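
A hedged sketch of that idea, using a plain instanceof check rather than the
reflection Dan mentions, and with hypothetical names, just to show the shape:

  import org.apache.hadoop.mapreduce.InputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  class SplitHintSketch {
    // Apply the size-based splitting hint only when the user-supplied format
    // actually is a FileInputFormat; other InputFormats are left untouched.
    static void maybeHintBundleSize(
        InputFormat<?, ?> format, Job job, long desiredBundleSizeBytes) {
      if (format instanceof FileInputFormat) {
        FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);
      }
    }
  }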

I created BEAM-1592 so we can further discuss there if need be.

S

On Thu, Mar 2, 2017 at 6:45 AM, Ismaël Mejía  wrote:

> ​Hello,
>
> I answer since I have been leading the refactor to hadoop-common. My
> criteria to move a class into hadoop-common is that it is used at least by
> more than one other module or IO, this is the reason is not big, but it can
> grow if needed.
>
> +1 for option #1 because of the visibility reasons you mention.
> For the concrete PR I have the following remarks:
>
> From looking at the PR I think today you can already do the basic refactors
> that depend on hadoop-common (to avoid adding repeated code):
> - Remove NullWritableCoder from hadoop/inputformat and refactor to use
> WritableCoder from hadoop-common.
> - Remove WritableCoder from hadoop/inputformat and refactor to use
> WritableCoder from hadoop-common.
>
> I have other comments but since those are not directly related to the
> refactoring I will address those in the PR.
>
> Thanks for bringing this issue back to the mailing-list Stephen.
> Ismaël
> ​
>
>
> On Thu, Mar 2, 2017 at 3:32 PM, Jean-Baptiste Onofré 
> wrote:
>
> > By the way Stephen, when BEAM-59 will be done, hadoop IO will only
> > contains the hadoop format support (no HdfsFileSource or HdfsSink
> required
> > as it will use the "regular" FileIO).
> >
> > Agree ?
> >
> > Regards
> > JB
> >
> >
> > On 03/02/2017 03:27 PM, Jean-Baptiste Onofré wrote:
> >
> >> Hi Stephen,
> >>
> >> I agree to use the following structure (and it's basically what I
> >> proposed in a comment of the PR):
> >>
> >> io/hadoop
> >> io/hadoop-common
> >> io/hbase
> >>
> >> I would be more than happy to help on the "merge" of HdfsIO and
> >> HadoopFormat.
> >>
> >> Regards
> >> JB
> >>
> >> On 03/01/2017 08:00 PM, Stephen Sisk wrote:
> >>
> >>> I wanted to follow up on this thread since I see some potential
> blocking
> >>> questions arising, and I'm trying to help dipti along with her PR.
> >>>
> >>> Dipti's PR[1] is currently written to put files into:
> >>> io/hadoop/inputformat
> >>>
> >>> The recent changes to create hadoop-common created:
> >>> io/hadoop-common
> >>>
> >>> This means that the overall structure if we take the HIFIO PR as-is
> would
> >>> be:
> >>> io/hadoop/inputformat - the HIFIO (copies of some code in
> >>> hadoop-common and
> >>> hdfs, but no dependency on hadoop-common)
> >>> io/hadoop-common - module with some shared code
> >>> io/hbase - hbase IO transforms
> >>> io/hdfs - FileInputFormat IO transforms - much shared code with
> >>> hadoop/inputformat.
> >>>
> >>> Which I don't think is great b/c there's a common dir, but only some
> >>> directories use it, and there's lots of similar-but-slightly different
> >>> code
> >>> in hadoop/inputformat and hdfsio. I don't believe anyone intends this
> >>> to be
> >>> the final result.
> >>>
> >>> After looking at the comments in this thread, I'd like to recommend the
> >>> following end-result:  (#1)
> >>> io/hadoop -  the HIFIO (dependency on  hadoop-common) - contains both
> >>> HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (so contents
> of
> >>> hdfs and hadoop/inputformat)
> >>> io/hadoop-common - module with some shared code
> >>> io/hbase - hbase IO transforms
> >>>
> >>> To get there I propose the following steps:
> >>> 1. finish current PR [1] with only renaming the containing module from
> >>> hadoop/inputformat -> hadoop, and taking dependency on hadoop-common
> >>> 2. someone does cleanup to reconcile hdfs and hadoop directories,
> >>> including
> >>> renaming the files so they make sense
> >>>
> >>> I would also be fine with: (#2)
> >>> io/hadoop - container dir only
> >>> io/hadoop/common
> >>> io/hadoop/hbase
> >>> io/hadoop/inputformat
> >>>
> >>> I think the downside of #2 is that it hides hbase, which I think
> deserves
> >>> to be top level.
> >>>
> >>> Other comments:
> >>> It should be noted that when we have all modules use hadoop-common,
> we'll
> >>> be forcing all hadoop modules to have the same dependencies on hadoop
> -

Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Eugene Kirpichov
OK, I'm glad everybody is in agreement on this. I raised this point because
we've been discussing implementing this behavior in the Dataflow streaming
runner, and I wanted to make sure that people are okay with it from a
conceptual point of view before proceeding.

On Thu, Mar 2, 2017 at 10:27 AM Kenneth Knowles 
wrote:

Isn't this already the case? I think semantically it is an unavoidable
conclusion, so certainly +1 to that.

The DirectRunner and TestDataflowRunner both have this behavior already.
I've always considered that a streaming job running forever is just [very]
suboptimal shutdown latency :-)

Some bits of the discussion on the ticket seem to surround whether or how
to communicate this property in a generic way. Since a runner owns its
PipelineResult it doesn't seem necessary.

So is the bottom line just that you want to more strongly insist that
runners really terminate in a timely manner? I'm +1 to that, too, for
basically the reason Stas gives: In order to easily programmatically
orchestrate Beam pipelines in a portable way, you do need to know whether
the pipeline will finish without thinking about the specific runner and its
options (as with our RunnableOnService tests).

Kenn

On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin 
wrote:

> Note that even "unbounded pipeline in a streaming
runner".waitUntilFinish()
> can return, e.g., if you cancel it or terminate it. It's totally
reasonable
> for users to want to understand and handle these cases.
>
> +1
>
> Dan
>
> On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
> wrote:
>
> > +1
> >
> > Good idea !!
> >
> > Regards
> > JB
> >
> >
> > On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
> >
> >> Raising this onto the mailing list from
> >> https://issues.apache.org/jira/browse/BEAM-849
> >>
> >> The issue came up: what does it mean for a pipeline to finish, in the
> Beam
> >> model?
> >>
> >> Note that I am deliberately not talking about "batch" and "streaming"
> >> pipelines, because this distinction does not exist in the model.
Several
> >> runners have batch/streaming *modes*, which implement the same
semantics
> >> (potentially different subsets: in batch mode typically a runner will
> >> reject pipelines that have at least one unbounded PCollection) but in
an
> >> operationally different way. However we should define pipeline
> termination
> >> at the level of the unified model, and then make sure that all runners
> in
> >> all modes implement that properly.
> >>
> >> One natural way is to say "a pipeline terminates when the output
> >> watermarks
> >> of all of its PCollection's progress to +infinity". (Note: this can be
> >> generalized, I guess, to having partial executions of a pipeline: if
> >> you're
> >> interested in the full contents of only some collections, then you wait
> >> until only the watermarks of those collections progress to infinity)
> >>
> >> A typical "batch" runner mode does not implement watermarks - we can
> think
> >> of it as assigning watermark -infinity to an output of a transform that
> >> hasn't started executing yet, and +infinity to output of a transform
> that
> >> has finished executing. This is consistent with how such runners
> implement
> >> termination in practice.
> >>
> >> Dataflow streaming runner additionally implements such termination for
> >> pipeline drain operation: it has 2 parts: 1) stop consuming input from
> the
> >> sources, and 2) wait until all watermarks progress to infinity.
> >>
> >> Let us fill the gap by making this part of the Beam model and declaring
> >> that all runners should implement this behavior. This will give nice
> >> properties, e.g.:
> >> - A pipeline that has only bounded collections can be run by any runner
> in
> >> any mode, with the same results and termination behavior (this is
> actually
> >> my motivating example for raising this issue is: I was running
> Splittable
> >> DoFn tests
> >>  >> src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
> >> with the streaming Dataflow runner - these tests produce only bounded
> >> collections - and noticed that they wouldn't terminate even though all
> >> data
> >> was processed)
> >> - It will be possible to implement pipelines that stream data for a
> while
> >> and then eventually successfully terminate based on some condition.
> E.g. a
> >> pipeline that watches a continuously growing file until it is marked
> >> read-only, or a pipeline that reads a Kafka topic partition until it
> >> receives a "poison pill" message. This seems handy.
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Vacation for a few weeks

2017-03-02 Thread Dan Halperin
Hey folks,

I wanted to give you a heads-up that I'll be offline starting tomorrow
through 20th March.

I think I've handled most of the questions and pull requests and JIRA
issues you've sent me, but I know the community will be happy to help with
urgent issues in the rest.

(I also will not be able to vote on the 0.6.0 release, or attend the
virtual meetup next week. Sorry, and good luck!)

Have fun!
Dan


Re: Vacation for a few weeks

2017-03-02 Thread Jean-Baptiste Onofré

ENJOY !

You deserve great vacations ! Say hi to the Maoris for me ;)

Regards
JB

On 03/02/2017 08:22 PM, Dan Halperin wrote:

Hey folks,

I wanted to give you a heads-up that I'll be offline starting tomorrow
through 20th March.

I think I've handled most of the questions and pull requests and JIRA
issues you've sent me, but I know the community will be happy to help with
urgent issues in the rest.

(I also will not be able to vote on the 0.6.0 release, or attend the
virtual meetup next week. Sorry, and good luck!)

Have fun!
Dan



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Performance Testing Next Steps

2017-03-02 Thread Jason Kuster
Glad to hear the excitement. :)

Filed BEAM-1595 - 1609 to track work items. Some of these fall under runner
components; please feel free to reach out to me if you have any questions
about how to accomplish these.

Best,

Jason

On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek 
wrote:

> Thanks for writing this and taking care of this, Jason!
>
> I'm afraid I also cannot add anything except that I'm excited to see some
> results from this.
>
> On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles 
> wrote:
>
> Just got a chance to look this over. I don't have anything to add, but I'm
> pretty excited to follow this project. Have the JIRAs been filed since you
> shared the doc?
>
> On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
> jasonkus...@google.com.invalid> wrote:
>
> > Hey all, just wanted to pop this up again for people -- if anyone has
> > thoughts on performance testing please feel welcome to chime in. :)
> >
> > On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster 
> > wrote:
> >
> > > Hi all,
> > >
> > > I've written up a doc on next steps for getting performance testing up
> > and
> > > running for Beam. I'd love to hear from people -- there's a fair amount
> > of
> > > work encapsulated in here, but the end result is that we have a
> > performance
> > > testing system which we can use for benchmarking all aspects of Beam,
> > which
> > > would be really exciting. Looking forward to your thoughts.
> > >
> > > https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOz
> > > ph5FnL2DhaRDz0/edit?ts=58a78e73
> > >
> > > Best,
> > >
> > > Jason
> > >
> > > --
> > > ---
> > > Jason Kuster
> > > Apache Beam / Google Cloud Dataflow
> > >
> >
> >
> >
> > --
> > ---
> > Jason Kuster
> > Apache Beam / Google Cloud Dataflow
> >
>



-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow


Re: Performance Testing Next Steps

2017-03-02 Thread Ahmet Altay
Thank you Jason, this is great.

Which one of these issues falls into the land of sdk-py?

Ahmet

On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
jasonkus...@google.com.invalid> wrote:

> Glad to hear the excitement. :)
>
> Filed BEAM-1595 - 1609 to track work items. Some of these fall under runner
> components, please feel free to reach out to me if you have any questions
> about how to accomplish these.
>
> Best,
>
> Jason
>
> On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek 
> wrote:
>
> > Thanks for writing this and taking care of this, Jason!
> >
> > I'm afraid I also cannot add anything except that I'm excited to see some
> > results from this.
> >
> > On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles 
> > wrote:
> >
> > Just got a chance to look this over. I don't have anything to add, but
> I'm
> > pretty excited to follow this project. Have the JIRAs been filed since
> you
> > shared the doc?
> >
> > On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
> > jasonkus...@google.com.invalid> wrote:
> >
> > > Hey all, just wanted to pop this up again for people -- if anyone has
> > > thoughts on performance testing please feel welcome to chime in. :)
> > >
> > > On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've written up a doc on next steps for getting performance testing
> up
> > > and
> > > > running for Beam. I'd love to hear from people -- there's a fair
> amount
> > > of
> > > > work encapsulated in here, but the end result is that we have a
> > > performance
> > > > testing system which we can use for benchmarking all aspects of Beam,
> > > which
> > > > would be really exciting. Looking forward to your thoughts.
> > > >
> > > > https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOz
> > > > ph5FnL2DhaRDz0/edit?ts=58a78e73
> > > >
> > > > Best,
> > > >
> > > > Jason
> > > >
> > > > --
> > > > ---
> > > > Jason Kuster
> > > > Apache Beam / Google Cloud Dataflow
> > > >
> > >
> > >
> > >
> > > --
> > > ---
> > > Jason Kuster
> > > Apache Beam / Google Cloud Dataflow
> > >
> >
>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Re: Performance Testing Next Steps

2017-03-02 Thread Jason Kuster
D'oh, my bad Ahmet. I've opened BEAM-1610, which handles support for Python
in PKB against the Dataflow runner. Once the Fn API progresses some more we
can add some work items for the other runners too. Let's chat about this
more, maybe next week?

On Thu, Mar 2, 2017 at 1:31 PM, Ahmet Altay 
wrote:

> Thank you Jason, this is great.
>
> Which one of these issues fall into the land of sdk-py?
>
> Ahmet
>
> On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
> jasonkus...@google.com.invalid> wrote:
>
> > Glad to hear the excitement. :)
> >
> > Filed BEAM-1595 - 1609 to track work items. Some of these fall under
> runner
> > components, please feel free to reach out to me if you have any questions
> > about how to accomplish these.
> >
> > Best,
> >
> > Jason
> >
> > On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek 
> > wrote:
> >
> > > Thanks for writing this and taking care of this, Jason!
> > >
> > > I'm afraid I also cannot add anything except that I'm excited to see
> some
> > > results from this.
> > >
> > > On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles 
> > > wrote:
> > >
> > > Just got a chance to look this over. I don't have anything to add, but
> > I'm
> > > pretty excited to follow this project. Have the JIRAs been filed since
> > you
> > > shared the doc?
> > >
> > > On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
> > > jasonkus...@google.com.invalid> wrote:
> > >
> > > > Hey all, just wanted to pop this up again for people -- if anyone has
> > > > thoughts on performance testing please feel welcome to chime in. :)
> > > >
> > > > On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster <
> jasonkus...@google.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I've written up a doc on next steps for getting performance testing
> > up
> > > > and
> > > > > running for Beam. I'd love to hear from people -- there's a fair
> > amount
> > > > of
> > > > > work encapsulated in here, but the end result is that we have a
> > > > performance
> > > > > testing system which we can use for benchmarking all aspects of
> Beam,
> > > > which
> > > > > would be really exciting. Looking forward to your thoughts.
> > > > >
> > > > > https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOz
> > > > > ph5FnL2DhaRDz0/edit?ts=58a78e73
> > > > >
> > > > > Best,
> > > > >
> > > > > Jason
> > > > >
> > > > > --
> > > > > ---
> > > > > Jason Kuster
> > > > > Apache Beam / Google Cloud Dataflow
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > ---
> > > > Jason Kuster
> > > > Apache Beam / Google Cloud Dataflow
> > > >
> > >
> >
> >
> >
> > --
> > ---
> > Jason Kuster
> > Apache Beam / Google Cloud Dataflow
> >
>



-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow


Re: Performance Testing Next Steps

2017-03-02 Thread Ahmet Altay
Sounds great, thank you!

On Thu, Mar 2, 2017 at 1:41 PM, Jason Kuster  wrote:

> D'oh, my bad Ahmet. I've opened BEAM-1610, which handles support for Python
> in PKB against the Dataflow runner. Once the Fn API progresses some more we
> can add some work items for the other runners too. Let's chat about this
> more, maybe next week?
>
> On Thu, Mar 2, 2017 at 1:31 PM, Ahmet Altay 
> wrote:
>
> > Thank you Jason, this is great.
> >
> > Which one of these issues fall into the land of sdk-py?
> >
> > Ahmet
> >
> > On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
> > jasonkus...@google.com.invalid> wrote:
> >
> > > Glad to hear the excitement. :)
> > >
> > > Filed BEAM-1595 - 1609 to track work items. Some of these fall under
> > runner
> > > components, please feel free to reach out to me if you have any
> questions
> > > about how to accomplish these.
> > >
> > > Best,
> > >
> > > Jason
> > >
> > > On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek 
> > > wrote:
> > >
> > > > Thanks for writing this and taking care of this, Jason!
> > > >
> > > > I'm afraid I also cannot add anything except that I'm excited to see
> > some
> > > > results from this.
> > > >
> > > > On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles 
> > > > wrote:
> > > >
> > > > Just got a chance to look this over. I don't have anything to add,
> but
> > > I'm
> > > > pretty excited to follow this project. Have the JIRAs been filed
> since
> > > you
> > > > shared the doc?
> > > >
> > > > On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
> > > > jasonkus...@google.com.invalid> wrote:
> > > >
> > > > > Hey all, just wanted to pop this up again for people -- if anyone
> has
> > > > > thoughts on performance testing please feel welcome to chime in. :)
> > > > >
> > > > > On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster <
> > jasonkus...@google.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I've written up a doc on next steps for getting performance
> testing
> > > up
> > > > > and
> > > > > > running for Beam. I'd love to hear from people -- there's a fair
> > > amount
> > > > > of
> > > > > > work encapsulated in here, but the end result is that we have a
> > > > > performance
> > > > > > testing system which we can use for benchmarking all aspects of
> > Beam,
> > > > > which
> > > > > > would be really exciting. Looking forward to your thoughts.
> > > > > >
> > > > > > https://docs.google.com/document/d/
> 1PsjGPSN6FuorEEPrKEP3u3m16tyOz
> > > > > > ph5FnL2DhaRDz0/edit?ts=58a78e73
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Jason
> > > > > >
> > > > > > --
> > > > > > ---
> > > > > > Jason Kuster
> > > > > > Apache Beam / Google Cloud Dataflow
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > ---
> > > > > Jason Kuster
> > > > > Apache Beam / Google Cloud Dataflow
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > ---
> > > Jason Kuster
> > > Apache Beam / Google Cloud Dataflow
> > >
> >
>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Re: Apache Beam (virtual) contributor meeting @ Tue Mar 7, 2017

2017-03-02 Thread Amit Sela
I'll be there!

On Thu, Mar 2, 2017 at 1:06 PM Aljoscha Krettek  wrote:

> Shoot, I can't because I already have another meeting scheduled. Don't mind
> me, though. Will you also maybe produce a video of the meeting?
>
> On Wed, 1 Mar 2017 at 21:50 Davor Bonaci  wrote:
>
> > Hi everyone,
> > Based on the high demand [1], let's try to organize a virtual contributor
> > meeting on Tuesday, March 7, 2017 at 15:00 UTC. For convenience, calendar
> > link [2] and an .ics file are attached.
> >
> > I tried to accommodate as many time zones as possible, but I know it
> might
> > be hard for some of us at 7 AM on the US west coast or 11 PM in China.
> > Sorry about that.
> >
> > Let's use Google Hangouts as the video conferencing technology. I think
> we
> > may be limited to something like 30 participants, so I'd encourage any
> > co-located contributors to consider joining together (if appropriate).
> > Joining the meeting should be straightforward -- please find the link
> > within. No special requirements that I'm aware of.
> >
> > Just to re-state the expectations:
> > * This is totally optional and informal.
> > * It is simply a chance for everyone to meet others and see the faces of
> > people we share a common passion with.
> > * No specific agenda.
> > * An open discussion on any topic of interest to the contributor
> > community is welcome -- please feel free to bring up any topics you
> > care about.
> > * No formal discussions or decisions should be made.
> > * We'll keep notes and share them on the mailing list shortly after the
> > meeting.
> >
> > Thanks -- and hope to see all of you there!
> >
> > Davor
> >
> > [1]
> >
> https://lists.apache.org/thread.html/baf057b81c5f6d4127abadac165d923a224d34438fe67b71d73743ad@%3Cdev.beam.apache.org%3E
> > [2]
> >
> https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=a3A2MzdhaWdhdjByNWRibzZrN2ZnOG1kMTAgZGF2b3JAZ29vZ2xlLmNvbQ&tmsrc=davor%40google.com
> >
>


Re: Pipeline termination in the unified Beam model

2017-03-02 Thread Amit Sela
+1 on Eugene's words - this shows how batch is conceptually a subset of the
streaming problem.
I also believe that Stas has a very good point on education - we have to
try to understand developers' current perspective and make the transition
to the Beam model as natural as possible for new users.
In addition to good documentation and examples, I think that
https://issues.apache.org/jira/browse/BEAM-849 is critical, as it is the
user's end-point for the behaviours discussed here, and so it should be:
* clear and concise - the pipeline state at any point should be informative.
* well documented - documentation, examples, and use-cases (e.g., Eugene's
"poison pill").
* a strict API for runners - joining Stas' note on a unified implementation
for portability.
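
[Editor's note: as a rough illustration of the user-facing end-point discussed
above, here is a minimal sketch of runner-agnostic orchestration using the Beam
Java SDK's PipelineResult API. The class name, the bounded Create.of() source,
and the 30-minute timeout are illustrative assumptions, not something proposed
in the thread.]

import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.joda.time.Duration;

public class WaitForTermination {

  public static void main(String[] args) throws IOException {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // A bounded source: under the semantics discussed in this thread, its
    // watermark eventually reaches +infinity, so the pipeline terminates on
    // any runner, regardless of batch or streaming mode.
    p.apply(Create.of("a", "b", "c"));

    PipelineResult result = p.run();

    // Orchestration only needs PipelineResult; it does not have to know
    // which runner (or mode) executes the pipeline. Bound the wait so we
    // can decide what to do if the job is still running.
    PipelineResult.State state =
        result.waitUntilFinish(Duration.standardMinutes(30));

    if (state == null || !state.isTerminal()) {
      // Still running after the deadline (e.g. an unbounded pipeline whose
      // watermarks have not yet reached +infinity): cancel explicitly and
      // wait for the terminal state.
      result.cancel();
      state = result.waitUntilFinish();
    }

    // DONE, FAILED and CANCELLED are all terminal states a caller may see.
    System.out.println("Pipeline finished in state: " + state);
  }
}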

On Thu, Mar 2, 2017 at 8:49 PM Eugene Kirpichov
 wrote:

> OK, I'm glad everybody is in agreement on this. I raised this point because
> we've been discussing implementing this behavior in the Dataflow streaming
> runner, and I wanted to make sure that people are okay with it from a
> conceptual point of view before proceeding.
>
> On Thu, Mar 2, 2017 at 10:27 AM Kenneth Knowles 
> wrote:
>
> Isn't this already the case? I think semantically it is an unavoidable
> conclusion, so certainly +1 to that.
>
> The DirectRunner and TestDataflowRunner both have this behavior already.
> I've always considered that a streaming job running forever is just [very]
> suboptimal shutdown latency :-)
>
> Some bits of the discussion on the ticket seem to surround whether or how
> to communicate this property in a generic way. Since a runner owns its
> PipelineResult it doesn't seem necessary.
>
> So is the bottom line just that you want to more strongly insist that
> runners really terminate in a timely manner? I'm +1 to that, too, for
> basically the reason Stas gives: In order to easily programmatically
> orchestrate Beam pipelines in a portable way, you do need to know whether
> the pipeline will finish without thinking about the specific runner and its
> options (as with our RunnableOnService tests).
>
> Kenn
>
> On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin 
> wrote:
>
> > Note that even "unbounded pipeline in a streaming
> runner".waitUntilFinish()
> > can return, e.g., if you cancel it or terminate it. It's totally
> reasonable
> > for users to want to understand and handle these cases.
> >
> > +1
> >
> > Dan
> >
> > On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > +1
> > >
> > > Good idea !!
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
> > >
> > >> Raising this onto the mailing list from
> > >> https://issues.apache.org/jira/browse/BEAM-849
> > >>
> > >> The issue came up: what does it mean for a pipeline to finish, in the
> > Beam
> > >> model?
> > >>
> > >> Note that I am deliberately not talking about "batch" and "streaming"
> > >> pipelines, because this distinction does not exist in the model.
> Several
> > >> runners have batch/streaming *modes*, which implement the same
> semantics
> > >> (potentially different subsets: in batch mode typically a runner will
> > >> reject pipelines that have at least one unbounded PCollection) but in
> an
> > >> operationally different way. However we should define pipeline
> > termination
> > >> at the level of the unified model, and then make sure that all runners
> > in
> > >> all modes implement that properly.
> > >>
> > >> One natural way is to say "a pipeline terminates when the output
> > >> watermarks
> > >> of all of its PCollection's progress to +infinity". (Note: this can be
> > >> generalized, I guess, to having partial executions of a pipeline: if
> > >> you're
> > >> interested in the full contents of only some collections, then you
> wait
> > >> until only the watermarks of those collections progress to infinity)
> > >>
> > >> A typical "batch" runner mode does not implement watermarks - we can
> > think
> > >> of it as assigning watermark -infinity to an output of a transform
> that
> > >> hasn't started executing yet, and +infinity to output of a transform
> > that
> > >> has finished executing. This is consistent with how such runners
> > implement
> > >> termination in practice.
> > >>
> > >> Dataflow streaming runner additionally implements such termination for
> > >> pipeline drain operation: it has 2 parts: 1) stop consuming input from
> > the
> > >> sources, and 2) wait until all watermarks progress to infinity.
> > >>
> > >> Let us fill the gap by making this part of the Beam model and
> declaring
> > >> that all runners should implement this behavior. This will give nice
> > >> properties, e.g.:
> > >> - A pipeline that has only bounded collections can be run by any
> runner
> > in
> > >> any mode, with the same results and termination behavior (this is
> > actually
> > >> my motivating example for raising this issue: I was running
> > Splittable
> > >> DoFn tests
> > >> 

Re: Performance Testing Next Steps

2017-03-02 Thread Amit Sela
Looks great, and I'll be sure to follow this. Ping me if I can assist in
any way!

On Fri, Mar 3, 2017 at 12:09 AM Ahmet Altay 
wrote:

> Sounds great, thank you!
>
> On Thu, Mar 2, 2017 at 1:41 PM, Jason Kuster  .invalid
> > wrote:
>
> > D'oh, my bad Ahmet. I've opened BEAM-1610, which handles support for
> Python
> > in PKB against the Dataflow runner. Once the Fn API progresses some more
> we
> > can add some work items for the other runners too. Let's chat about this
> > more, maybe next week?
> >
> > On Thu, Mar 2, 2017 at 1:31 PM, Ahmet Altay 
> > wrote:
> >
> > > Thank you Jason, this is great.
> > >
> > > Which one of these issues fall into the land of sdk-py?
> > >
> > > Ahmet
> > >
> > > On Thu, Mar 2, 2017 at 12:34 PM, Jason Kuster <
> > > jasonkus...@google.com.invalid> wrote:
> > >
> > > > Glad to hear the excitement. :)
> > > >
> > > > Filed BEAM-1595 - 1609 to track work items. Some of these fall under
> > > runner
> > > > components, please feel free to reach out to me if you have any
> > questions
> > > > about how to accomplish these.
> > > >
> > > > Best,
> > > >
> > > > Jason
> > > >
> > > > On Wed, Mar 1, 2017 at 5:50 AM, Aljoscha Krettek <
> aljos...@apache.org>
> > > > wrote:
> > > >
> > > > > Thanks for writing this and taking care of this, Jason!
> > > > >
> > > > > I'm afraid I also cannot add anything except that I'm excited to
> see
> > > some
> > > > > results from this.
> > > > >
> > > > > On Wed, 1 Mar 2017 at 03:28 Kenneth Knowles  >
> > > > > wrote:
> > > > >
> > > > > Just got a chance to look this over. I don't have anything to add,
> > but
> > > > I'm
> > > > > pretty excited to follow this project. Have the JIRAs been filed
> > since
> > > > you
> > > > > shared the doc?
> > > > >
> > > > > On Wed, Feb 22, 2017 at 10:38 AM, Jason Kuster <
> > > > > jasonkus...@google.com.invalid> wrote:
> > > > >
> > > > > > Hey all, just wanted to pop this up again for people -- if anyone
> > has
> > > > > > thoughts on performance testing please feel welcome to chime in.
> :)
> > > > > >
> > > > > > On Fri, Feb 17, 2017 at 4:03 PM, Jason Kuster <
> > > jasonkus...@google.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I've written up a doc on next steps for getting performance
> > testing
> > > > up
> > > > > > and
> > > > > > > running for Beam. I'd love to hear from people -- there's a
> fair
> > > > amount
> > > > > > of
> > > > > > > work encapsulated in here, but the end result is that we have a
> > > > > > performance
> > > > > > > testing system which we can use for benchmarking all aspects of
> > > Beam,
> > > > > > which
> > > > > > > would be really exciting. Looking forward to your thoughts.
> > > > > > >
> > > > > > > https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOzph5FnL2DhaRDz0/edit?ts=58a78e73
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Jason
> > > > > > >
> > > > > > > --
> > > > > > > ---
> > > > > > > Jason Kuster
> > > > > > > Apache Beam / Google Cloud Dataflow
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > ---
> > > > > > Jason Kuster
> > > > > > Apache Beam / Google Cloud Dataflow
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > ---
> > > > Jason Kuster
> > > > Apache Beam / Google Cloud Dataflow
> > > >
> > >
> >
> >
> >
> > --
> > ---
> > Jason Kuster
> > Apache Beam / Google Cloud Dataflow
> >
>


Re: Release 0.6.0

2017-03-02 Thread Kenneth Knowles
Hi all,

I've just filed https://issues.apache.org/jira/browse/BEAM-1611. It is
technically not a bug in Beam, but the easiest quick fix is to work around it
in the DataflowRunner, so I'd like to block the release on it. It should be
available ahead of the release's existing schedule, and can easily be
cherry-picked.

Kenn

On Wed, Mar 1, 2017 at 10:50 PM, Jean-Baptiste Onofré 
wrote:

> Thanks Ahmet !
>
> Regards
> JB
>
> On 03/02/2017 07:42 AM, Ahmet Altay wrote:
>
>> Sure, I can wait. To be clear, Thursday night in which time zone?
>>
>> Thank you,
>> Ahmet
>>
>> On Wed, Mar 1, 2017 at 10:38 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi Ahmet,
>>>
>>> Can you wait up to Thursday night ? Trying to merge BEAM-649.
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>>
>>> On 03/01/2017 07:23 PM, Ahmet Altay wrote:
>>>
>>> Thank you. I will start working on it.

 Ahmet

 On Wed, Mar 1, 2017 at 9:03 AM, Aljoscha Krettek 
 wrote:

 I just closed the last blocking issue, we should be good to go now.

>
> Sorry again for the hold-up.
>
> On Tue, 28 Feb 2017 at 18:38 Ahmet Altay 
> wrote:
>
> Thank you all. I will wait for release blocking issues to be closed.
>
> Sergio, thank you for the information. I will document the friction
> points
> during this release process. Following the release we can start a
> discussion about how to fix those.
>
> Ahmet
>
> On Tue, Feb 28, 2017 at 9:22 AM, Aljoscha Krettek  >
> wrote:
>
>> That was my mistake, sorry for that. I should have tagged [1] as a
>> blocker because leaking state is probably a bad idea. At least then people
>> would be aware and we could have discussed whether it is a blocker.
>>
>> There is already an open PR for this now.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-1517
>>
>> On Tue, 28 Feb 2017 at 18:21 Jean-Baptiste Onofré  wrote:
>>
>>> Regarding BEAM-649, it's not a release blocker, it's a good to have.
>>>
>>> As I'm pretty close to the end of the Pull Request (hopefully tonight or
>>> tomorrow), it's a "Good To Have".
>>>
>>> Regards
>>> JB
>>>
>>> On 02/28/2017 06:09 PM, Davor Bonaci wrote:
>>>
>>>> Can we please use JIRA to tag potentially release-blocking issues? Anyone
>>>> can just add a 'Fix Versions' field of an open issue to the next scheduled
>>>> release -- and it becomes easily visible to everyone in the project.
>>>>
>>>> In general, I'm not a fan of blocking releases for new functionality.
>>>> Rushing new features and a lack of baking time usually translates to bugs.
>>>> However, I think this time it is totally justified -- on a separate thread
>>>> we plan for this to be the last release before the "first stable release";
>>>> and picking the new features now will provide additional coverage for it.
>>>>
>>>> So, +1, but please tag in JIRA.
>>>>
>>>> On Tue, Feb 28, 2017 at 2:09 AM, Aljoscha Krettek <aljos...@apache.org>
>>>> wrote:
>>>>
>>>>> I would like to finish these two:
>>>>> https://issues.apache.org/jira/browse/BEAM-1036: Support for new State
>>>>> API in FlinkRunner
>>>>> https://issues.apache.org/jira/browse/BEAM-1116: Support for new Timer
>>>>> API in Flink runner
>>>>>
>>>>> Both of them are finished for the streaming runner, for the batch runner
>>>>> I'm merging the code for the first right now and the second will not
>>>>> take long.
>>>>>
>>>>> There is also this: https://issues.apache.org/jira/browse/BEAM-1517 :
>>>>> User state in the Flink Streaming Runner is not garbage collected. It's
>>>>> not a regression from 0.5.0 where we simply didn't have this feature but
>>>>> I'm still somewhat uneasy about this.
>>>>>
>>>>> On Tue, 28 Feb 2017 at 09:44 Jean-Baptiste Onofré  wrote:
>>>>>
>>>>>> Fair enough.
>>>>>>
>>>>>> I also try to merge https://github.com/apache/beam/pull/1739 asap.
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 02/28/2017 09:34 AM, Amit Sela wrote:
>>>>>>
>>>>>>> I'd prefer we wait to merge https://github.com/apache/beam/pull/2050

Re: Apache Beam (virtual) contributor meeting @ Tue Mar 7, 2017

2017-03-02 Thread Davor Bonaci
I'd prefer not to record the video; just to keep things informal. We'll,
however, keep the notes and share anything that may be relevant.

On Thu, Mar 2, 2017 at 2:24 PM, Amit Sela  wrote:

> I'll be there!
>
> On Thu, Mar 2, 2017 at 1:06 PM Aljoscha Krettek 
> wrote:
>
> > Shoot, I can't because I already have another meeting scheduled. Don't
> mind
> > me, though. Will you also maybe produce a video of the meeting?
> >
> > On Wed, 1 Mar 2017 at 21:50 Davor Bonaci  wrote:
> >
> > > Hi everyone,
> > > Based on the high demand [1], let's try to organize a virtual
> contributor
> > > meeting on Tuesday, March 7, 2017 at 15:00 UTC. For convenience,
> calendar
> > > link [2] and an .ics file are attached.
> > >
> > > I tried to accommodate as many time zones as possible, but I know it
> > might
> > > be hard for some of us at 7 AM on the US west coast or 11 PM in China.
> > > Sorry about that.
> > >
> > > Let's use Google Hangouts as the video conferencing technology. I think
> > we
> > > may be limited to something like 30 participants, so I'd encourage any
> > > co-located contributors to consider joining together (if appropriate).
> > > Joining the meeting should be straightforward -- please find the link
> > > within. No special requirements that I'm aware of.
> > >
> > > Just to re-state the expectations:
> > > * This is totally optional and informal.
> > > * It is simply a chance for everyone to meet others and see the faces
> of
> > > people we share a common passion with.
> > > * No specific agenda.
> > > * An open discussion on any topic of interest to the contributor
> > > community is welcome -- please feel free to bring up any topics you
> > > care about.
> > > * No formal discussions or decisions should be made.
> > > * We'll keep notes and share them on the mailing list shortly after the
> > > meeting.
> > >
> > > Thanks -- and hope to see all of you there!
> > >
> > > Davor
> > >
> > > [1]
> > >
> > > https://lists.apache.org/thread.html/baf057b81c5f6d4127abadac165d923a224d34438fe67b71d73743ad@%3Cdev.beam.apache.org%3E
> > > [2]
> > >
> > > https://calendar.google.com/calendar/event?action=TEMPLATE&tmeid=a3A2MzdhaWdhdjByNWRibzZrN2ZnOG1kMTAgZGF2b3JAZ29vZ2xlLmNvbQ&tmsrc=davor%40google.com
> > >
> >
>


Re: Release 0.6.0

2017-03-02 Thread Jean-Baptiste Onofré

Hi Kenn,

Fair enough. +1

Regards
JB

On 03/03/2017 12:28 AM, Kenneth Knowles wrote:

Hi all,

I've just filed https://issues.apache.org/jira/browse/BEAM-1611. It is
technically not a bug in Beam, but the easiest quick fix is to work around it
in the DataflowRunner, so I'd like to block the release on it. It should be
available ahead of the release's existing schedule, and can easily be
cherry-picked.

Kenn

On Wed, Mar 1, 2017 at 10:50 PM, Jean-Baptiste Onofré 
wrote:


Thanks Ahmet !

Regards
JB

On 03/02/2017 07:42 AM, Ahmet Altay wrote:


Sure, I can wait. To be clear, Thursday night in which time zone?

Thank you,
Ahmet

On Wed, Mar 1, 2017 at 10:38 PM, Jean-Baptiste Onofré 
wrote:

Hi Ahmet,


Can you wait up to Thursday night ? Trying to merge BEAM-649.

Thanks !
Regards
JB


On 03/01/2017 07:23 PM, Ahmet Altay wrote:

Thank you. I will start working on it.


Ahmet

On Wed, Mar 1, 2017 at 9:03 AM, Aljoscha Krettek 
wrote:

I just closed the last blocking issue, we should be good to go now.



Sorry again for the hold-up.

On Tue, 28 Feb 2017 at 18:38 Ahmet Altay 
wrote:

Thank you all. I will wait for release blocking issues to be closed.

Sergio, thank you for the information. I will document the friction
points
during this release process. Following the release we can start a
discussion about how to fix those.

Ahmet

On Tue, Feb 28, 2017 at 9:22 AM, Aljoscha Krettek 


wrote:

That was my mistake, sorry for that. I should have tagged [1] as a
blocker because leaking state is probably a bad idea. At least then people
would be aware and we could have discussed whether it is a blocker.

There is already an open PR for this now.

[1] https://issues.apache.org/jira/browse/BEAM-1517

On Tue, 28 Feb 2017 at 18:21 Jean-Baptiste Onofré  wrote:

Regarding BEAM-649, it's not a release blocker, it's a good to have.

As I'm pretty close to the end of the Pull Request (hopefully tonight or
tomorrow), it's a "Good To Have".

Regards
JB

On 02/28/2017 06:09 PM, Davor Bonaci wrote:

Can we please use JIRA to tag potentially release-blocking issues? Anyone
can just add a 'Fix Versions' field of an open issue to the next scheduled
release -- and it becomes easily visible to everyone in the project.

In general, I'm not a fan of blocking releases for new functionality.
Rushing new features and a lack of baking time usually translates to bugs.
However, I think this time it is totally justified -- on a separate thread
we plan for this to be the last release before the "first stable release";
and picking the new features now will provide additional coverage for it.

So, +1, but please tag in JIRA.

On Tue, Feb 28, 2017 at 2:09 AM, Aljoscha Krettek <aljos...@apache.org>
wrote:

I would like to finish these two:
https://issues.apache.org/jira/browse/BEAM-1036: Support for new State
API in FlinkRunner
https://issues.apache.org/jira/browse/BEAM-1116: Support for new Timer
API in Flink runner

Both of them are finished for the streaming runner, for the batch runner
I'm merging the code for the first right now and the second will not take
long.

There is also this: https://issues.apache.org/jira/browse/BEAM-1517 :
User state in the Flink Streaming Runner is not garbage collected. It's
not a regression from 0.5.0 where we simply didn't have this feature but
I'm still somewhat uneasy about this.

On Tue, 28 Feb 2017 at 09:44 Jean-Baptiste Onofré  wrote:

Fair enough.

I also try to merge https://github.com/apache/beam/pull/1739 asap.

Regards
JB

On 02/28/2017 09:34 AM, Amit Sela wrote:

I'd prefer we wait to merge https://github.com/apache/beam/pull/2050
Shouldn't take long now..

On Tue, Feb 28, 2017 at 10:00 AM Sergio Fernández <wik...@apache.org>
wrote:

Sounds good!

Ahmet, notice ASF has no current infrastructure to stage Python Release
Candidates. Anyway we left the Maven deploy lifecycle unmanaged for the
Python SDK, but it should be discussed at some point.

On Mon, Feb 27, 2017 at 11:01 PM, Ahmet Altay  wrote:

Hi all,

It's been about a month since the last release. I would like to propose
starting the next release. There are no release blocking bugs in JIRA
[1]. Are there any release blocking issues I am missing?

Unless there is an objection I will volunteer to manage this release.
This will be the first release with Python content. In case there are
issues with that it might be easier for me to resolve and document those
as part of the release process.

Thank you,
Ahmet

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%200.6.0%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

--
Sergio Fernández
Partner Technology Manager
Redlink Gmb