Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Alexey Romanenko
+1, sounds great!

Regards,
Alexey

> On 2 Feb 2018, at 07:14, Thomas Weise  wrote:
> 
> +1
> 
> 
> On Thu, Feb 1, 2018 at 9:07 PM, Jean-Baptiste Onofré  > wrote:
> +1
> 
> Regards
> JB
> 
> On 02/01/2018 07:54 PM, Kenneth Knowles wrote:
> > Hi all,
> >
> > Luke, Thomas, and I had some in-person discussions about the use of Java 8
> > futures and Guava futures in the portability support code. I wanted to 
> > bring our
> > thoughts to the dev list for feedback.
> >
> > As background:
> >
> >  - Java 5+ "Future" lacks the main purpose of future, which is async 
> > chaining.
> >  - Guava introduced ListenableFuture to do real future-oriented programming
> >  - Java 8 added CompletionStage which is more-or-less the expected interface
> >  
> > It is still debatable whether Java got it right [1]. But since it is
> > standardized, doesn't need to be shaded, etc, it is worth trying to just 
> > use it
> > carefully in the right ways. So we thought to propose that we migrate most 
> > uses
> > of Guava futures to Java 8 futures.
> >
> > What do you think? Have we missed an important problem that would make this 
> > a
> > deal-breaker?
> >
> > Kenn
> >
> > [1]
> > e.g. 
> > https://stackoverflow.com/questions/38744943/listenablefuture-vs-completablefuture#comment72041244_39250452
> >  
> > 
> > and such discussions are likely to occur whenever you bring it up with 
> > someone
> > who cares a lot about futures :-)
> 
> --
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net 
> Talend - http://www.talend.com 
> 



Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Thomas Weise
+1


On Thu, Feb 1, 2018 at 9:07 PM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Regards
> JB
>
> On 02/01/2018 07:54 PM, Kenneth Knowles wrote:
> > Hi all,
> >
> > Luke, Thomas, and I had some in-person discussions about the use of Java
> 8
> > futures and Guava futures in the portability support code. I wanted to
> bring our
> > thoughts to the dev list for feedback.
> >
> > As background:
> >
> >  - Java 5+ "Future" lacks the main purpose of future, which is async
> chaining.
> >  - Guava introduced ListenableFuture to do real future-oriented
> programming
> >  - Java 8 added CompletionStage which is more-or-less the expected
> interface
> >
> > It is still debatable whether Java got it right [1]. But since it is
> > standardized, doesn't need to be shaded, etc, it is worth trying to just
> use it
> > carefully in the right ways. So we thought to propose that we migrate
> most uses
> > of Guava futures to Java 8 futures.
> >
> > What do you think? Have we missed an important problem that would make
> this a
> > deal-breaker?
> >
> > Kenn
> >
> > [1]
> > e.g. https://stackoverflow.com/questions/38744943/listenablefuture-vs-
> completablefuture#comment72041244_39250452
> > and such discussions are likely to occur whenever you bring it up with
> someone
> > who cares a lot about futures :-)
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Jean-Baptiste Onofré
+1

Regards
JB

On 02/01/2018 07:54 PM, Kenneth Knowles wrote:
> Hi all,
> 
> Luke, Thomas, and I had some in-person discussions about the use of Java 8
> futures and Guava futures in the portability support code. I wanted to bring 
> our
> thoughts to the dev list for feedback.
> 
> As background:
> 
>  - Java 5+ "Future" lacks the main purpose of future, which is async chaining.
>  - Guava introduced ListenableFuture to do real future-oriented programming
>  - Java 8 added CompletionStage which is more-or-less the expected interface
>  
> It is still debatable whether Java got it right [1]. But since it is
> standardized, doesn't need to be shaded, etc, it is worth trying to just use 
> it
> carefully in the right ways. So we thought to propose that we migrate most 
> uses
> of Guava futures to Java 8 futures.
> 
> What do you think? Have we missed an important problem that would make this a
> deal-breaker?
> 
> Kenn
> 
> [1]
> e.g. 
> https://stackoverflow.com/questions/38744943/listenablefuture-vs-completablefuture#comment72041244_39250452
> and such discussions are likely to occur whenever you bring it up with someone
> who cares a lot about futures :-)

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-01 Thread Kenneth Knowles
Nice. This sounds like a great idea (or two?) and goes along with what I
just started for futures.

Guava: filed https://issues.apache.org/jira/browse/BEAM-3606 and assigned
to Ismaël :-) and converted my futures thing to a subtask.

Specific things for our micro guava:

 - checkArgumentNotNull can throw the right exception
 - our own Optional because Java's is not Serializable
 - futures combinators since many are missing, especially Java's don't do
exceptions right

Protobuf: didn't file an issue because I'm not sure

I was wondering if pre-shading works. We really need to get rid of it from
public APIs in a 100% reliable way. It is also a problem for Dataflow. I
was wondering if one approach is to pre-shade gcpio-protobuf-java,
gcpio-protobuf-java-util, gcpio-grpc-java, etc. Anything that needs to take
a Message object. (and do the same for beam-model-protobuf-java since the
model bits have to depend on each other but nothing else).

Kenn

On Thu, Feb 1, 2018 at 1:56 AM, Ismaël Mejía  wrote:

> Huge +1 to get rid of Guava!
>
> This solves annoying dependency issues for some IOs and allow us to
> get rid of the shading that makes current jars bigger than they should
> be.
>
> We can create our own 'micro guava' package with some classes for
> things that are hard to migrate, or that we  prefer to still have like
> the check* methods for example. Given the size of the task we should
> probably divide it into subtasks, more important is to get rid of it
> for 'sdks/java/core'. We can then attack other areas afterwards.
>
> Other important idea would be to get rid of Protobuf in public APIs
> like GCPIO and to better shade it from leaking into the runners. An
> unexpected side effect of this is a leak of netty via gRPC/protobuf
> that is byting us for the Spark runner, but well that's worth a
> different discussion.
>
>
> On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
>  wrote:
> > a map of list is fine and not a challenge we'll face long I hope ;)
> >
> >
> > Romain Manni-Bucau
> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn
> >
> > 2018-02-01 9:40 GMT+01:00 Reuven Lax :
> >>
> >> Not sure we'll be able to replace them all. Things like guava Table and
> >> Multimap don't have great replacements in Java8.
> >>
> >> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré  >
> >> wrote:
> >>>
> >>> +1, it was on my TODO for a while waiting the Java8 update.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
> >>> > Why not dropping guava for all beam codebase? With java 8 it is quite
> >>> > easy to do
> >>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
> >>> > good result.
> >>> >
> >>> > Le 1 févr. 2018 06:50, "Lukasz Cwik"  >>> > > a écrit :
> >>> >
> >>> > Make sure to include the guava version in the artifact name so
> that
> >>> > we can
> >>> > have multiple vendored versions.
> >>> >
> >>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  >>> > > wrote:
> >>> >
> >>> > I didn't have time for this, but it just bit me. We
> definitely
> >>> > have
> >>> > Guava on the API surface of runner support code in ways that
> >>> > get
> >>> > incompatibly shaded. I will probably start "1a" by making a
> >>> > shaded
> >>> > library org.apache.beam:vendored-guava and starting to use
> it.
> >>> > It sounds
> >>> > like there is generally unanimous support for that much,
> >>> > anyhow.
> >>> >
> >>> > Kenn
> >>> >
> >>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
> >>> >  >>> > > wrote:
> >>> >
> >>> > Thanks Ismaël for bringing up this discussion again!
> >>> >
> >>> > I would be in favour of 1) and more specifically of 1a)
> >>> >
> >>> > Aljoscha
> >>> >
> >>> >
> >>> >> On 12. Dec 2017, at 18:56, Lukasz Cwik <
> lc...@google.com
> >>> >> > wrote:
> >>> >>
> >>> >> You can always run tests on post shaded artifacts
> instead
> >>> >> of the
> >>> >> compiled classes, it just requires us to change our
> maven
> >>> >> surefire
> >>> >> / gradle test configurations but it is true that most
> IDEs
> >>> >> would
> >>> >> behave better with a dependency jar unless you delegate
> >>> >> all the
> >>> >> build/test actions to the build system and then it won't
> >>> >> matter.
> >>> >>
> >>> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
> >>> >>  >>> >> > wrote:
> >>> >>
> >>> >> There's also, with additional overhead,
> >>> >>
> >>> >> 1a) A relocated and shipped package for each thing
> we
> 

Re: Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-01 Thread Kenneth Knowles
+1 for removing apply_*

For the Java SDK, removing specialized intercepts was an important first
step towards the portability framework. I wonder if there is a way for the
Python SDK to leapfrog, taking advantage of some of the lessons that Java
learned a bit more painfully. Most pertinent I think is that if an SDK's
role is to construct a pipeline and ship the proto to a runner (service)
then overrides apply to a post-deserialization pipeline. The Java
DirectRunner does a proto round-trip to avoid accidentally depending on
things that are not really part of the pipeline. I would this crisp
abstraction enforcement would add even more value to Python.

Kenn

On Thu, Feb 1, 2018 at 5:21 PM, Charles Chen  wrote:

> In the Python DirectRunner, we currently use apply_* overrides to override
> the operation of the default .expand() operation for certain transforms.
> For example, GroupByKey has a special implementation in the DirectRunner,
> so we use an apply_* override hook to replace the implementation of
> GroupByKey.expand().
>
> However, this strategy has drawbacks. Because this override operation
> happens eagerly during graph construction, the pipeline graph is
> specialized and modified before a specific runner is bound to the
> pipeline's execution. This makes the pipeline graph non-portable and blocks
> full migration to using the Runner API pipeline representation in the
> DirectRunner.
>
> By contrast, the SDK's PTransformOverride mechanism allows the expression
> of matchers that operate on the unspecialized graph, replacing PTransforms
> as necessary to produce a DirectRunner-specialized pipeline graph for
> execution.
>
> I therefore propose to replace these eager apply_* overrides with
> PTransformOverrides that operate on the completely constructed graph.
>
> The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and
> I've prepared a candidate patch at https://github.com/apache/
> incubator-beam/pull/4529.
>
> Best,
> Charles
>


Replacing Python DirectRunner apply_* hooks with PTransformOverrides

2018-02-01 Thread Charles Chen
In the Python DirectRunner, we currently use apply_* overrides to override
the operation of the default .expand() operation for certain transforms.
For example, GroupByKey has a special implementation in the DirectRunner,
so we use an apply_* override hook to replace the implementation of
GroupByKey.expand().

However, this strategy has drawbacks. Because this override operation
happens eagerly during graph construction, the pipeline graph is
specialized and modified before a specific runner is bound to the
pipeline's execution. This makes the pipeline graph non-portable and blocks
full migration to using the Runner API pipeline representation in the
DirectRunner.

By contrast, the SDK's PTransformOverride mechanism allows the expression
of matchers that operate on the unspecialized graph, replacing PTransforms
as necessary to produce a DirectRunner-specialized pipeline graph for
execution.

I therefore propose to replace these eager apply_* overrides with
PTransformOverrides that operate on the completely constructed graph.

The JIRA issue is https://issues.apache.org/jira/browse/BEAM-3566, and I've
prepared a candidate patch at
https://github.com/apache/incubator-beam/pull/4529.

Best,
Charles


Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Robert Bradshaw
You could add a step to delete all of dest before a barrier and the
step that does the rename as outlined. In that case, any dest file
that exists must be good.

On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov  wrote:
> I think this is still unsafe in case exists(dst) (e.g. this is a re-run of a
> pipeline) but src is missing due to some bad reason. However it's probably
> better than what we have (e.g. we currently certainly don't perform checksum
> checks).
>
> On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:
>>
>> For GCS, I would do what I believe we already do.
>> rename(src, dst):
>> - if !exists(src) and exists(dst) return 0
>> - if !exists(src) and !exists(dst) return error
>> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
>> return 0 else delete(dst) }
>> - Start a GCS copy from src to dst.
>> - Wait for GCS copy to complete.
>> - delete(src)
>>
>> For filesystems that don't have checksum() metadata, size() can be used
>> instead.
>>
>> I've opened a bug to track this:
>> https://issues.apache.org/jira/browse/BEAM-3600
>>
>> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov 
>> wrote:
>>>
>>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore files
>>> that are missing for more ominous reasons than just this being a non-first
>>> attempt at renaming src to dst. E.g. if there was a bug in constructing the
>>> filename to be renamed, or if we somehow messed up the order of rename vs
>>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead to
>>> silent data loss (likely caught by unit tests though - so this is not a
>>> super serious issue).
>>>
>>> Basically I just can't think of a case when I was copying files and
>>> thinking "oh man, I wish it didn't give an error if the stuff I'm copying
>>> doesn't exist" - the option exists only because we couldn't come up with
>>> another way to implement idempotent rename on GCS.
>>>
>>> What's your idea of how a safe retryable GCS rename() could work?
>>>
>>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:

 Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
 unsafe because it will skip (src, dst) pairs where neither exist? (it only
 looks if src exists)

 For GCS, we can construct a safe retryable rename() operation, assuming
 that copy() and delete() are atomic for a single file or pair of files.



 On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi  wrote:
>
> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov
>  wrote:
>>
>> As far as I know, the current implementation of file sinks is the only
>> reason why the flag IGNORE_MISSING for copying even exists - there's no
>> other compelling reason to justify it. We implement "rename" as "copy, 
>> then
>> delete" (in a single DoFn), so for idempodency of this operation we need 
>> to
>> ignore the copying of a non-existent file.
>>
>> I think the right way to go would be to change the implementation of
>> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
>> it's made of 2 individually idempotent operations:
>> 1) copy, which would fail if input is missing, and would overwrite
>> output if it exists
>> -- reshuffle --
>> 2) delete, which would not fail if input is missing.
>
>
> Something like this is needed only in streaming, right?
>
> Raghu.
>
>>
>> That way first everything is copied (possibly via multiple attempts),
>> and then old files are deleted (possibly via multiple attempts).
>>
>> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>>>
>>> I agree that overwriting is more in line with user expectations.
>>> I believe that the sink should not ignore errors from the filesystem
>>> layer. Instead, the FileSystem API should be more well defined.
>>> Examples: rename() and copy() should overwrite existing files at the
>>> destination, copy() should have an ignore_missing flag.
>>>
>>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
>>> wrote:

 Original mail mentions that output from second run of word_count is
 ignored. That does not seem as safe as ignoring error from a second 
 attempt
 of a step. How do we know second run didn't run on different output?
 Overwriting seems more accurate than ignoring. Does handling this 
 error at
 sink level distinguish between the two (another run vs second attempt)?


 On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri 
 wrote:
>
> Yeah, another round of refactoring is due to move the rename via
> copy+delete logic up to the file-based sink level.
>
>
> On Wed, 

RFC: Better event time and source watermark in KafkaIO

2018-02-01 Thread Raghu Angadi
Kafka supports server-side and client-side timestamps since version 0.10.1.
KafkaIO in Beam can provide much better watermark, especially for topics
with server-side timestamps. The default implementation currently just uses
processing time for event time and watermark, which is not very useful.

Wrote a short doc
[1]
about the proposal. Your feedback is welcome. I am planning to work on it,
and don't mind guiding if anyone else is interested (it is fairly
accessible for newcomers) .

TL;DR :
   *server-side timestamp* : It monotonically increases within a Kafka
partition. We can provide near perfect watermark : min(timestamp of latest
record consumed on a partition).
*client-side / custom timestamp* : Watermark is min(timestamp over last
few seconds) similar to PubsubIO in Beam. This is not great, but we will
let user provide tighter bounds or provide entirely own implementation.

Thanks,
Raghu.

[1]
https://docs.google.com/document/d/1DyWcLJpALRoUfvYUbiPCDVikYb_Xz2X7Co2aDUVVd4I/edit?usp=sharing


Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Eugene Kirpichov
I think this is still unsafe in case exists(dst) (e.g. this is a re-run of
a pipeline) but src is missing due to some bad reason. However it's
probably better than what we have (e.g. we currently certainly don't
perform checksum checks).

On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri  wrote:

> For GCS, I would do what I believe we already do.
> rename(src, dst):
> - if !exists(src) and exists(dst) return 0
> - if !exists(src) and !exists(dst) return error
> - if exists(src) and exists(dst) { if checksum(src) == checksum(dst)
> return 0 else delete(dst) }
> - Start a GCS copy from src to dst.
> - Wait for GCS copy to complete.
> - delete(src)
>
> For filesystems that don't have checksum() metadata, size() can be used
> instead.
>
> I've opened a bug to track this:
> https://issues.apache.org/jira/browse/BEAM-3600
>
> On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov 
> wrote:
>
>> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore files
>> that are missing for more ominous reasons than just this being a non-first
>> attempt at renaming src to dst. E.g. if there was a bug in constructing the
>> filename to be renamed, or if we somehow messed up the order of rename vs
>> cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead to
>> silent data loss (likely caught by unit tests though - so this is not a
>> super serious issue).
>>
>> Basically I just can't think of a case when I was copying files and
>> thinking "oh man, I wish it didn't give an error if the stuff I'm copying
>> doesn't exist" - the option exists only because we couldn't come up with
>> another way to implement idempotent rename on GCS.
>>
>> What's your idea of how a safe retryable GCS rename() could work?
>>
>> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
>>
>>> Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
>>> unsafe because it will skip (src, dst) pairs where neither exist? (it only
>>> looks if src exists)
>>>
>>> For GCS, we can construct a safe retryable rename() operation, assuming
>>> that copy() and delete() are atomic for a single file or pair of files.
>>>
>>>
>>>
>>> On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi  wrote:
>>>
 On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov  wrote:

> As far as I know, the current implementation of file sinks is the only
> reason why the flag IGNORE_MISSING for copying even exists - there's no
> other compelling reason to justify it. We implement "rename" as "copy, 
> then
> delete" (in a single DoFn), so for idempodency of this operation we need 
> to
> ignore the copying of a non-existent file.
>
> I think the right way to go would be to change the implementation of
> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
> it's made of 2 individually idempotent operations:
> 1) copy, which would fail if input is missing, and would overwrite
> output if it exists
> -- reshuffle --
> 2) delete, which would not fail if input is missing.
>

 Something like this is needed only in streaming, right?

 Raghu.


> That way first everything is copied (possibly via multiple attempts),
> and then old files are deleted (possibly via multiple attempts).
>
> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>
>> I agree that overwriting is more in line with user expectations.
>> I believe that the sink should not ignore errors from the filesystem
>> layer. Instead, the FileSystem API should be more well defined.
>> Examples: rename() and copy() should overwrite existing files at the
>> destination, copy() should have an ignore_missing flag.
>>
>> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
>> wrote:
>>
>>> Original mail mentions that output from second run of word_count is
>>> ignored. That does not seem as safe as ignoring error from a second 
>>> attempt
>>> of a step. How do we know second run didn't run on different output?
>>> Overwriting seems more accurate than ignoring. Does handling this error 
>>> at
>>> sink level distinguish between the two (another run vs second attempt)?
>>>
>>>
>>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri 
>>> wrote:
>>>
 Yeah, another round of refactoring is due to move the rename via
 copy+delete logic up to the file-based sink level.

 On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath <
 chamik...@google.com> wrote:

> Good point. There's always the chance of step that performs final
> rename being retried. So we'll have to ignore this error at the sink 
> level.
> We don't necessarily have to do this at the FileSystem level though. I
> think the proper 

Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Udi Meiri
For GCS, I would do what I believe we already do.
rename(src, dst):
- if !exists(src) and exists(dst) return 0
- if !exists(src) and !exists(dst) return error
- if exists(src) and exists(dst) { if checksum(src) == checksum(dst) return
0 else delete(dst) }
- Start a GCS copy from src to dst.
- Wait for GCS copy to complete.
- delete(src)

For filesystems that don't have checksum() metadata, size() can be used
instead.

I've opened a bug to track this:
https://issues.apache.org/jira/browse/BEAM-3600

On Thu, Feb 1, 2018 at 2:25 PM Eugene Kirpichov 
wrote:

> Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore files
> that are missing for more ominous reasons than just this being a non-first
> attempt at renaming src to dst. E.g. if there was a bug in constructing the
> filename to be renamed, or if we somehow messed up the order of rename vs
> cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead to
> silent data loss (likely caught by unit tests though - so this is not a
> super serious issue).
>
> Basically I just can't think of a case when I was copying files and
> thinking "oh man, I wish it didn't give an error if the stuff I'm copying
> doesn't exist" - the option exists only because we couldn't come up with
> another way to implement idempotent rename on GCS.
>
> What's your idea of how a safe retryable GCS rename() could work?
>
> On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:
>
>> Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
>> unsafe because it will skip (src, dst) pairs where neither exist? (it only
>> looks if src exists)
>>
>> For GCS, we can construct a safe retryable rename() operation, assuming
>> that copy() and delete() are atomic for a single file or pair of files.
>>
>>
>>
>> On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi  wrote:
>>
>>> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov 
>>> wrote:
>>>
 As far as I know, the current implementation of file sinks is the only
 reason why the flag IGNORE_MISSING for copying even exists - there's no
 other compelling reason to justify it. We implement "rename" as "copy, then
 delete" (in a single DoFn), so for idempodency of this operation we need to
 ignore the copying of a non-existent file.

 I think the right way to go would be to change the implementation of
 renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
 it's made of 2 individually idempotent operations:
 1) copy, which would fail if input is missing, and would overwrite
 output if it exists
 -- reshuffle --
 2) delete, which would not fail if input is missing.

>>>
>>> Something like this is needed only in streaming, right?
>>>
>>> Raghu.
>>>
>>>
 That way first everything is copied (possibly via multiple attempts),
 and then old files are deleted (possibly via multiple attempts).

 On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:

> I agree that overwriting is more in line with user expectations.
> I believe that the sink should not ignore errors from the filesystem
> layer. Instead, the FileSystem API should be more well defined.
> Examples: rename() and copy() should overwrite existing files at the
> destination, copy() should have an ignore_missing flag.
>
> On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
> wrote:
>
>> Original mail mentions that output from second run of word_count is
>> ignored. That does not seem as safe as ignoring error from a second 
>> attempt
>> of a step. How do we know second run didn't run on different output?
>> Overwriting seems more accurate than ignoring. Does handling this error 
>> at
>> sink level distinguish between the two (another run vs second attempt)?
>>
>>
>> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
>>
>>> Yeah, another round of refactoring is due to move the rename via
>>> copy+delete logic up to the file-based sink level.
>>>
>>> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
>>> wrote:
>>>
 Good point. There's always the chance of step that performs final
 rename being retried. So we'll have to ignore this error at the sink 
 level.
 We don't necessarily have to do this at the FileSystem level though. I
 think the proper behavior might be to raise an error for the rename at 
 the
 FileSystem level if the destination already exists (or source doesn't
 exist) while ignoring that error (and possibly logging a warning) at 
 the
 sink level.

 - Cham


 On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax 
 wrote:

> I think the idea was to ignore "already exists" 

Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Eugene Kirpichov
Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore files
that are missing for more ominous reasons than just this being a non-first
attempt at renaming src to dst. E.g. if there was a bug in constructing the
filename to be renamed, or if we somehow messed up the order of rename vs
cleanup, etc. - these situations with IGNORE_MISSING_FILES would lead to
silent data loss (likely caught by unit tests though - so this is not a
super serious issue).

Basically I just can't think of a case when I was copying files and
thinking "oh man, I wish it didn't give an error if the stuff I'm copying
doesn't exist" - the option exists only because we couldn't come up with
another way to implement idempotent rename on GCS.

What's your idea of how a safe retryable GCS rename() could work?

On Wed, Jan 31, 2018 at 6:45 PM Udi Meiri  wrote:

> Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is
> unsafe because it will skip (src, dst) pairs where neither exist? (it only
> looks if src exists)
>
> For GCS, we can construct a safe retryable rename() operation, assuming
> that copy() and delete() are atomic for a single file or pair of files.
>
>
>
> On Wed, Jan 31, 2018 at 4:00 PM Raghu Angadi  wrote:
>
>> On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov 
>> wrote:
>>
>>> As far as I know, the current implementation of file sinks is the only
>>> reason why the flag IGNORE_MISSING for copying even exists - there's no
>>> other compelling reason to justify it. We implement "rename" as "copy, then
>>> delete" (in a single DoFn), so for idempodency of this operation we need to
>>> ignore the copying of a non-existent file.
>>>
>>> I think the right way to go would be to change the implementation of
>>> renaming to have a @RequiresStableInput (or reshuffle) in the middle, so
>>> it's made of 2 individually idempotent operations:
>>> 1) copy, which would fail if input is missing, and would overwrite
>>> output if it exists
>>> -- reshuffle --
>>> 2) delete, which would not fail if input is missing.
>>>
>>
>> Something like this is needed only in streaming, right?
>>
>> Raghu.
>>
>>
>>> That way first everything is copied (possibly via multiple attempts),
>>> and then old files are deleted (possibly via multiple attempts).
>>>
>>> On Wed, Jan 31, 2018 at 2:26 PM Udi Meiri  wrote:
>>>
 I agree that overwriting is more in line with user expectations.
 I believe that the sink should not ignore errors from the filesystem
 layer. Instead, the FileSystem API should be more well defined.
 Examples: rename() and copy() should overwrite existing files at the
 destination, copy() should have an ignore_missing flag.

 On Wed, Jan 31, 2018 at 1:49 PM Raghu Angadi 
 wrote:

> Original mail mentions that output from second run of word_count is
> ignored. That does not seem as safe as ignoring error from a second 
> attempt
> of a step. How do we know second run didn't run on different output?
> Overwriting seems more accurate than ignoring. Does handling this error at
> sink level distinguish between the two (another run vs second attempt)?
>
>
> On Wed, Jan 31, 2018 at 12:32 PM, Udi Meiri  wrote:
>
>> Yeah, another round of refactoring is due to move the rename via
>> copy+delete logic up to the file-based sink level.
>>
>> On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath 
>> wrote:
>>
>>> Good point. There's always the chance of step that performs final
>>> rename being retried. So we'll have to ignore this error at the sink 
>>> level.
>>> We don't necessarily have to do this at the FileSystem level though. I
>>> think the proper behavior might be to raise an error for the rename at 
>>> the
>>> FileSystem level if the destination already exists (or source doesn't
>>> exist) while ignoring that error (and possibly logging a warning) at the
>>> sink level.
>>>
>>> - Cham
>>>
>>>
>>> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax  wrote:
>>>
 I think the idea was to ignore "already exists" errors. The reason
 being that any step in Beam can be executed multiple times, including 
 the
 rename step. If the rename step gets run twice, the second run should
 succeed vacuously.


 On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri 
 wrote:

> Hi,
> I've been working on HDFS code for the Python SDK and I've noticed
> some behaviors which are surprising. I wanted to know if these 
> behaviors
> are known and intended.
>
> 1. When renaming files during finalize_write, rename errors are
> ignored
> 

Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Ismaël Mejía
+1

On Thu, Feb 1, 2018 at 9:53 PM, Romain Manni-Bucau
 wrote:
> +1 indeed
>
> Le 1 févr. 2018 21:34, "Eugene Kirpichov"  a écrit :
>>
>> Reducing dependency on Guava in favor of something Java-standard sounds
>> great, +1.
>>
>> On Thu, Feb 1, 2018 at 11:53 AM Reuven Lax  wrote:
>>>
>>> Unless there's something that doesn't work in Java 8 future, +1 to
>>> migrating.
>>>
>>> On Thu, Feb 1, 2018 at 10:54 AM, Kenneth Knowles  wrote:

 Hi all,

 Luke, Thomas, and I had some in-person discussions about the use of Java
 8 futures and Guava futures in the portability support code. I wanted to
 bring our thoughts to the dev list for feedback.

 As background:

  - Java 5+ "Future" lacks the main purpose of future, which is async
 chaining.
  - Guava introduced ListenableFuture to do real future-oriented
 programming
  - Java 8 added CompletionStage which is more-or-less the expected
 interface

 It is still debatable whether Java got it right [1]. But since it is
 standardized, doesn't need to be shaded, etc, it is worth trying to just 
 use
 it carefully in the right ways. So we thought to propose that we migrate
 most uses of Guava futures to Java 8 futures.

 What do you think? Have we missed an important problem that would make
 this a deal-breaker?

 Kenn

 [1] e.g.
 https://stackoverflow.com/questions/38744943/listenablefuture-vs-completablefuture#comment72041244_39250452
 and such discussions are likely to occur whenever you bring it up with
 someone who cares a lot about futures :-)
>>>
>>>
>


Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Romain Manni-Bucau
+1 indeed

Le 1 févr. 2018 21:34, "Eugene Kirpichov"  a écrit :

> Reducing dependency on Guava in favor of something Java-standard sounds
> great, +1.
>
> On Thu, Feb 1, 2018 at 11:53 AM Reuven Lax  wrote:
>
>> Unless there's something that doesn't work in Java 8 future, +1 to
>> migrating.
>>
>> On Thu, Feb 1, 2018 at 10:54 AM, Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Luke, Thomas, and I had some in-person discussions about the use of Java
>>> 8 futures and Guava futures in the portability support code. I wanted to
>>> bring our thoughts to the dev list for feedback.
>>>
>>> As background:
>>>
>>>  - Java 5+ "Future" lacks the main purpose of future, which is async
>>> chaining.
>>>  - Guava introduced ListenableFuture to do real future-oriented
>>> programming
>>>  - Java 8 added CompletionStage which is more-or-less the expected
>>> interface
>>>
>>> It is still debatable whether Java got it right [1]. But since it is
>>> standardized, doesn't need to be shaded, etc, it is worth trying to just
>>> use it carefully in the right ways. So we thought to propose that we
>>> migrate most uses of Guava futures to Java 8 futures.
>>>
>>> What do you think? Have we missed an important problem that would make
>>> this a deal-breaker?
>>>
>>> Kenn
>>>
>>> [1] e.g. https://stackoverflow.com/questions/38744943/
>>> listenablefuture-vs-completablefuture#comment72041244_39250452 and such
>>> discussions are likely to occur whenever you bring it up with someone who
>>> cares a lot about futures :-)
>>>
>>
>>


Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Reuven Lax
Unless there's something that doesn't work in Java 8 future, +1 to
migrating.

On Thu, Feb 1, 2018 at 10:54 AM, Kenneth Knowles  wrote:

> Hi all,
>
> Luke, Thomas, and I had some in-person discussions about the use of Java 8
> futures and Guava futures in the portability support code. I wanted to
> bring our thoughts to the dev list for feedback.
>
> As background:
>
>  - Java 5+ "Future" lacks the main purpose of future, which is async
> chaining.
>  - Guava introduced ListenableFuture to do real future-oriented programming
>  - Java 8 added CompletionStage which is more-or-less the expected
> interface
>
> It is still debatable whether Java got it right [1]. But since it is
> standardized, doesn't need to be shaded, etc, it is worth trying to just
> use it carefully in the right ways. So we thought to propose that we
> migrate most uses of Guava futures to Java 8 futures.
>
> What do you think? Have we missed an important problem that would make
> this a deal-breaker?
>
> Kenn
>
> [1] e.g. https://stackoverflow.com/questions/38744943/listenablefuture-vs-
> completablefuture#comment72041244_39250452 and such discussions are
> likely to occur whenever you bring it up with someone who cares a lot about
> futures :-)
>


[PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-01 Thread Kenneth Knowles
Hi all,

Luke, Thomas, and I had some in-person discussions about the use of Java 8
futures and Guava futures in the portability support code. I wanted to
bring our thoughts to the dev list for feedback.

As background:

 - Java 5+ "Future" lacks the main purpose of future, which is async
chaining.
 - Guava introduced ListenableFuture to do real future-oriented programming
 - Java 8 added CompletionStage which is more-or-less the expected interface

It is still debatable whether Java got it right [1]. But since it is
standardized, doesn't need to be shaded, etc, it is worth trying to just
use it carefully in the right ways. So we thought to propose that we
migrate most uses of Guava futures to Java 8 futures.

What do you think? Have we missed an important problem that would make this
a deal-breaker?

Kenn

[1] e.g.
https://stackoverflow.com/questions/38744943/listenablefuture-vs-completablefuture#comment72041244_39250452
and such discussions are likely to occur whenever you bring it up with
someone who cares a lot about futures :-)


Re: Can Window PTransform drop tuples that violate allowed lateness?

2018-02-01 Thread Shen Li
Hi Kenn,

Thanks for the response.

Semantically, it is not causing any problem. I just want to make sure if
certain optimizations are valid. We want to drop an expired key and its
associated states when the highest boundary of all its windows falls behind
(watermark - allowedLateness). In the beginning, I am concerned about
states of the expired key could become in-time again when a downstream
Window transform re-assigns windows, which prevents non-aggregating
operators to discard any states. It could become costly when the
application sees an evolving set of keys.

Best,
Shen

On Wed, Jan 31, 2018 at 11:19 PM, Kenneth Knowles  wrote:

> On Mon, Jan 22, 2018 at 11:42 AM, Shen Li  wrote:
>
>> Hi Kenn,
>>
>> Thanks for the explanation.
>>
>> > So now elements are droppable if they belong to an expired window.
>>
>> Say I have two consecutive window transforms with FixedWindows WindowFn
>> (just an example, most likely won't appear in real pipeline). The first
>> windowFn says the element belongs to an expired window. But according to
>> the second windowFn, the element's window is not yet expired. In this case,
>> can the first Window transform drop the element?
>>
>
> Yes, it is permitted to drop the expired data at any point. The reason I
> think this is OK is that the runner also completely controls the watermark.
> So there is arbitrary runner-owned behavior in terms of dropping either
> way. It hasn't come up, since windows are hardly useful until you have an
> aggregation, where they provide the notion of completeness. Do you have an
> example in mind where this gets weird?
>
> Kenn
>
>
>
>
>> Best,
>> Shen
>>
>> On Mon, Jan 22, 2018 at 2:07 PM, Kenneth Knowles  wrote:
>>
>>> Hi Shen,
>>>
>>> This is a documentation issue. The Beam model switched from dropping
>>> individual elements to expiring windows. So now elements are droppable if
>>> they belong to an expired window. This works a little better with the
>>> purpose of windowing and allowed lateness: to say when an aggregation is
>>> "complete". Any element that manages to make it to an aggregation before
>>> the accumulator is expired is allowed to be included now and only after the
>>> whole window expires we drop any further incoming elements for that window.
>>>
>>> Kenn
>>>
>>> On Mon, Jan 22, 2018 at 10:52 AM, Shen Li  wrote:
>>>
 Hi,

 The Window#withAllowedLateness(Duration) doc says "Any elements that
 are later than this as decided by the system-maintained watermark will be
 dropped". Can the runner safely discard a tuple that violates the allowed
 lateness in the Window operator? Or does it have to drop it in the
 downstream GBK operator just in case that there could be another Window
 transform in between overriding the allowed lateness (or other
 configurations)?

 Thanks,
 Shen

>>>
>>>
>>
>


Re: [DISCUSS] State of the project

2018-02-01 Thread Etienne Chauchot



Le 31/01/2018 à 15:50, Kenneth Knowles a écrit :



On Wed, Jan 31, 2018 at 3:40 AM, Etienne Chauchot 
> wrote:


Thanks Kenn and Luke for your comments.

WDYT about my proposition (bellow) to add methods to the runner
api to enhance the coherence between the runners?

If I understand your point, I think I agree but maybe with even 
stronger opinions. This is somewhat related to my comment on the Ben 
thread about "runner == service". The methods in any programming 
language SDK should not be considered "the runner API" any more. A 
Beam runner should be expected to be a service hosting Beam's job 
management proto APIs. So there's already more methods than run() but 
we don't have any runners :-)

Yes +1 portability architecture helps a lot on that topic


WDYT about my other proposition (bellow) of trying to avoid having
validates runner tests that are specific to a runner like we have now?

Yes, I think ValidatesRunner tests should be independent of any 
runner. Is this just a convenient use of the Java annotation when it 
probably should just be a static call to PipelineOptions.setRunner? 
IMO that is just a bug.
Actually I was referring to that:  I noticed, for example, for spark, 
that the validates runner tests annotated @Category(StreamingTest.class) 
were specific to that runner. I guess this is because they need to use 
CreateStream to trigger streaming mode. IMHO it could be interesting to 
move these ones (and other existing ones in other runners) to their 
equivalent common tests. I don't know spark runner enough to see if it 
is feasible though.


If you consider the above point along with this, ValidatesRunner tests 
should be launched as jobs against a runner service. This matches 
things like SQL compliance tests.


A build starting from change to SDK X might build like this:

  (build) SDK X
  (build) DirectRunner X
  (build) ValidatesRunner suite X
  (test) SDK X (submit the NeedsRunner tests to DR X)
  (test) DR X (submit the VR suite X)
  (test) other runners, submit the VR suite X

The way this is organized into modules today is more history and 
convenience. You could easily imagine new ways to express this build 
in Maven or Gradle. Portability makes it clearer!



+1 for sure but it is more a long term shot, I was thinking more short term.
Thanks for your comments

Kenn


Thanks,

Etienne


Le 26/01/2018 à 21:34, Kenneth Knowles a écrit :

I also think that at a high level the success of Beam as a
project/community and as a piece of software depends on having
multiple viable runners with healthy set of users and
contributors. The pieces that are missing to me:

*User-focused comparison of runners (and IOs)*
+1 to Jesse's. Automated capability tests don't really help this.
Benchmarks will be part of the story but are worth very little on
their own. Focusing on these is just choosing to measure things
that are easy to measure instead of addressing what is important,
which is in the end almost always qualitative.

*Automated integration tests on clusters*
We do need to know that runners and IOs "work" in a basic yes/no
manner on every commit/release, beyond unit tests. I am not
really willing to strongly claim to a potential user that
something "works" without this level of automation.

*More uniform operational experiences*
Setting up your Spark/Flink/Apex deployment should be different.
Launching a Beam pipeline on it should not be.

*Portability: Any SDK on any runner*
We have now one SDK on master and one SDK on a dev branch that
both support portable execution somewhat. Unfortunately we have
no major open source runner that supports portability*. "Java on
any runner" is not compelling enough any more, if it ever was.



Reviews: I agree our response latency is too slow. I do not agree
that our quality bar is too high; I think we should raise it
*significantly*. Our codebase fails tests for long periods. Our
tests need to be green enough that we are comfortable blocking
merges *even for unrelated failures*. We should be able to cut a
release any time, modulo known blocker-level bugs.

Runner dev: I think Etienne's point about making it more uniform
to add features to all runners actually is quite important, since
the portability framework is a lot harder than "translate a Beam
ParDo to XYZ's FlatMap" where they are both Java. And even the
support code we've been building is not obvious to use and
probably won't be for the foreseeable future. This fits well into
the "Ben thread" on technical ideas so I'll comment there.

Kenn

*We do have a local batch-only portable runner in Python

On Fri, Jan 26, 2018 at 10:09 AM, Lukasz Cwik > wrote:

Etienne, for the cross runner coherence, 

[CANCEL][VOTE] Release 2.3.0, release candidate #1

2018-02-01 Thread Jean-Baptiste Onofré
Hi guys,

Especially due to BEAM-3587 & BEAM-3186 regressions, I cancel RC1.

We will cherry-pick fixes on release-2.3.0 branch.

I'm updating Jira right now. When the fixes will be cherry-picked, I will submit
a RC2 to vote.

Thanks !
Regards
JB

On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #1 for the version 2.3.0, as
> follows:
> 
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.3.0-RC1" [5],
> * website pull request listing the release and publishing the API reference
> manual [6].
> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> 
> The vote will be open for at least 72 hours. It is adopted by majority 
> approval,
> with at least 3 PMC affirmative votes.
> 
> Thanks,
> JB
> 
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
> [6] https://github.com/apache/beam-site/pull/381
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 2.3.0, release candidate #1

2018-02-01 Thread Jean-Baptiste Onofré
Fully agree. Good catch.

I cancel RC1 to prepare a RC2 including at least this fix.

Thanks !
Regards
JB

On 02/01/2018 04:11 PM, Aljoscha Krettek wrote:
> -1
> 
> I think the issue discovered with unbounded sources on Flink Streaming Runner 
> is a serious regression. Good news is that there is already a fix for that: 
> https://github.com/apache/beam/pull/4558/files
> 
> And BEAM-3587 also seems serious enough, IMHO.
> 
> Btw, BEAM-3186, which seems quite serious, was also finally figured out.
> 
> Best,
> Aljoscha
> 
>> On 1. Feb 2018, at 16:00, Jean-Baptiste Onofré  wrote:
>>
>> So, are you casting -1 vote ?
>>
>> I guess so.
>>
>> Regards
>> JB
>>
>> On 02/01/2018 03:53 PM, Ismaël Mejía wrote:
>>> Hi,
>>>
>>> I started to test the release with Nexmark and found three issues
>>> (from minor to more important):
>>>
>>> 1. Small issues to run Nexmark with the release (BEAM-3531 fixed,
>>> BEAM-3592 in PR):
>>>
>>> BEAM-3531 Nexmark failed with NPE with DEFAULT suite
>>> BEAM-3592 Spark-runner profile is broken on Nexmark after move to Spark 2.x
>>>
>>> 2. Flink is broken with Unbounded sources, big enough to deserve a new
>>> vote. Pred:
>>> BEAM-3589 Flink runner breaks with ClassCastException on UnboundedSource
>>>
>>> 3. Direct runner has a relatively big performance regression when
>>> dealing with UnboundedSources. in particular the impact on query 7 of
>>> Nexmark is considerable.
>>>
>>> Just with the small SMOKE suite in my machine I get:
>>>
>>> 
>>> Beam 2.2.0   Beam 2.3.0
>>>  Query  Runtime(sec) Runtime(sec)
>>> 
>>>     6.410.6
>>>  0001   5.110.2
>>>  0002   3.0 5.8
>>>  0003   3.8 6.2
>>>  0004   0.9 1.4
>>>  0005   5.811.4
>>>  0006   0.8 1.4
>>>  0007 193.8  1249.1
>>>  0008   3.9 6.9
>>>  0009   0.9 1.3
>>>  0010   6.4 8.2
>>>  0011   5.0 9.4
>>>  0012   4.7 9.1
>>>
>>> This can be reproduced by running this command:
>>>
>>> mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main
>>> -Pdirect-runner -Dexec.args="--runner=DirectRunner --suite=SMOKE
>>> --streaming=true --manageResources=false --monitorJobs=true
>>> --enforceEncodability=true --enforceImmutability=true" -pl
>>> 'sdks/java/nexmark'
>>>
>>> I think 2 and 3 deserve to be fixed or at least evaluated as important
>>> enough to cancel the vote. And if possible I would love to cherry-pick
>>> the Nexmark fixes for a future RC.
>>>
>>> On Thu, Feb 1, 2018 at 7:20 AM, Jean-Baptiste Onofré  
>>> wrote:
 Hi all,

 just a quick reminder about the vote process:

 1. Any vote can be changed during the vote period. A -1 vote has to be 
 argued
 (especially if there's not change to do in the project codebase).
 2. For convenience to the release manager, please inform if your vote is 
 binding
 or non-binding (the vote from PMC members are binding)
 3. It's not possible to "veto" a release: if we have at least 3 binding 
 votes,
 the vote can pass.
 4. Please, keep only vote in the thread. If you have some tests in 
 progress,
 please use another thread. It would be great if the thread only contains
 concrete votes.
 5. The vote duration can be extended on request.

 So, I'm extending this vote to 72 more hours to give us time to review
 especially the dataflow worker images test and the Flink TextIO potential 
 issue.

 Thanks !
 Regards
 JB

 On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 2.3.0, 
> as
> follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org 
> [2],
> which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.3.0-RC1" [5],
> * website pull request listing the release and publishing the API 
> reference
> manual [6].
> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
>
> The vote will be open for at least 72 hours. It is adopted by majority 
> approval,
> with at least 3 PMC affirmative votes.
>
> Thanks,
> JB
>
> [1]
> 

Re: [VOTE] Release 2.3.0, release candidate #1

2018-02-01 Thread Aljoscha Krettek
-1

I think the issue discovered with unbounded sources on Flink Streaming Runner 
is a serious regression. Good news is that there is already a fix for that: 
https://github.com/apache/beam/pull/4558/files

And BEAM-3587 also seems serious enough, IMHO.

Btw, BEAM-3186, which seems quite serious, was also finally figured out.

Best,
Aljoscha

> On 1. Feb 2018, at 16:00, Jean-Baptiste Onofré  wrote:
> 
> So, are you casting -1 vote ?
> 
> I guess so.
> 
> Regards
> JB
> 
> On 02/01/2018 03:53 PM, Ismaël Mejía wrote:
>> Hi,
>> 
>> I started to test the release with Nexmark and found three issues
>> (from minor to more important):
>> 
>> 1. Small issues to run Nexmark with the release (BEAM-3531 fixed,
>> BEAM-3592 in PR):
>> 
>> BEAM-3531 Nexmark failed with NPE with DEFAULT suite
>> BEAM-3592 Spark-runner profile is broken on Nexmark after move to Spark 2.x
>> 
>> 2. Flink is broken with Unbounded sources, big enough to deserve a new
>> vote. Pred:
>> BEAM-3589 Flink runner breaks with ClassCastException on UnboundedSource
>> 
>> 3. Direct runner has a relatively big performance regression when
>> dealing with UnboundedSources. in particular the impact on query 7 of
>> Nexmark is considerable.
>> 
>> Just with the small SMOKE suite in my machine I get:
>> 
>> 
>> Beam 2.2.0   Beam 2.3.0
>>  Query  Runtime(sec) Runtime(sec)
>> 
>>     6.410.6
>>  0001   5.110.2
>>  0002   3.0 5.8
>>  0003   3.8 6.2
>>  0004   0.9 1.4
>>  0005   5.811.4
>>  0006   0.8 1.4
>>  0007 193.8  1249.1
>>  0008   3.9 6.9
>>  0009   0.9 1.3
>>  0010   6.4 8.2
>>  0011   5.0 9.4
>>  0012   4.7 9.1
>> 
>> This can be reproduced by running this command:
>> 
>> mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main
>> -Pdirect-runner -Dexec.args="--runner=DirectRunner --suite=SMOKE
>> --streaming=true --manageResources=false --monitorJobs=true
>> --enforceEncodability=true --enforceImmutability=true" -pl
>> 'sdks/java/nexmark'
>> 
>> I think 2 and 3 deserve to be fixed or at least evaluated as important
>> enough to cancel the vote. And if possible I would love to cherry-pick
>> the Nexmark fixes for a future RC.
>> 
>> On Thu, Feb 1, 2018 at 7:20 AM, Jean-Baptiste Onofré  
>> wrote:
>>> Hi all,
>>> 
>>> just a quick reminder about the vote process:
>>> 
>>> 1. Any vote can be changed during the vote period. A -1 vote has to be 
>>> argued
>>> (especially if there's not change to do in the project codebase).
>>> 2. For convenience to the release manager, please inform if your vote is 
>>> binding
>>> or non-binding (the vote from PMC members are binding)
>>> 3. It's not possible to "veto" a release: if we have at least 3 binding 
>>> votes,
>>> the vote can pass.
>>> 4. Please, keep only vote in the thread. If you have some tests in progress,
>>> please use another thread. It would be great if the thread only contains
>>> concrete votes.
>>> 5. The vote duration can be extended on request.
>>> 
>>> So, I'm extending this vote to 72 more hours to give us time to review
>>> especially the dataflow worker images test and the Flink TextIO potential 
>>> issue.
>>> 
>>> Thanks !
>>> Regards
>>> JB
>>> 
>>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
 Hi everyone,
 
 Please review and vote on the release candidate #1 for the version 2.3.0, 
 as
 follows:
 
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)
 
 
 The complete staging area is available for your review, which includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to dist.apache.org [2],
 which is signed with the key with fingerprint C8282E76 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.3.0-RC1" [5],
 * website pull request listing the release and publishing the API reference
 manual [6].
 * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2].
 
 The vote will be open for at least 72 hours. It is adopted by majority 
 approval,
 with at least 3 PMC affirmative votes.
 
 Thanks,
 JB
 
 [1]
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
 [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
 [5] 

Re: [VOTE] Release 2.3.0, release candidate #1

2018-02-01 Thread Jean-Baptiste Onofré
So, are you casting -1 vote ?

I guess so.

Regards
JB

On 02/01/2018 03:53 PM, Ismaël Mejía wrote:
> Hi,
> 
> I started to test the release with Nexmark and found three issues
> (from minor to more important):
> 
> 1. Small issues to run Nexmark with the release (BEAM-3531 fixed,
> BEAM-3592 in PR):
> 
> BEAM-3531 Nexmark failed with NPE with DEFAULT suite
> BEAM-3592 Spark-runner profile is broken on Nexmark after move to Spark 2.x
> 
> 2. Flink is broken with Unbounded sources, big enough to deserve a new
> vote. Pred:
> BEAM-3589 Flink runner breaks with ClassCastException on UnboundedSource
> 
> 3. Direct runner has a relatively big performance regression when
> dealing with UnboundedSources. in particular the impact on query 7 of
> Nexmark is considerable.
> 
> Just with the small SMOKE suite in my machine I get:
> 
> 
>  Beam 2.2.0   Beam 2.3.0
>   Query  Runtime(sec) Runtime(sec)
> 
>      6.410.6
>   0001   5.110.2
>   0002   3.0 5.8
>   0003   3.8 6.2
>   0004   0.9 1.4
>   0005   5.811.4
>   0006   0.8 1.4
>   0007 193.8  1249.1
>   0008   3.9 6.9
>   0009   0.9 1.3
>   0010   6.4 8.2
>   0011   5.0 9.4
>   0012   4.7 9.1
> 
> This can be reproduced by running this command:
> 
> mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main
> -Pdirect-runner -Dexec.args="--runner=DirectRunner --suite=SMOKE
> --streaming=true --manageResources=false --monitorJobs=true
> --enforceEncodability=true --enforceImmutability=true" -pl
> 'sdks/java/nexmark'
> 
> I think 2 and 3 deserve to be fixed or at least evaluated as important
> enough to cancel the vote. And if possible I would love to cherry-pick
> the Nexmark fixes for a future RC.
> 
> On Thu, Feb 1, 2018 at 7:20 AM, Jean-Baptiste Onofré  
> wrote:
>> Hi all,
>>
>> just a quick reminder about the vote process:
>>
>> 1. Any vote can be changed during the vote period. A -1 vote has to be argued
>> (especially if there's not change to do in the project codebase).
>> 2. For convenience to the release manager, please inform if your vote is 
>> binding
>> or non-binding (the vote from PMC members are binding)
>> 3. It's not possible to "veto" a release: if we have at least 3 binding 
>> votes,
>> the vote can pass.
>> 4. Please, keep only vote in the thread. If you have some tests in progress,
>> please use another thread. It would be great if the thread only contains
>> concrete votes.
>> 5. The vote duration can be extended on request.
>>
>> So, I'm extending this vote to 72 more hours to give us time to review
>> especially the dataflow worker images test and the Flink TextIO potential 
>> issue.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the version 2.3.0, as
>>> follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org [2],
>>> which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.3.0-RC1" [5],
>>> * website pull request listing the release and publishing the API reference
>>> manual [6].
>>> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority 
>>> approval,
>>> with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
>>> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>>> [6] https://github.com/apache/beam-site/pull/381
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 2.3.0, release candidate #1

2018-02-01 Thread Ismaël Mejía
Hi,

I started to test the release with Nexmark and found three issues
(from minor to more important):

1. Small issues to run Nexmark with the release (BEAM-3531 fixed,
BEAM-3592 in PR):

BEAM-3531 Nexmark failed with NPE with DEFAULT suite
BEAM-3592 Spark-runner profile is broken on Nexmark after move to Spark 2.x

2. Flink is broken with Unbounded sources, big enough to deserve a new
vote. Pred:
BEAM-3589 Flink runner breaks with ClassCastException on UnboundedSource

3. Direct runner has a relatively big performance regression when
dealing with UnboundedSources. in particular the impact on query 7 of
Nexmark is considerable.

Just with the small SMOKE suite in my machine I get:


 Beam 2.2.0   Beam 2.3.0
  Query  Runtime(sec) Runtime(sec)

     6.410.6
  0001   5.110.2
  0002   3.0 5.8
  0003   3.8 6.2
  0004   0.9 1.4
  0005   5.811.4
  0006   0.8 1.4
  0007 193.8  1249.1
  0008   3.9 6.9
  0009   0.9 1.3
  0010   6.4 8.2
  0011   5.0 9.4
  0012   4.7 9.1

This can be reproduced by running this command:

mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main
-Pdirect-runner -Dexec.args="--runner=DirectRunner --suite=SMOKE
--streaming=true --manageResources=false --monitorJobs=true
--enforceEncodability=true --enforceImmutability=true" -pl
'sdks/java/nexmark'

I think 2 and 3 deserve to be fixed or at least evaluated as important
enough to cancel the vote. And if possible I would love to cherry-pick
the Nexmark fixes for a future RC.

On Thu, Feb 1, 2018 at 7:20 AM, Jean-Baptiste Onofré  wrote:
> Hi all,
>
> just a quick reminder about the vote process:
>
> 1. Any vote can be changed during the vote period. A -1 vote has to be argued
> (especially if there's not change to do in the project codebase).
> 2. For convenience to the release manager, please inform if your vote is 
> binding
> or non-binding (the vote from PMC members are binding)
> 3. It's not possible to "veto" a release: if we have at least 3 binding votes,
> the vote can pass.
> 4. Please, keep only vote in the thread. If you have some tests in progress,
> please use another thread. It would be great if the thread only contains
> concrete votes.
> 5. The vote duration can be extended on request.
>
> So, I'm extending this vote to 72 more hours to give us time to review
> especially the dataflow worker images test and the Flink TextIO potential 
> issue.
>
> Thanks !
> Regards
> JB
>
> On 01/30/2018 09:04 AM, Jean-Baptiste Onofré wrote:
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the version 2.3.0, as
>> follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org [2],
>> which is signed with the key with fingerprint C8282E76 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.3.0-RC1" [5],
>> * website pull request listing the release and publishing the API reference
>> manual [6].
>> * Java artifacts were built with Maven 3.3.9 and Oracle JDK 1.8.0_111.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority 
>> approval,
>> with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> JB
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12341608
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.3.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1026/
>> [5] https://github.com/apache/beam/tree/v2.3.0-RC1
>> [6] https://github.com/apache/beam-site/pull/381
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: drop scala....version from artifact ;)

2018-02-01 Thread Romain Manni-Bucau
Flink, Gearpump, Spark, and GCE provisioning are affected by this "issue".
Dropping it if we never manage 2 versions is nicer for end users IMHO but
I'm fine keeping it. Just would like to ensure it is uniform accross the
whole projet.


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 

2018-02-01 14:58 GMT+01:00 Aljoscha Krettek :

> I think Kafka IO doesn't have a transitive Scala dependency anymore
> because Kafka removed that from their client code a while ago.
>
> Best,
> Aljoscha
>
> > On 1. Feb 2018, at 14:48, Jean-Baptiste Onofré  wrote:
> >
> > I got your point Aljoscha. Flink runner is the only module using this
> suffix.
> >
> > Spark runner, Kafka IO, and others also have a scala dep but don't use
> the suffix.
> >
> > So, we have three options:
> > 1. We leave as it is right now
> > 2. We remove suffix from Flink runner
> > 3. We add suffix to other modules (Spark runner, Kafka IO, ...)
> >
> > Thoughts ?
> >
> > I'm OK to stay on 1 for now.
> >
> > Regards
> > JB
> >
> > On 02/01/2018 02:45 PM, Aljoscha Krettek wrote:
> >> I think it's not wise to remove the Scala suffix. When using the Flink
> Runner you have to make sure that the Scala version matches the Scala
> version of the Flink Cluster. And I think comparing the suffix of your
> flink-runner dependency and the suffix of your Flink dist is an easy way of
> doing that.
> >>
> >>
> >>> On 31. Jan 2018, at 16:55, Jean-Baptiste Onofré 
> wrote:
> >>>
> >>> Hi Romain,
> >>>
> >>> AFAIR only Flink runner uses scala version in the artifactId.
> >>>
> >>> +1 for me.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote:
>  Hi guys
> 
>  since beam supports a single version of runners why not dropping the
> scala
>  version from the artifactId?
> 
>  ATM upgrades are painful cause you upgrade beam version+ runner
> artifactIds.
> 
>  wdyt?
> 
>  Romain Manni-Bucau
>  @rmannibucau  |  Blog
>   | Old Blog
>   | Github  rmannibucau> |
>  LinkedIn 
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
>


Re: drop scala....version from artifact ;)

2018-02-01 Thread Aljoscha Krettek
I think Kafka IO doesn't have a transitive Scala dependency anymore because 
Kafka removed that from their client code a while ago.

Best,
Aljoscha

> On 1. Feb 2018, at 14:48, Jean-Baptiste Onofré  wrote:
> 
> I got your point Aljoscha. Flink runner is the only module using this suffix.
> 
> Spark runner, Kafka IO, and others also have a scala dep but don't use the 
> suffix.
> 
> So, we have three options:
> 1. We leave as it is right now
> 2. We remove suffix from Flink runner
> 3. We add suffix to other modules (Spark runner, Kafka IO, ...)
> 
> Thoughts ?
> 
> I'm OK to stay on 1 for now.
> 
> Regards
> JB
> 
> On 02/01/2018 02:45 PM, Aljoscha Krettek wrote:
>> I think it's not wise to remove the Scala suffix. When using the Flink 
>> Runner you have to make sure that the Scala version matches the Scala 
>> version of the Flink Cluster. And I think comparing the suffix of your 
>> flink-runner dependency and the suffix of your Flink dist is an easy way of 
>> doing that.
>> 
>> 
>>> On 31. Jan 2018, at 16:55, Jean-Baptiste Onofré  wrote:
>>> 
>>> Hi Romain,
>>> 
>>> AFAIR only Flink runner uses scala version in the artifactId.
>>> 
>>> +1 for me.
>>> 
>>> Regards
>>> JB
>>> 
>>> On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote:
 Hi guys
 
 since beam supports a single version of runners why not dropping the scala
 version from the artifactId?
 
 ATM upgrades are painful cause you upgrade beam version+ runner 
 artifactIds.
 
 wdyt?
 
 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github 
  |
 LinkedIn 
>>> 
>>> -- 
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: drop scala....version from artifact ;)

2018-02-01 Thread Jean-Baptiste Onofré
I got your point Aljoscha. Flink runner is the only module using this suffix.

Spark runner, Kafka IO, and others also have a scala dep but don't use the 
suffix.

So, we have three options:
1. We leave as it is right now
2. We remove suffix from Flink runner
3. We add suffix to other modules (Spark runner, Kafka IO, ...)

Thoughts ?

I'm OK to stay on 1 for now.

Regards
JB

On 02/01/2018 02:45 PM, Aljoscha Krettek wrote:
> I think it's not wise to remove the Scala suffix. When using the Flink Runner 
> you have to make sure that the Scala version matches the Scala version of the 
> Flink Cluster. And I think comparing the suffix of your flink-runner 
> dependency and the suffix of your Flink dist is an easy way of doing that.
> 
> 
>> On 31. Jan 2018, at 16:55, Jean-Baptiste Onofré  wrote:
>>
>> Hi Romain,
>>
>> AFAIR only Flink runner uses scala version in the artifactId.
>>
>> +1 for me.
>>
>> Regards
>> JB
>>
>> On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote:
>>> Hi guys
>>>
>>> since beam supports a single version of runners why not dropping the scala
>>> version from the artifactId?
>>>
>>> ATM upgrades are painful cause you upgrade beam version+ runner 
>>> artifactIds.
>>>
>>> wdyt?
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github 
>>>  |
>>> LinkedIn 
>>
>> -- 
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: drop scala....version from artifact ;)

2018-02-01 Thread Aljoscha Krettek
I think it's not wise to remove the Scala suffix. When using the Flink Runner 
you have to make sure that the Scala version matches the Scala version of the 
Flink Cluster. And I think comparing the suffix of your flink-runner dependency 
and the suffix of your Flink dist is an easy way of doing that.


> On 31. Jan 2018, at 16:55, Jean-Baptiste Onofré  wrote:
> 
> Hi Romain,
> 
> AFAIR only Flink runner uses scala version in the artifactId.
> 
> +1 for me.
> 
> Regards
> JB
> 
> On 01/31/2018 04:45 PM, Romain Manni-Bucau wrote:
>> Hi guys
>> 
>> since beam supports a single version of runners why not dropping the scala
>> version from the artifactId?
>> 
>> ATM upgrades are painful cause you upgrade beam version+ runner 
>> artifactIds.
>> 
>> wdyt?
>> 
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github  
>> |
>> LinkedIn 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Build failed in Jenkins: beam_PostRelease_NightlySnapshot #13

2018-02-01 Thread Apache Jenkins Server
See 


Changes:

[wcn] Update generated version of Fn API code.

[mariagh] Support argparse-style choices for ValueProvider

[robertwb] [BEAM-3490] Make runtime type checking code runner agnostic.

[github] get the query from configuration not options

[dariusz.aniszewski] use build $WORKSPACE as pkb temp_dir and update pip and 
setuptools in

[iemejia] [BEAM-3578] SQL module build breaks because of missing dependency

[kenn] google-java-format

[kenn] Fix Distinct null pointer error with speculative triggers

[kenn] Move TestCountingSource to appropriate location

[robertwb] Direct runner fixes.

[lcwik] [BEAM-2926] Add support for side inputs to the runner harness.

[kenn] Sickbay ApexRunner gradle WordCountIT

[kenn] Sickbay flakey KinesisReaderTest

[lcwik] [BEAM-3249] Make sure that all java projects package tests. Also package

[lcwik] [BEAM-3249] Do not assume build directory is within build/, use the

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam4 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 82e5e944ba143c23acf377bfd7a850046be68ea7 (origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 82e5e944ba143c23acf377bfd7a850046be68ea7
Commit message: "[BEAM-3249] Do not assume build directory is within build/, 
use the project defined build dir."
 > git rev-list 645e1b51e1c83ec4b0b54a774d6615624f3c42ed # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
[beam_PostRelease_NightlySnapshot] $ /bin/bash -xe 
/tmp/jenkins5709761348709679950.sh
+ cd src/release
+ groovy quickstart-java-direct.groovy
/tmp/jenkins5709761348709679950.sh: line 2: groovy: command not found
Build step 'Execute shell' marked build as failure
Not sending mail to unregistered user kirpic...@google.com
Not sending mail to unregistered user git...@alasdairhodge.co.uk
Not sending mail to unregistered user mott...@gmail.com
Not sending mail to unregistered user xuming...@users.noreply.github.com
Not sending mail to unregistered user aromanenko@gmail.com
Not sending mail to unregistered user mari...@mariagh.svl.corp.google.com
Not sending mail to unregistered user pawel.pk.kaczmarc...@gmail.com
Not sending mail to unregistered user ke...@google.com
Not sending mail to unregistered user dariusz.aniszew...@polidea.com
Not sending mail to unregistered user w...@google.com
Not sending mail to unregistered user z...@giggles.nyc.corp.google.com
Not sending mail to unregistered user joey.bar...@gmail.com


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-01 Thread Ismaël Mejía
Huge +1 to get rid of Guava!

This solves annoying dependency issues for some IOs and allow us to
get rid of the shading that makes current jars bigger than they should
be.

We can create our own 'micro guava' package with some classes for
things that are hard to migrate, or that we  prefer to still have like
the check* methods for example. Given the size of the task we should
probably divide it into subtasks, more important is to get rid of it
for 'sdks/java/core'. We can then attack other areas afterwards.

Other important idea would be to get rid of Protobuf in public APIs
like GCPIO and to better shade it from leaking into the runners. An
unexpected side effect of this is a leak of netty via gRPC/protobuf
that is byting us for the Spark runner, but well that's worth a
different discussion.


On Thu, Feb 1, 2018 at 10:08 AM, Romain Manni-Bucau
 wrote:
> a map of list is fine and not a challenge we'll face long I hope ;)
>
>
> Romain Manni-Bucau
> @rmannibucau |  Blog | Old Blog | Github | LinkedIn
>
> 2018-02-01 9:40 GMT+01:00 Reuven Lax :
>>
>> Not sure we'll be able to replace them all. Things like guava Table and
>> Multimap don't have great replacements in Java8.
>>
>> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré 
>> wrote:
>>>
>>> +1, it was on my TODO for a while waiting the Java8 update.
>>>
>>> Regards
>>> JB
>>>
>>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>>> > Why not dropping guava for all beam codebase? With java 8 it is quite
>>> > easy to do
>>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>>> > good result.
>>> >
>>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" >> > > a écrit :
>>> >
>>> > Make sure to include the guava version in the artifact name so that
>>> > we can
>>> > have multiple vendored versions.
>>> >
>>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles >> > > wrote:
>>> >
>>> > I didn't have time for this, but it just bit me. We definitely
>>> > have
>>> > Guava on the API surface of runner support code in ways that
>>> > get
>>> > incompatibly shaded. I will probably start "1a" by making a
>>> > shaded
>>> > library org.apache.beam:vendored-guava and starting to use it.
>>> > It sounds
>>> > like there is generally unanimous support for that much,
>>> > anyhow.
>>> >
>>> > Kenn
>>> >
>>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek
>>> > >> > > wrote:
>>> >
>>> > Thanks Ismaël for bringing up this discussion again!
>>> >
>>> > I would be in favour of 1) and more specifically of 1a)
>>> >
>>> > Aljoscha
>>> >
>>> >
>>> >> On 12. Dec 2017, at 18:56, Lukasz Cwik >> >> > wrote:
>>> >>
>>> >> You can always run tests on post shaded artifacts instead
>>> >> of the
>>> >> compiled classes, it just requires us to change our maven
>>> >> surefire
>>> >> / gradle test configurations but it is true that most IDEs
>>> >> would
>>> >> behave better with a dependency jar unless you delegate
>>> >> all the
>>> >> build/test actions to the build system and then it won't
>>> >> matter.
>>> >>
>>> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles
>>> >> >> >> > wrote:
>>> >>
>>> >> There's also, with additional overhead,
>>> >>
>>> >> 1a) A relocated and shipped package for each thing we
>>> >> want to
>>> >> relocate. I think this has also been tried outside
>>> >> Beam...
>>> >>
>>> >> Pros:
>>> >> * all the pros of 1) plus no bloat beyond what is
>>> >> necessary
>>> >> Cons:
>>> >> * abandons whitelist approach for public deps,
>>> >> reverting to
>>> >> blacklist approach for trouble things like guava, so a
>>> >> bit
>>> >> less principled
>>> >>
>>> >> For both 1) and 1a) I would add:
>>> >>
>>> >> Pros:
>>> >> * clearly readable dependency since code will `import
>>> >> org.apache.beam.private.guava21` and IDEs will
>>> >> understand it
>>> >> is a distinct lilbrary
>>> >> * can run tests on unpackaged classes, as long as deps
>>> >> are
>>> >> shaded or provided as jars
>>> >> * no mysterious action at a distance from inherited
>>> >> configuration
>>> >> Cons:
>>> >> * need to adjust imports
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik
>>> >> >> >>

Re: KafkaIO reading from latest offset when pipeline fails on FlinkRunner

2018-02-01 Thread Sushil Ks
Hi,
   Apologies for delay in my reply,

@Raghu Angadi
This checkpoints 20 mins, as you mentioned before any
checkpoint is created and if the pipeline restarts, it's reading from the
latest offset.

@Mingmin
Thanks a lot for sharing your learnings, However in case of any
*UserCodeException* while processing the element as part of ParDo after
materializing the window, the pipeline drops the unprocessed elements and
restarts. Is this expected from Beam?


On Wed, Jan 17, 2018 at 2:13 AM, Kenneth Knowles  wrote:

> Is there a JIRA filed for this? I think this discussion should live in a
> ticket.
>
> Kenn
>
> On Wed, Jan 10, 2018 at 11:00 AM, Mingmin Xu  wrote:
>
>> @Sushil, I have several jobs running on KafkaIO+FlinkRunner, hope my
>> experience can help you a bit.
>>
>> For short, `ENABLE_AUTO_COMMIT_CONFIG` doesn't meet your requirement, you
>> need to leverage exactly-once checkpoint/savepoint in Flink. The reason
>> is,  with `ENABLE_AUTO_COMMIT_CONFIG` KafkaIO commits offset after data is
>> read, and once job is restarted KafkaIO reads from last_committed_offset.
>>
>> In my jobs, I enable external(external should be optional I think?)
>> checkpoint on exactly-once mode in Flink cluster. When the job auto-restart
>> on failures it doesn't lost data. In case of manually redeploy the job, I
>> use savepoint to cancel and launch the job.
>>
>> Mingmin
>>
>> On Wed, Jan 10, 2018 at 10:34 AM, Raghu Angadi 
>> wrote:
>>
>>> How often does your pipeline checkpoint/snapshot? If the failure happens
>>> before the first checkpoint, the pipeline could restart without any state,
>>> in which case KafkaIO would read from latest offset. There is probably some
>>> way to verify if pipeline is restarting from a checkpoint.
>>>
>>> On Sun, Jan 7, 2018 at 10:57 PM, Sushil Ks  wrote:
>>>
 HI Aljoscha,
The issue is let's say I consumed 100 elements in 5
 mins Fixed Window with *GroupByKey* and later I applied *ParDO* for
 all those elements. If there is an issue while processing element 70 in
 *ParDo *and the pipeline restarts with *UserCodeException *it's
 skipping the rest 30 elements. Wanted to know if this is expected? In case
 if you still having doubt let me know will share a code snippet.

 Regards,
 Sushil Ks

>>>
>>>
>>
>>
>> --
>> 
>> Mingmin
>>
>
>


Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-01 Thread Romain Manni-Bucau
a map of list is fine and not a challenge we'll face long I hope ;)


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn 

2018-02-01 9:40 GMT+01:00 Reuven Lax :

> Not sure we'll be able to replace them all. Things like guava Table and
> Multimap don't have great replacements in Java8.
>
> On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré 
> wrote:
>
>> +1, it was on my TODO for a while waiting the Java8 update.
>>
>> Regards
>> JB
>>
>> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
>> > Why not dropping guava for all beam codebase? With java 8 it is quite
>> easy to do
>> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
>> good result.
>> >
>> > Le 1 févr. 2018 06:50, "Lukasz Cwik" > > > a écrit :
>> >
>> > Make sure to include the guava version in the artifact name so that
>> we can
>> > have multiple vendored versions.
>> >
>> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles > > > wrote:
>> >
>> > I didn't have time for this, but it just bit me. We definitely
>> have
>> > Guava on the API surface of runner support code in ways that get
>> > incompatibly shaded. I will probably start "1a" by making a
>> shaded
>> > library org.apache.beam:vendored-guava and starting to use it.
>> It sounds
>> > like there is generally unanimous support for that much, anyhow.
>> >
>> > Kenn
>> >
>> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <
>> aljos...@apache.org
>> > > wrote:
>> >
>> > Thanks Ismaël for bringing up this discussion again!
>> >
>> > I would be in favour of 1) and more specifically of 1a)
>> >
>> > Aljoscha
>> >
>> >
>> >> On 12. Dec 2017, at 18:56, Lukasz Cwik > >> > wrote:
>> >>
>> >> You can always run tests on post shaded artifacts instead
>> of the
>> >> compiled classes, it just requires us to change our maven
>> surefire
>> >> / gradle test configurations but it is true that most IDEs
>> would
>> >> behave better with a dependency jar unless you delegate
>> all the
>> >> build/test actions to the build system and then it won't
>> matter.
>> >>
>> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <
>> k...@google.com
>> >> > wrote:
>> >>
>> >> There's also, with additional overhead,
>> >>
>> >> 1a) A relocated and shipped package for each thing we
>> want to
>> >> relocate. I think this has also been tried outside
>> Beam...
>> >>
>> >> Pros:
>> >> * all the pros of 1) plus no bloat beyond what is
>> necessary
>> >> Cons:
>> >> * abandons whitelist approach for public deps,
>> reverting to
>> >> blacklist approach for trouble things like guava, so a
>> bit
>> >> less principled
>> >>
>> >> For both 1) and 1a) I would add:
>> >>
>> >> Pros:
>> >> * clearly readable dependency since code will `import
>> >> org.apache.beam.private.guava21` and IDEs will
>> understand it
>> >> is a distinct lilbrary
>> >> * can run tests on unpackaged classes, as long as deps
>> are
>> >> shaded or provided as jars
>> >> * no mysterious action at a distance from inherited
>> configuration
>> >> Cons:
>> >> * need to adjust imports
>> >>
>> >> Kenn
>> >>
>> >> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <
>> lc...@google.com
>> >> > wrote:
>> >>
>> >> I would suggest that either we use:
>> >> 1) A common deps package containing shaded
>> dependencies
>> >> allows for
>> >> Pros
>> >> * doesn't require the user to build an uber jar
>> >> Risks
>> >> * dependencies package will keep growing even if
>> something
>> >> is or isn't needed by all of Apache Beam leading
>> to a
>> >> large jar anyways negating any space savings
>> >>
>> >> 2) Shade within each module to a common location
>> like
>> >> org.apache.beam.relocated.guava
>> >> Pros
>> >> * you only get the shaded dependencies of the
>> things that
>> >>  

Re: [DISCUSS] [Java] Private shaded dependency uber jars

2018-02-01 Thread Reuven Lax
Not sure we'll be able to replace them all. Things like guava Table and
Multimap don't have great replacements in Java8.

On Wed, Jan 31, 2018 at 10:11 PM, Jean-Baptiste Onofré 
wrote:

> +1, it was on my TODO for a while waiting the Java8 update.
>
> Regards
> JB
>
> On 02/01/2018 06:56 AM, Romain Manni-Bucau wrote:
> > Why not dropping guava for all beam codebase? With java 8 it is quite
> easy to do
> > it and avoid a bunch of conflicts. Did it in 2 projects with quite a
> good result.
> >
> > Le 1 févr. 2018 06:50, "Lukasz Cwik"  > > a écrit :
> >
> > Make sure to include the guava version in the artifact name so that
> we can
> > have multiple vendored versions.
> >
> > On Wed, Jan 31, 2018 at 9:16 PM, Kenneth Knowles  > > wrote:
> >
> > I didn't have time for this, but it just bit me. We definitely
> have
> > Guava on the API surface of runner support code in ways that get
> > incompatibly shaded. I will probably start "1a" by making a
> shaded
> > library org.apache.beam:vendored-guava and starting to use it.
> It sounds
> > like there is generally unanimous support for that much, anyhow.
> >
> > Kenn
> >
> > On Wed, Dec 13, 2017 at 8:31 AM, Aljoscha Krettek <
> aljos...@apache.org
> > > wrote:
> >
> > Thanks Ismaël for bringing up this discussion again!
> >
> > I would be in favour of 1) and more specifically of 1a)
> >
> > Aljoscha
> >
> >
> >> On 12. Dec 2017, at 18:56, Lukasz Cwik  >> > wrote:
> >>
> >> You can always run tests on post shaded artifacts instead
> of the
> >> compiled classes, it just requires us to change our maven
> surefire
> >> / gradle test configurations but it is true that most IDEs
> would
> >> behave better with a dependency jar unless you delegate all
> the
> >> build/test actions to the build system and then it won't
> matter.
> >>
> >> On Mon, Dec 11, 2017 at 9:05 PM, Kenneth Knowles <
> k...@google.com
> >> > wrote:
> >>
> >> There's also, with additional overhead,
> >>
> >> 1a) A relocated and shipped package for each thing we
> want to
> >> relocate. I think this has also been tried outside
> Beam...
> >>
> >> Pros:
> >> * all the pros of 1) plus no bloat beyond what is
> necessary
> >> Cons:
> >> * abandons whitelist approach for public deps,
> reverting to
> >> blacklist approach for trouble things like guava, so a
> bit
> >> less principled
> >>
> >> For both 1) and 1a) I would add:
> >>
> >> Pros:
> >> * clearly readable dependency since code will `import
> >> org.apache.beam.private.guava21` and IDEs will
> understand it
> >> is a distinct lilbrary
> >> * can run tests on unpackaged classes, as long as deps
> are
> >> shaded or provided as jars
> >> * no mysterious action at a distance from inherited
> configuration
> >> Cons:
> >> * need to adjust imports
> >>
> >> Kenn
> >>
> >> On Mon, Dec 11, 2017 at 9:57 AM, Lukasz Cwik <
> lc...@google.com
> >> > wrote:
> >>
> >> I would suggest that either we use:
> >> 1) A common deps package containing shaded
> dependencies
> >> allows for
> >> Pros
> >> * doesn't require the user to build an uber jar
> >> Risks
> >> * dependencies package will keep growing even if
> something
> >> is or isn't needed by all of Apache Beam leading to
> a
> >> large jar anyways negating any space savings
> >>
> >> 2) Shade within each module to a common location
> like
> >> org.apache.beam.relocated.guava
> >> Pros
> >> * you only get the shaded dependencies of the
> things that
> >> are required
> >> * its one less dependency for users to manage
> >> Risks
> >> * requires an uber jar to be built to get the space
> >> savings (either by a user or a distribution of
> Apache
> >> Beam) otherwise we negate any space savings.
> >>
> >> If we either use a common relocation scheme or a
> >> dependencies jar then each relocation should
> 

Jenkins build is back to normal : beam_Release_NightlySnapshot #672

2018-02-01 Thread Apache Jenkins Server
See