Re: RequiresStableInput on Spark runner

2020-07-08 Thread Jozef Vilcek
My last question was more towards the graph translation for batch mode.

Should DoFn with @RequiresStableInput be translated/expanded in some
specific way (e.g. DoFn -> Reshuffle + DoFn) or is it not needed for batch?
Most runners fail in the presence of @RequiresStableInput for both batch
and streaming. I cannot find such a failure for Flink and Dataflow, but at the
same time, I cannot find what those runners do with such a DoFn.
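
To make the question concrete, the kind of expansion I have in mind looks
roughly like this (a minimal sketch in the Java SDK; MyStableInputFn and the
input collection are made-up names):

import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// What the pipeline author writes:
PCollection<String> out = input.apply(ParDo.of(new MyStableInputFn()));

// What a runner might expand it to, so that the DoFn only ever sees
// elements that were checkpointed/materialized by a shuffle:
PCollection<String> expanded =
    input
        .apply(Reshuffle.viaRandomKey())
        .apply(ParDo.of(new MyStableInputFn()));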

On Tue, Jul 7, 2020 at 9:18 PM Kenneth Knowles  wrote:

> I hope someone who knows better than me can respond.
>
> A long time ago, the SparkRunner added a call to materialize() at every
> GroupByKey. This was to mimic Dataflow, since so many of the initial IO
> transforms relied on using shuffle to create stable inputs.
>
> The overall goal is to be able to remove these extra calls to
> materialize() and only include them when @RequiresStableInput.
>
> The intermediate state is to analyze whether input is already stable from
> materialize() and add another materialize() only if it is not stable.
>
> I don't know the current state of the SparkRunner. This may already have
> changed.
>
> Kenn
>
> On Thu, Jul 2, 2020 at 10:24 PM Jozef Vilcek 
> wrote:
>
>> I was trying to look for references on how other runners handle
>> @RequiresStableInput for batch cases; however, I was not able to find any.
>> In Flink I can see added support for the streaming case, and in Dataflow I see
>> that support for the feature was turned off:
>> https://github.com/apache/beam/pull/8065
>>
>> It seems to me that @RequiresStableInput is ignored for the batch case
>> and the runner relies on being able to recompute the whole job in the worst
>> case scenario.
>> Is this assumption correct?
>> Could I just change SparkRunner to crash on @RequiresStableInput
>> annotation for streaming mode and ignore it in batch?
>>
>>
>>
>> On Wed, Jul 1, 2020 at 10:27 AM Jozef Vilcek 
>> wrote:
>>
>>> We have a component which we use in streaming and batch jobs.
>>> Streaming we run on FlinkRunner and batch on SparkRunner. Recently we
>>> needed to add @RequiresStableInput to that component because of a streaming
>>> use case. But now the batch case crashes on SparkRunner with
>>>
>>> Caused by: java.lang.UnsupportedOperationException: Spark runner currently 
>>> doesn't support @RequiresStableInput annotation.
>>> at 
>>> org.apache.beam.runners.core.construction.UnsupportedOverrideFactory.getReplacementTransform(UnsupportedOverrideFactory.java:58)
>>> at org.apache.beam.sdk.Pipeline.applyReplacement(Pipeline.java:556)
>>> at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:292)
>>> at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
>>> at org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:168)
>>> at org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:90)
>>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
>>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
>>> at 
>>> com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:42)
>>> at 
>>> com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:35)
>>> at scala.util.Try$.apply(Try.scala:192)
>>> at 
>>> com.dp.pipeline.driver.PipelineDriver$class.main(PipelineDriver.scala:35)
>>>
>>>
>>> We are using Beam 2.19.0. Is @RequiresStableInput problematic to
>>> support for both streaming and batch use cases? What are the options here?
>>> https://issues.apache.org/jira/browse/BEAM-5358
>>>
>>>


Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Alexey Romanenko
Hello,

A few days ago I noticed that I can’t build the project from old release 
branches. For example, I wanted to build and run Spark Job Server from 
the “release-2.20.0” branch and it failed:

./gradlew :runners:spark:job-server:runShadow --stacktrace

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task 
':model:pipeline:compileJava'.
…
Caused by: org.gradle.internal.UncheckedException: 
java.lang.ClassNotFoundException: 
com.google.errorprone.ErrorProneCompiler$Builder
…


I experienced the same issue for the “release-2.19.0” and “release-2.21.0” 
branches. I didn’t check older branches, but it seems it’s a global issue for 
“net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.

This is an already-known issue and it was fixed for 2.22.0 [1] a while ago. By 
applying the fix from [2] on top of a previous branch, for example the 
“release-2.20.0” branch, I’ve managed to build it. However, the problem for old 
branches (<2.22.0) is still there - it’s not possible to build them right after 
checkout without applying the fix.

So, there are two questions:

1. Is anyone aware why the old static version of gradle-errorprone-plugin fails 
for branches that were successfully built before?
2. Do we have to fix it for release branches <2.22.0 (either cherry-pick the 
fix for 2.22.0 or fix it some other way, if possible)?

[1] https://issues.apache.org/jira/browse/BEAM-10263 

[2] https://github.com/apache/beam/pull/11527 
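
For anyone hitting this on an old branch, the shape of the fix in [2] is to move
to a newer errorprone plugin. A minimal sketch of what the post-fix configuration
looks like (the plugin and Error Prone versions here are illustrative
assumptions, not necessarily the exact ones from the PR):

// build.gradle (Groovy DSL) - sketch only
plugins {
    // Newer plugin versions no longer reference the removed
    // com.google.errorprone.ErrorProneCompiler$Builder entry point.
    id 'net.ltgt.errorprone' version '1.3.0'
}

dependencies {
    // Error Prone itself becomes an ordinary, explicitly pinned dependency.
    errorprone 'com.google.errorprone:error_prone_core:2.3.4'
}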




Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Maximilian Michels

Hi Alexey,

I also came across this issue when building a custom Beam version. I 
applied the same fix (https://github.com/apache/beam/pull/11527) which 
you have mentioned.


It appears that the Maven dependencies changed or are no longer 
available, which causes the missing class files.


+1 for backporting the fix to the release branches.

Cheers,
Max

On 08.07.20 11:36, Alexey Romanenko wrote:

Hello,

Some days ago I noticed that I can’t build the project from old release 
branches . For example, I wanted to build and run Spark Job Server from 
“release-2.20.0” branch and it failed:


./gradlew :runners:spark:job-server:runShadow --stacktrace

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task 
':model:pipeline:compileJava'.

…
Caused by: org.gradle.internal.UncheckedException: 
java.lang.ClassNotFoundException: 
com.google.errorprone.ErrorProneCompiler$Builder

…


I experienced the same issue for “release-2.19.0” and  “release-2.21.0” 
branches, I didn’t check older branches but seems it’s a global issue 
for “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.


This is already known issue and it was fixed for 2.22.0 [1] a while ago. 
By applying a fix from [2] on top of previous branch, for example, 
“release-2.20.0” branch I’ve managed to build it. Though, the problem 
for old branches (<2.22.0) is still there - it’s not possible to build 
them right after checkout without applying the fix.


So, there are two questions:

1. Is anyone aware why the old static version of 
gradle-errorprone-plugin fails for the branches that were successfully 
built before?
2. Do we have to fix it for release branches <2.22.0 (either cherry-pick 
the fix for 2.22.0 or somehow else if it’s possible)?


[1] https://issues.apache.org/jira/browse/BEAM-10263
[2] https://github.com/apache/beam/pull/11527



Re: RequiresStableInput on Spark runner

2020-07-08 Thread Maximilian Michels
Correct, for batch we rely on re-running the entire job, which will 
produce stable input within each run.


For streaming, the Flink Runner buffers all input to a 
@RequiresStableInput DoFn until a checkpoint is complete; only then does it 
process the buffered data. Dataflow effectively does the same by going 
through the Shuffle service, which produces a consistent result.
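
For reference, the user-facing side is just an annotation on the
@ProcessElement method (a minimal sketch; the DoFn name, the Payload type and
the body are made up):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class CommitToExternalSystemFn extends DoFn<KV<String, Payload>, Void> {
  @ProcessElement
  @RequiresStableInput // the runner must replay exactly the same element on retry
  public void processElement(@Element KV<String, Payload> element) {
    // e.g. a non-idempotent write to an external system, keyed by the element
  }
}

On Flink that element is buffered until the next successful checkpoint; on
Dataflow it has already passed through Shuffle by the time the DoFn sees it.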


-Max

On 08.07.20 11:08, Jozef Vilcek wrote:

My last question was more towards the graph translation for batch mode.

Should DoFn with @RequiresStableInput be translated/expanded in some 
specific way (e.g. DoFn -> Reshuffle + DoFn) or is it not needed for batch?
Most runners fail in the presence of @RequiresStableInput for both batch 
and streaming. I cannot find such a failure for Flink and Dataflow, but at the 
same time, I cannot find what those runners do with such a DoFn.


On Tue, Jul 7, 2020 at 9:18 PM Kenneth Knowles wrote:


I hope someone who knows better than me can respond.

A long time ago, the SparkRunner added a call to materialize() at
every GroupByKey. This was to mimic Dataflow, since so many of the
initial IO transforms relied on using shuffle to create stable inputs.

The overall goal is to be able to remove these extra calls to
materialize() and only include them when @RequiresStableInput.

The intermediate state is to analyze whether input is already stable
from materialize() and add another materialize() only if it is not
stable.

I don't know the current state of the SparkRunner. This may already
have changed.

Kenn

On Thu, Jul 2, 2020 at 10:24 PM Jozef Vilcek <jozo.vil...@gmail.com> wrote:

I was trying to look for references on how other runners handle
@RequiresStableInput for batch cases, however I was not able to
find any.
In Flink I can see added support for streaming case and in
Dataflow I see that support for the feature was turned off
https://github.com/apache/beam/pull/8065

It seems to me that @RequiresStableInput is ignored for the
batch case and the runner relies on being able to recompute the
whole job in the worst case scenario.
Is this assumption correct?
Could I just change SparkRunner to crash on @RequiresStableInput
annotation for streaming mode and ignore it in batch?



On Wed, Jul 1, 2020 at 10:27 AM Jozef Vilcek
<jozo.vil...@gmail.com> wrote:

We have a component which we use in streaming and batch
jobs. Streaming we run on FlinkRunner and batch on
SparkRunner. Recently we needed to add @RequiresStableInput
to that component because of a streaming use case. But now
the batch case crashes on SparkRunner with

Caused by: java.lang.UnsupportedOperationException: Spark runner 
currently doesn't support @RequiresStableInput annotation.
at 
org.apache.beam.runners.core.construction.UnsupportedOverrideFactory.getReplacementTransform(UnsupportedOverrideFactory.java:58)
at 
org.apache.beam.sdk.Pipeline.applyReplacement(Pipeline.java:556)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:292)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
at 
org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:168)
at 
org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:90)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
at 
com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:42)
at 
com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:35)
at scala.util.Try$.apply(Try.scala:192)
at 
com.dp.pipeline.driver.PipelineDriver$class.main(PipelineDriver.scala:35)


We are using Beam 2.19.0. Is @RequiresStableInput
problematic to support for both streaming and batch
use cases? What are the options here?
https://issues.apache.org/jira/browse/BEAM-5358



NanosInstant not being recognised by BigQueryIO.Write

2020-07-08 Thread Robert.Butcher
Hi All,

I am posting this to the dev (as opposed to user) channel as I believe it will 
be of interest to those working on either Schemas or BigQuery.

I have a pipeline based on Beam 2.22 that is ingesting data into BigQuery.  
Internally I am using protobuf for my domain model and the associated schema 
support.

My intention is to make use of the useBeamSchema() method to both auto-generate 
the BigQuery table schema and to provide row conversion on write.  (The idea is 
to have true schema-first development very much in keeping with Alex's original 
ProtoBEAM concept).

The issue I've hit is around the treatment of google.protobuf.Timestamp fields.  
The schema conversion seems to map these to the correct logical type, 
org.apache.beam.sdk.schemas.logicaltypes.NanosInstant; however, this isn't 
recognised by BigQueryIO.Write.  Specifically, the BigQueryUtils.toTableSchema() 
method throws a NullPointerException.  This seems to be due to the fact that 
there is no entry for NanosInstant in the BEAM_TO_BIG_QUERY_LOGICAL_MAPPING map.

Is this a known issue?  Is there a workaround?

I appreciate that google.protobuf.Timestamp supports nanosecond-level precision, 
so it cannot be converted directly to the Beam schema type of DATETIME without 
loss of precision.  However, I believe use cases for nanosecond precision are 
rare.  Would it not be better to convert directly to DATETIME, according to the 
principle of least confusion?
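
The workaround I am currently considering is to truncate to millisecond
precision myself before the write, so the rows never carry a NanosInstant (a
minimal sketch; the proto accessors, field names and the events collection are
placeholders, not my actual domain model):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.DateTime;

Schema bqSchema = Schema.builder()
    .addStringField("id")
    .addDateTimeField("created_at") // Beam DATETIME, millisecond precision
    .build();

PCollection<Row> rows =
    events
        .apply(MapElements.into(TypeDescriptors.rows())
            .via(e -> Row.withSchema(bqSchema)
                .addValues(
                    e.getId(),
                    // truncate google.protobuf.Timestamp to millis
                    new DateTime(e.getCreatedAt().getSeconds() * 1000
                        + e.getCreatedAt().getNanos() / 1_000_000))
                .build()))
        .setRowSchema(bqSchema);

BigQueryIO.write().useBeamSchema() then only ever sees DATETIME and maps cleanly,
at the cost of the precision loss described above.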

Are there any plans to extend the range of types both within protobuf and the 
Beam schema to match the richer type set within BigQuery (DATE, DATETIME, 
TIMESTAMP)?  I would expect the combination of protobuf/Beam/BigQuery to be a 
common one (especially within GCP) and it would be nice as a developer to have 
a greater range of options.

Kind regards,

Rob

Robert Butcher
Technical Architect | Foundry/SRS | NatWest Markets
WeWork, 10 Devonshire Square, London, EC2M 4AE
Mobile +44 (0) 7414 730866



Re: RequiresStableInput on Spark runner

2020-07-08 Thread Jozef Vilcek
Would it then be safe to enable the same behavior for Spark batch? I can
create a JIRA and a patch for this, if there is no other reason not to
do so.

On Wed, Jul 8, 2020 at 11:51 AM Maximilian Michels  wrote:

> Correct, for batch we rely on re-running the entire job which will
> produce stable input within each run.
>
> For streaming, the Flink Runner buffers all input to a
> @RequiresStableInput DoFn until a checkpoint is complete, only then it
> processes the buffered data. Dataflow effectively does the same by going
> through the Shuffle service which produces a consistent result.
>
> -Max
>
> On 08.07.20 11:08, Jozef Vilcek wrote:
> > My last question was more towards the graph translation for batch mode.
> >
> > Should DoFn with @RequiresStableInput be translated/expanded in some
> > specific way (e.g. DoFn -> Reshuffle + DoFn) or is it not needed for
> batch?
> > Most runners fail in the presence of @RequiresStableInput for both batch
> > and streaming. I cannot find such a failure for Flink and Dataflow, but at the
> > same time, I cannot find what those runners do with such a DoFn.
> >
> > On Tue, Jul 7, 2020 at 9:18 PM Kenneth Knowles wrote:
> >
> > I hope someone who knows better than me can respond.
> >
> > A long time ago, the SparkRunner added a call to materialize() at
> > every GroupByKey. This was to mimic Dataflow, since so many of the
> > initial IO transforms relied on using shuffle to create stable
> inputs.
> >
> > The overall goal is to be able to remove these extra calls to
> > materialize() and only include them when @RequiresStableInput.
> >
> > The intermediate state is to analyze whether input is already stable
> > from materialize() and add another materialize() only if it is not
> > stable.
> >
> > I don't know the current state of the SparkRunner. This may already
> > have changed.
> >
> > Kenn
> >
> > On Thu, Jul 2, 2020 at 10:24 PM Jozef Vilcek wrote:
> >
> > I was trying to look for references on how other runners handle
> > @RequiresStableInput for batch cases, however I was not able to
> > find any.
> > In Flink I can see added support for streaming case and in
> > Dataflow I see that support for the feature was turned off
> > https://github.com/apache/beam/pull/8065
> >
> > It seems to me that @RequiresStableInput is ignored for the
> > batch case and the runner relies on being able to recompute the
> > whole job in the worst case scenario.
> > Is this assumption correct?
> > Could I just change SparkRunner to crash on @RequiresStableInput
> > annotation for streaming mode and ignore it in batch?
> >
> >
> >
> > On Wed, Jul 1, 2020 at 10:27 AM Jozef Vilcek
> > <jozo.vil...@gmail.com> wrote:
> >
> > We have a component which we use in streaming and batch
> > jobs. Streaming we run on FlinkRunner and batch on
> > SparkRunner. Recently we needed to add @RequiresStableInput
> > to that component because of a streaming use case. But now
> > the batch case crashes on SparkRunner with
> >
> > Caused by: java.lang.UnsupportedOperationException: Spark
> runner currently doesn't support @RequiresStableInput annotation.
> >   at
> org.apache.beam.runners.core.construction.UnsupportedOverrideFactory.getReplacementTransform(UnsupportedOverrideFactory.java:58)
> >   at
> org.apache.beam.sdk.Pipeline.applyReplacement(Pipeline.java:556)
> >   at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:292)
> >   at
> org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
> >   at
> org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:168)
> >   at
> org.apache.beam.runners.spark.SparkRunner.run(SparkRunner.java:90)
> >   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
> >   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
> >   at
> com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:42)
> >   at
> com.sizmek.dp.dsp.pipeline.driver.PipelineDriver$$anonfun$1.apply(PipelineDriver.scala:35)
> >   at scala.util.Try$.apply(Try.scala:192)
> >   at
> com.dp.pipeline.driver.PipelineDriver$class.main(PipelineDriver.scala:35)
> >
> >
> > We are using Beam 2.19.0. Is @RequiresStableInput
> > problematic to support for both streaming and batch
> > use cases? What are the options here?
> > https://issues.apache.org/jira/browse/BEAM-5358
> >
>


Beam Dependency Check Report (2020-07-08)

2020-07-08 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
  cachetools | 3.1.1 | 4.1.1 | 2019-12-23 | 2020-07-08 | BEAM-9017
  chromedriver-binary | 83.0.4103.39.0 | 84.0.4147.30.0 | None | 2020-07-08 | BEAM-10426
  google-cloud-datastore | 1.7.4 | 1.12.0 | 2019-05-27 | 2020-04-13 | BEAM-8443
  google-cloud-dlp | 0.13.0 | 1.0.0 | 2020-06-29 | 2020-06-29 | BEAM-10344
  google-cloud-pubsub | 1.0.2 | 1.6.1 | 2019-12-23 | 2020-07-08 | BEAM-5539
  google-cloud-spanner | 1.13.0 | 1.17.1 | 2020-02-17 | 2020-06-29 | BEAM-10345
  google-cloud-vision | 0.42.0 | 1.0.0 | 2020-03-24 | 2020-03-24 | BEAM-9581
  mock | 2.0.0 | 3.0.5 | 2019-05-20 | 2019-05-20 | BEAM-7369
  mypy-protobuf | 1.18 | 1.23 | 2020-03-24 | 2020-06-29 | BEAM-10346
  oauth2client | 3.0.0 | 4.1.3 | 2018-12-10 | 2018-12-10 | BEAM-6089
  PyHamcrest | 1.10.1 | 2.0.2 | 2020-01-20 | 2020-07-08 | BEAM-9155
  pytest | 4.6.11 | 5.4.3 | None | 2020-07-08 | BEAM-8606
  tenacity | 5.1.5 | 6.2.0 | 2019-11-11 | 2020-06-29 | BEAM-8607
High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
  com.datastax.cassandra:cassandra-driver-core | 3.8.0 | 4.0.0 | 2019-10-29 | 2019-03-18 | BEAM-8674
  com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC6 | 2018-03-20 | 2020-05-16 | BEAM-5809
  com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
  com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.20.0 | 0.28.0 | 2019-02-11 | 2020-02-24 | BEAM-6645
  com.github.spotbugs:spotbugs | 3.1.12 | 4.0.6 | 2019-03-01 | 2020-06-23 | BEAM-7792
  com.github.spotbugs:spotbugs-annotations | 3.1.12 | 4.0.6 | 2019-03-01 | 2020-06-23 | BEAM-6951
  com.google.api:gax | 1.54.0 | 1.57.1 | 2020-02-27 | 2020-07-07 | BEAM-10348
  com.google.api.grpc:grpc-google-cloud-pubsub-v1 | 1.85.1 | 1.89.0 | 2020-03-09 | 2020-06-09 | BEAM-8677
  com.google.api.grpc:grpc-google-common-protos | 1.12.0 | 1.18.0 | 2018-06-29 | 2020-05-04 | BEAM-8633
  com.google.api.grpc:proto-google-cloud-bigquerystorage-v1beta1 | 0.85.1 | 0.100.0 | 2020-01-08 | 2020-06-23 | BEAM-8678
  com.google.api.grpc:proto-google-cloud-bigtable-v2 | 1.9.1 | 1.13.0 | 2020-01-10 | 2020-05-27 | BEAM-8679
  com.google.api.grpc:proto-google-cloud-pubsub-v1 | 1.85.1 | 1.89.0 | 2020-03-09 | 2020-06-09 | BEAM-8681
  com.google.api.grpc:proto-google-cloud-spanner-admin-database-v1 | 1.49.1 | 1.58.0 | 2020-01-28 | 2020-07-07 | BEAM-8682
  com.google.apis:google-api-services-bigquery | v2-rev20191211-1.30.9 | v2-rev20200617-1.30.9 | 2020-03-05 | 2020-07-01 | BEAM-8684
  com.google.apis:google-api-services-clouddebugger | v2-rev20200313-1.30.9 | v2-rev20200501-1.30.9 | 2020-03-24 | 2020-05-13 | BEAM-8750
  com.google.apis:google-api-services-cloudresourcemanager | v1-rev20200311-1.30.9 | v2-rev20200617-1.30.9 | 2020-03-13 | 2020-06-30 | BEAM-8751
  com.google.apis:google-api-services-dataflow | v1b3-rev20200305-1.30.9 | v1beta3-rev12-1.20.0 | 2020-03-19 | 2015-04-29 | BEAM-8752
  com.google.apis:google-api-services-healthcare | v1beta1-rev20200525-1.30.9 | v1-rev20200612-1.30.9 | 2020-06-04 | 2020-06-30 | BEAM-10349
  com.google.apis:google-api-services-pubsub | v1-rev20200312-1.30.9 | v1-rev20200616-1.30.9 | 2020-03-24 | 2020-06-30 | BEAM-8753
  com.google.apis:google-api-services-storage | v1-rev20200226-1.30.9 | v1-rev20200611-1.30.9 | 2020-03-16 | 2020-06-30 | BEAM-8754
  com.google.auto.service:auto-service | 1.0-rc6 | 1.0-rc7 | 2019-07-16 | 2020-05-13 | BEAM-5541
  com.google.auto.service:auto-service-annotations | 1.0-rc6 | 1.0-rc7 | 2019-07-16 | 2020-05-13 | BEAM-10350
  com.google.cloud:google-cloud-bigquery | 1.108.0 | 1.116.3 | 2020-02-28 | 2020-06-18 | BEAM-8687
  com.google.cloud:google-

KinesisIO Tests - are they run anywhere?

2020-07-08 Thread Piotr Szuberski
I'm writing a KinesisIO external transform with a Python wrapper and I found that 
the tests aren't executed anywhere in Jenkins. Am I wrong, or is there a reason 
for that?


Re: beam submit TFX on yarn

2020-07-08 Thread Kyle Weaver
Beam Python does not yet work with Spark on yarn. See
https://issues.apache.org/jira/browse/BEAM-8970 for details.

On Tue, Jul 7, 2020 at 8:52 PM sxqjq  wrote:

>
> I forgot to mention: Java can use the spark-submit command, but I am using Python.
>
>
>
> - Original Message -
>
>
> *From:* sxqjq
>
> *Sent:* 2020-07-08 10:33:49
>
> *To:* dev
>
> *Subject:* beam submit TFX on yarn
>
> Dear all:
>
>
>    I have tested TFX (TensorFlow Extended) submitting to Spark (standalone
> mode) through Beam,
>
> and I want to test how TFX can submit to Spark (YARN mode) through Beam. How do
> I do that?
>
>
> I look forward to hearing from you, thank you!
>
>
> kevin
>
>
>
>
>
> 
>
>
>
>
>
>
>
>


Re: [PROPOSAL] Preparing for Beam 2.23.0 release

2020-07-08 Thread Kyle Weaver
> I may need help with a Samza ValidatesRunner failure [1]. It has been
failing since at least June 24 [2].

Looks like a duplicate of https://issues.apache.org/jira/browse/BEAM-10025.

> 1. Did this issue come up during earlier releases?

Yes, this affected the 2.21 and 2.22 releases. tl;dr it was a newly added
test, so it most likely does not represent a regression.

On Tue, Jul 7, 2020 at 8:54 PM Valentyn Tymofieiev 
wrote:

> There are some test failures on the release branch that need to be
> addressed.
>
> I may need help with a Samza ValidatesRunner failure [1]. It has been
> failing since at least June 24 [2].
>
> 1. Did this issue come up during earlier releases?
> 2. Do we know why this suite is not visible in Beam metrics dashboard[3]?
> If it was, it would have been easier to see when it started failing.
>
> Can Samza runner maintainers please advise?
>
> Thank you.
>
> [1] https://issues.apache.org/jira/browse/BEAM-10424
> [2]
> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/
> [3]
> http://metrics.beam.apache.org/d/D81lW0pmk/post-commit-test-reliability?orgId=1
>
>
> On Tue, Jul 7, 2020 at 8:01 PM Valentyn Tymofieiev 
> wrote:
>
>> So far the release process is going well.
>>
>> Currently running several release validation suites[1], and plan to send
>> out an RC once these are completed.
>>
>> Here is how the community can help:
>>
>> 1. If there is a test suite that you've added recently, but did not add
>> it to .test-infra/jenkins/README.md,
>> feel free to trigger it on the PR[1], and add the trigger command to
>> README.MD.
>> 2. Please add any noteworthy changes to the change log [3].
>> 3. I have received several inquiries off-the-list about
>> cherry-picking changes to the release branch (which would be blocking
>> further steps in the release process), so I would like to remind the
>> current guideline[4]: *An issue should not block the release if the
>> problem exists in the current released version or is a bug in new
>> functionality that does not exist in the current released version. It
>> should be a blocker if the bug is a regression between the currently
>> released version and the release in progress and has no easy workaround.*
>>
>> Thanks,
>> Valentyn
>>
>> [1] https://github.com/apache/beam/pull/12194
>> [2]
>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md
>> [3] https://github.com/apache/beam/blob/master/CHANGES.md
>> [4]
>> https://beam.apache.org/contribute/release-guide/#4-triage-release-blocking-issues-in-jira
>>
>>
>> On Mon, Jul 6, 2020 at 9:47 AM Ahmet Altay  wrote:
>>
>>> Done.
>>>
>>> On Wed, Jul 1, 2020 at 5:40 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Can somebody please add my pypi username (tvalentyn) to the list of
 apache-beam maintainers on PyPi: https://pypi.org/project/apache-beam/
  ?
 Thank you!

 On Wed, Jul 1, 2020 at 4:51 PM Valentyn Tymofieiev 
 wrote:

> Release branch has been cut.
>
> As a reminder, please do not merge commits into the release branch
> directly, instead, loop in the release manager if any cherry-picks are
> required.
>
> Thank you.
>
> On Wed, Jul 1, 2020 at 9:25 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Great, thank you Tobiasz, I will take a look.
>>
>> On Wed, Jul 1, 2020 at 7:27 AM Tobiasz Kędzierski <
>> tobiasz.kedzier...@polidea.com> wrote:
>>
>>> Hi,
>>>
>>> I've just created PR introducing usage of GH-Actions to release
>>> process
>>> https://github.com/apache/beam/pull/12150
>>>
>>> Let me know what you think, maybe you have some suggestions on what
>>> may be improved.
>>>
>>> BR
>>> Tobiasz
>>>
>>> On Wed, Jul 1, 2020 at 6:08 AM Ahmet Altay  wrote:
>>>
 Valentyn,

 +tobiasz.kedzier...@polidea.com  added
 a github action for building python wheel files (
 https://github.com/apache/beam/pull/11877). You should be able to
 build the wheel files using this github action instead of using the
 beam-wheels repo and Travis. Please give it a try during the release
 process.

 Ahmet

 On Tue, Jun 23, 2020 at 7:27 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Friendly reminder that the release cut is slated next week.
>
> If you are aware of *release-blocking* issues, please open a JIRA
> and set the "Fix version" to be 2.23.0.
>
> Please do not set "Fix version" for open non-blocking issues,
> instead set "Fix version" once the issue is actually reso

Re: KinesisIO Tests - are they run anywhere?

2020-07-08 Thread Alexey Romanenko
If you mean the Java KinesisIO tests, then unit tests are running on Jenkins [1], 
and ITs are not, since they require AWS credentials that we don’t have 
dedicated to Beam at the moment.

At the same time, you can run KinesisIOIT with your own credentials, like we do 
in Talend (the company that I work for).
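
For reference, such a run looks roughly like this (a sketch only; the exact
Gradle task and pipeline option names are defined by the module's build.gradle
and by KinesisTestOptions in the Beam tree, so treat these as placeholders):

./gradlew :sdks:java:io:kinesis:integrationTest \
    --tests org.apache.beam.sdk.io.kinesis.KinesisIOIT \
    -DintegrationTestPipelineOptions='["--awsAccessKey=...","--awsSecretKey=...","--awsKinesisStream=..."]'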

[1] https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/12209/testReport/org.apache.beam.sdk.io.kinesis/


> On 8 Jul 2020, at 13:11, Piotr Szuberski  wrote:
> 
> I'm writing a KinesisIO external transform with a Python wrapper and I found that 
> the tests aren't executed anywhere in Jenkins. Am I wrong, or is there a 
> reason for that?



Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Alexey Romanenko
Hi Max,

I’m +1 for backporting as well, but that seems quite complicated since we 
distribute release source code from https://archive.apache.org/
Perhaps we should just warn users about this issue and how to work around it.

Any other ideas?

> On 8 Jul 2020, at 11:46, Maximilian Michels  wrote:
> 
> Hi Alexey,
> 
> I also came across this issue when building a custom Beam version. I applied 
> the same fix (https://github.com/apache/beam/pull/11527) which you have 
> mentioned.
> 
> It appears that the Maven dependencies changed or are no longer available 
> which causes the missing class files.
> 
> +1 for backporting the fix to the release branches.
> 
> Cheers,
> Max
> 
> On 08.07.20 11:36, Alexey Romanenko wrote:
>> Hello,
>> Some days ago I noticed that I can’t build the project from old release 
>> branches . For example, I wanted to build and run Spark Job Server from 
>> “release-2.20.0” branch and it failed:
>> ./gradlew :runners:spark:job-server:runShadow --stacktrace
>> * Exception is:
>> org.gradle.api.tasks.TaskExecutionException: Execution failed for task 
>> ':model:pipeline:compileJava'.
>> …
>> Caused by: org.gradle.internal.UncheckedException: 
>> java.lang.ClassNotFoundException: 
>> com.google.errorprone.ErrorProneCompiler$Builder
>> …
>> I experienced the same issue for “release-2.19.0” and  “release-2.21.0” 
>> branches, I didn’t check older branches but seems it’s a global issue for 
>> “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.
>> This is already known issue and it was fixed for 2.22.0 [1] a while ago. By 
>> applying a fix from [2] on top of previous branch, for example, 
>> “release-2.20.0” branch I’ve managed to build it. Though, the problem for 
>> old branches (<2.22.0) is still there - it’s not possible to build them 
>> right after checkout without applying the fix.
>> So, there are two questions:
>> 1. Is anyone aware why the old static version of gradle-errorprone-plugin 
>> fails for the branches that were successfully built before?
>> 2. Do we have to fix it for release branches <2.22.0 (either cherry-pick the 
>> fix for 2.22.0 or somehow else if it’s possible)?
>> [1] https://issues.apache.org/jira/browse/BEAM-10263
>> [2] https://github.com/apache/beam/pull/11527



Re: [PROPOSAL] Preparing for Beam 2.23.0 release

2020-07-08 Thread Valentyn Tymofieiev
Thank you, Kyle!

On Wed, Jul 8, 2020 at 10:03 AM Kyle Weaver  wrote:

> > I may need help with a Samza ValidatesRunner failure [1]. It has been
> failing since at least June 24 [2].
>
> Looks like a duplicate of https://issues.apache.org/jira/browse/BEAM-10025.
>
> > 1. Did this issue come up during earlier releases?
>
> Yes, this affected the 2.21 and 2.22 releases. tl;dr it was a newly added
> test, so it most likely does not represent a regression.
>
> On Tue, Jul 7, 2020 at 8:54 PM Valentyn Tymofieiev 
> wrote:
>
>> There are some test failures on the release branch that need to be
>> addressed.
>>
>> I may need help with a Samza ValidatesRunner failure [1]. It has been
>> failing since at least June 24 [2].
>>
>> 1. Did this issue come up during earlier releases?
>> 2. Do we know why this suite is not visible in Beam metrics dashboard[3]?
>> If it was, it would have been easier to see when it started failing.
>>
>> Can Samza runner maintainers please advise?
>>
>> Thank you.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-10424
>> [2]
>> https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/
>> [3]
>> http://metrics.beam.apache.org/d/D81lW0pmk/post-commit-test-reliability?orgId=1
>>
>>
>> On Tue, Jul 7, 2020 at 8:01 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> So far the release process is going well.
>>>
>>> Currently running several release validation suites[1], and plan to send
>>> out an RC once these are completed.
>>>
>>> Here is how the community can help:
>>>
>>> 1. If there is a test suite that you've added recently, but did not add
>>> it to .test-infra/jenkins/README.md,
>>> feel free to trigger it on the PR[1], and add the trigger command to
>>> README.MD.
>>> 2. Please add any noteworthy changes to the change log [3].
>>> 3. I have received several inquiries off-the-list about
>>> cherry-picking changes to the release branch (which would be blocking
>>> further steps in the release process), so I would like to remind the
>>> current guideline[4]: *An issue should not block the release if the
>>> problem exists in the current released version or is a bug in new
>>> functionality that does not exist in the current released version. It
>>> should be a blocker if the bug is a regression between the currently
>>> released version and the release in progress and has no easy workaround.*
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> [1] https://github.com/apache/beam/pull/12194
>>> [2]
>>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md
>>> [3] https://github.com/apache/beam/blob/master/CHANGES.md
>>> [4]
>>> https://beam.apache.org/contribute/release-guide/#4-triage-release-blocking-issues-in-jira
>>>
>>>
>>> On Mon, Jul 6, 2020 at 9:47 AM Ahmet Altay  wrote:
>>>
 Done.

 On Wed, Jul 1, 2020 at 5:40 PM Valentyn Tymofieiev 
 wrote:

> Can somebody please add my pypi username (tvalentyn) to the list of
> apache-beam maintainers on PyPi: https://pypi.org/project/apache-beam/
>  ?
> Thank you!
>
> On Wed, Jul 1, 2020 at 4:51 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Release branch has been cut.
>>
>> As a reminder, please do not merge commits into the release branch
>> directly, instead, loop in the release manager if any cherry-picks are
>> required.
>>
>> Thank you.
>>
>> On Wed, Jul 1, 2020 at 9:25 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Great, thank you Tobiasz, I will take a look.
>>>
>>> On Wed, Jul 1, 2020 at 7:27 AM Tobiasz Kędzierski <
>>> tobiasz.kedzier...@polidea.com> wrote:
>>>
 Hi,

 I've just created PR introducing usage of GH-Actions to release
 process
 https://github.com/apache/beam/pull/12150

 Let me know what you think, maybe you have some suggestions on what
 may be improved.

 BR
 Tobiasz

 On Wed, Jul 1, 2020 at 6:08 AM Ahmet Altay 
 wrote:

> Valentyn,
>
> +tobiasz.kedzier...@polidea.com  added
> a github action for building python wheel files (
> https://github.com/apache/beam/pull/11877). You should be able to
> build the wheel files using this github action instead of using the
> beam-wheels repo and Travis. Please give it a try during the release
> process.
>
> Ahmet
>
> On Tue, Jun 23, 2020 at 7:27 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Friendly reminder that the release cut is slated next week.
>>
>> If you are aware of

Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Pablo Estrada
Ah that's annoying that a dependency would be removed from maven. I thought
that was not meant to happen? This must be an issue happening for many
other projects...
Why is errorprone a dependency anyway?

To fix on previous release branches, we would need to make a new release,
is it not? Since hashes would change..

On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko 
wrote:

> Hi Max,
>
> I’m +1 for back porting as well but that seems quite complicated since we
> distribute release source code from https://archive.apache.org/
> Perhaps, we should just warn users about this issue and how to workaround
> it.
>
> Any other ideas?
>
> > On 8 Jul 2020, at 11:46, Maximilian Michels  wrote:
> >
> > Hi Alexey,
> >
> > I also came across this issue when building a custom Beam version. I
> applied the same fix (https://github.com/apache/beam/pull/11527) which
> you have mentioned.
> >
> > It appears that the Maven dependencies changed or are no longer
> available which causes the missing class files.
> >
> > +1 for backporting the fix to the release branches.
> >
> > Cheers,
> > Max
> >
> > On 08.07.20 11:36, Alexey Romanenko wrote:
> >> Hello,
> >> Some days ago I noticed that I can’t build the project from old release
> branches . For example, I wanted to build and run Spark Job Server from
> “release-2.20.0” branch and it failed:
> >> ./gradlew :runners:spark:job-server:runShadow --stacktrace
> >> * Exception is:
> >> org.gradle.api.tasks.TaskExecutionException: Execution failed for task
> ':model:pipeline:compileJava'.
> >> …
> >> Caused by: org.gradle.internal.UncheckedException:
> java.lang.ClassNotFoundException:
> com.google.errorprone.ErrorProneCompiler$Builder
> >> …
> >> I experienced the same issue for “release-2.19.0” and  “release-2.21.0”
> branches, I didn’t check older branches but seems it’s a global issue for
> “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.
> >> This is already known issue and it was fixed for 2.22.0 [1] a while
> ago. By applying a fix from [2] on top of previous branch, for example,
> “release-2.20.0” branch I’ve managed to build it. Though, the problem for
> old branches (<2.22.0) is still there - it’s not possible to build them
> right after checkout without applying the fix.
> >> So, there are two questions:
> >> 1. Is anyone aware why the old static version of
> gradle-errorprone-plugin fails for the branches that were successfully
> built before?
> >> 2. Do we have to fix it for release branches <2.22.0 (either
> cherry-pick the fix for 2.22.0 or somehow else if it’s possible)?
> >> [1] https://issues.apache.org/jira/browse/BEAM-10263
> >> [2] https://github.com/apache/beam/pull/11527
>
>


Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Kyle Weaver
> To fix on previous release branches, we would need to make a new release,
is it not? Since hashes would change..

Would it be alright to patch the release branches on Github and leave the
released source as-is? Github release branches themselves aren't release
artifacts, so I think it should be okay to patch them without making a new
release.

On Wed, Jul 8, 2020 at 11:59 AM Pablo Estrada  wrote:

> Ah that's annoying that a dependency would be removed from maven. I
> thought that was not meant to happen? This must be an issue happening for
> many other projects...
> Why is errorprone a dependency anyway?
>
> To fix on previous release branches, we would need to make a new release,
> is it not? Since hashes would change..
>
> On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko 
> wrote:
>
>> Hi Max,
>>
>> I’m +1 for back porting as well but that seems quite complicated since we
>> distribute release source code from https://archive.apache.org/
>> Perhaps, we should just warn users about this issue and how to workaround
>> it.
>>
>> Any other ideas?
>>
>> > On 8 Jul 2020, at 11:46, Maximilian Michels  wrote:
>> >
>> > Hi Alexey,
>> >
>> > I also came across this issue when building a custom Beam version. I
>> applied the same fix (https://github.com/apache/beam/pull/11527) which
>> you have mentioned.
>> >
>> > It appears that the Maven dependencies changed or are no longer
>> available which causes the missing class files.
>> >
>> > +1 for backporting the fix to the release branches.
>> >
>> > Cheers,
>> > Max
>> >
>> > On 08.07.20 11:36, Alexey Romanenko wrote:
>> >> Hello,
>> >> Some days ago I noticed that I can’t build the project from old
>> release branches . For example, I wanted to build and run Spark Job Server
>> from “release-2.20.0” branch and it failed:
>> >> ./gradlew :runners:spark:job-server:runShadow --stacktrace
>> >> * Exception is:
>> >> org.gradle.api.tasks.TaskExecutionException: Execution failed for task
>> ':model:pipeline:compileJava'.
>> >> …
>> >> Caused by: org.gradle.internal.UncheckedException:
>> java.lang.ClassNotFoundException:
>> com.google.errorprone.ErrorProneCompiler$Builder
>> >> …
>> >> I experienced the same issue for “release-2.19.0” and
>> “release-2.21.0” branches, I didn’t check older branches but seems it’s a
>> global issue for “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.
>> >> This is already known issue and it was fixed for 2.22.0 [1] a while
>> ago. By applying a fix from [2] on top of previous branch, for example,
>> “release-2.20.0” branch I’ve managed to build it. Though, the problem for
>> old branches (<2.22.0) is still there - it’s not possible to build them
>> right after checkout without applying the fix.
>> >> So, there are two questions:
>> >> 1. Is anyone aware why the old static version of
>> gradle-errorprone-plugin fails for the branches that were successfully
>> built before?
>> >> 2. Do we have to fix it for release branches <2.22.0 (either
>> cherry-pick the fix for 2.22.0 or somehow else if it’s possible)?
>> >> [1] https://issues.apache.org/jira/browse/BEAM-10263
>> >> [2] https://github.com/apache/beam/pull/11527
>>
>>


Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Kenneth Knowles
On Wed, Jul 8, 2020 at 12:07 PM Kyle Weaver  wrote:

> > To fix on previous release branches, we would need to make a new
> release, is it not? Since hashes would change..
>
> Would it be alright to patch the release branches on Github and leave the
> released source as-is? Github release branches themselves aren't release
> artifacts, so I think it should be okay to patch them without making a new
> release.
>

Yea. There are tags for the exact hashes that RCs were built from. The
release branch is fine to get new commits, and then if anyone wants to
build a patch release they will get those commits.

Kenn


> On Wed, Jul 8, 2020 at 11:59 AM Pablo Estrada  wrote:
>
>> Ah that's annoying that a dependency would be removed from maven. I
>> thought that was not meant to happen? This must be an issue happening for
>> many other projects...
>> Why is errorprone a dependency anyway?
>>
>> To fix on previous release branches, we would need to make a new release,
>> is it not? Since hashes would change..
>>
>> On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Hi Max,
>>>
>>> I’m +1 for back porting as well but that seems quite complicated since
>>> we distribute release source code from https://archive.apache.org/
>>> Perhaps, we should just warn users about this issue and how to
>>> workaround it.
>>>
>>> Any other ideas?
>>>
>>> > On 8 Jul 2020, at 11:46, Maximilian Michels  wrote:
>>> >
>>> > Hi Alexey,
>>> >
>>> > I also came across this issue when building a custom Beam version. I
>>> applied the same fix (https://github.com/apache/beam/pull/11527) which
>>> you have mentioned.
>>> >
>>> > It appears that the Maven dependencies changed or are no longer
>>> available which causes the missing class files.
>>> >
>>> > +1 for backporting the fix to the release branches.
>>> >
>>> > Cheers,
>>> > Max
>>> >
>>> > On 08.07.20 11:36, Alexey Romanenko wrote:
>>> >> Hello,
>>> >> Some days ago I noticed that I can’t build the project from old
>>> release branches . For example, I wanted to build and run Spark Job Server
>>> from “release-2.20.0” branch and it failed:
>>> >> ./gradlew :runners:spark:job-server:runShadow --stacktrace
>>> >> * Exception is:
>>> >> org.gradle.api.tasks.TaskExecutionException: Execution failed for
>>> task ':model:pipeline:compileJava'.
>>> >> …
>>> >> Caused by: org.gradle.internal.UncheckedException:
>>> java.lang.ClassNotFoundException:
>>> com.google.errorprone.ErrorProneCompiler$Builder
>>> >> …
>>> >> I experienced the same issue for “release-2.19.0” and
>>> “release-2.21.0” branches, I didn’t check older branches but seems it’s a
>>> global issue for “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.
>>> >> This is already known issue and it was fixed for 2.22.0 [1] a while
>>> ago. By applying a fix from [2] on top of previous branch, for example,
>>> “release-2.20.0” branch I’ve managed to build it. Though, the problem for
>>> old branches (<2.22.0) is still there - it’s not possible to build them
>>> right after checkout without applying the fix.
>>> >> So, there are two questions:
>>> >> 1. Is anyone aware why the old static version of
>>> gradle-errorprone-plugin fails for the branches that were successfully
>>> built before?
>>> >> 2. Do we have to fix it for release branches <2.22.0 (either
>>> cherry-pick the fix for 2.22.0 or somehow else if it’s possible)?
>>> >> [1] https://issues.apache.org/jira/browse/BEAM-10263
>>> >> [2] https://github.com/apache/beam/pull/11527
>>>
>>>


Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Ismaël Mejía
I still don't understand how this happened. Was the dependency hosted
in some other place?

Dependencies CAN NOT be removed from central to avoid these issues.
https://central.sonatype.org/articles/2014/Feb/06/can-i-change-a-component-on-central/

The question is: where was this dependency coming from? And how can our
build be so brittle that it breaks on an 'optional' dependency,
error-prone?
How can we prevent this from happening in the future?


On Wed, Jul 8, 2020 at 8:59 PM Pablo Estrada  wrote:
>
> Ah that's annoying that a dependency would be removed from maven. I thought 
> that was not meant to happen? This must be an issue happening for many other 
> projects...
> Why is errorprone a dependency anyway?
>
> To fix on previous release branches, we would need to make a new release, is 
> it not? Since hashes would change..
>
> On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko  
> wrote:
>>
>> Hi Max,
>>
>> I’m +1 for back porting as well but that seems quite complicated since we 
>> distribute release source code from https://archive.apache.org/
>> Perhaps, we should just warn users about this issue and how to workaround it.
>>
>> Any other ideas?
>>
>> > On 8 Jul 2020, at 11:46, Maximilian Michels  wrote:
>> >
>> > Hi Alexey,
>> >
>> > I also came across this issue when building a custom Beam version. I 
>> > applied the same fix (https://github.com/apache/beam/pull/11527) which you 
>> > have mentioned.
>> >
>> > It appears that the Maven dependencies changed or are no longer available 
>> > which causes the missing class files.
>> >
>> > +1 for backporting the fix to the release branches.
>> >
>> > Cheers,
>> > Max
>> >
>> > On 08.07.20 11:36, Alexey Romanenko wrote:
>> >> Hello,
>> >> Some days ago I noticed that I can’t build the project from old release 
>> >> branches . For example, I wanted to build and run Spark Job Server from 
>> >> “release-2.20.0” branch and it failed:
>> >> ./gradlew :runners:spark:job-server:runShadow --stacktrace
>> >> * Exception is:
>> >> org.gradle.api.tasks.TaskExecutionException: Execution failed for task 
>> >> ':model:pipeline:compileJava'.
>> >> …
>> >> Caused by: org.gradle.internal.UncheckedException: 
>> >> java.lang.ClassNotFoundException: 
>> >> com.google.errorprone.ErrorProneCompiler$Builder
>> >> …
>> >> I experienced the same issue for “release-2.19.0” and  “release-2.21.0” 
>> >> branches, I didn’t check older branches but seems it’s a global issue for 
>> >> “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.
>> >> This is already known issue and it was fixed for 2.22.0 [1] a while ago. 
>> >> By applying a fix from [2] on top of previous branch, for example, 
>> >> “release-2.20.0” branch I’ve managed to build it. Though, the problem for 
>> >> old branches (<2.22.0) is still there - it’s not possible to build them 
>> >> right after checkout without applying the fix.
>> >> So, there are two questions:
>> >> 1. Is anyone aware why the old static version of gradle-errorprone-plugin 
>> >> fails for the branches that were successfully built before?
>> >> 2. Do we have to fix it for release branches <2.22.0 (either cherry-pick 
>> >> the fix for 2.22.0 or somehow else if it’s possible)?
>> >> [1] https://issues.apache.org/jira/browse/BEAM-10263
>> >> [2] https://github.com/apache/beam/pull/11527
>>


Season of Docs Interest

2020-07-08 Thread Sharon Lin
Hi Aizhamal,

I'm a 4th year bachelors student at MIT studying computer science, and I'm
interested in working with Apache Beam for Season of Docs! I recognize that
it's close to the application deadline, but I'm an avid user of Apache
Spark and would really love to help with documenting tools for developers.

I'm interested in working on the update of the runner comparison page /
capability matrix. I've set up Spark and Dataflow before, and I believe I
have the necessary background to get started on the deliverables once the
program begins.

I've attached my resume if that's helpful. Thanks, and I hope to work with
you!

Best,
Sharon Lin
Department of EECS
Massachusetts Institute of Technology




Re: Errorprone plugin fails for release branches <2.22.0

2020-07-08 Thread Kenneth Knowles
I believe the hosting is https://plugins.gradle.org/m2/
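
For anyone checking where plugins resolve from, Gradle can make this explicit in
settings.gradle (a sketch of standard Gradle configuration, not a description of
Beam's current setup):

pluginManagement {
    repositories {
        gradlePluginPortal() // i.e. https://plugins.gradle.org/m2/
        mavenCentral()
    }
}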

On Wed, Jul 8, 2020 at 12:33 PM Ismaël Mejía  wrote:

> I still don't understand how this happened. Was the dependency hosted
> in other place?
>
> Dependencies CAN NOT be removed from central to avoid these issues.
>
> https://central.sonatype.org/articles/2014/Feb/06/can-i-change-a-component-on-central/
>
> The question is where was this dependency coming from? and how can our
> build be so brittle to be broken on a 'optional' dependency,
> error-prone.
> How can we prevent this from happening in the future?
>
>
> On Wed, Jul 8, 2020 at 8:59 PM Pablo Estrada  wrote:
> >
> > Ah that's annoying that a dependency would be removed from maven. I
> thought that was not meant to happen? This must be an issue happening for
> many other projects...
> > Why is errorprone a dependency anyway?
> >
> > To fix on previous release branches, we would need to make a new
> release, is it not? Since hashes would change..
> >
> > On Wed, Jul 8, 2020 at 10:21 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>
> >> Hi Max,
> >>
> >> I’m +1 for back porting as well but that seems quite complicated since
> we distribute release source code from https://archive.apache.org/
> >> Perhaps, we should just warn users about this issue and how to
> workaround it.
> >>
> >> Any other ideas?
> >>
> >> > On 8 Jul 2020, at 11:46, Maximilian Michels  wrote:
> >> >
> >> > Hi Alexey,
> >> >
> >> > I also came across this issue when building a custom Beam version. I
> applied the same fix (https://github.com/apache/beam/pull/11527) which
> you have mentioned.
> >> >
> >> > It appears that the Maven dependencies changed or are no longer
> available which causes the missing class files.
> >> >
> >> > +1 for backporting the fix to the release branches.
> >> >
> >> > Cheers,
> >> > Max
> >> >
> >> > On 08.07.20 11:36, Alexey Romanenko wrote:
> >> >> Hello,
> >> >> Some days ago I noticed that I can’t build the project from old
> release branches . For example, I wanted to build and run Spark Job Server
> from “release-2.20.0” branch and it failed:
> >> >> ./gradlew :runners:spark:job-server:runShadow --stacktrace
> >> >> * Exception is:
> >> >> org.gradle.api.tasks.TaskExecutionException: Execution failed for
> task ':model:pipeline:compileJava'.
> >> >> …
> >> >> Caused by: org.gradle.internal.UncheckedException:
> java.lang.ClassNotFoundException:
> com.google.errorprone.ErrorProneCompiler$Builder
> >> >> …
> >> >> I experienced the same issue for “release-2.19.0” and
> “release-2.21.0” branches, I didn’t check older branches but seems it’s a
> global issue for “net.ltgt.gradle:gradle-errorprone-plugin:0.0.13”.
> >> >> This is already known issue and it was fixed for 2.22.0 [1] a while
> ago. By applying a fix from [2] on top of previous branch, for example,
> “release-2.20.0” branch I’ve managed to build it. Though, the problem for
> old branches (<2.22.0) is still there - it’s not possible to build them
> right after checkout without applying the fix.
> >> >> So, there are two questions:
> >> >> 1. Is anyone aware why the old static version of
> gradle-errorprone-plugin fails for the branches that were successfully
> built before?
> >> >> 2. Do we have to fix it for release branches <2.22.0 (either
> cherry-pick the fix for 2.22.0 or somehow else if it’s possible)?
> >> >> [1] https://issues.apache.org/jira/browse/BEAM-10263
> >> >> [2] https://github.com/apache/beam/pull/11527
> >>
>


Beam Summit Status Report - 7/8

2020-07-08 Thread Brittany Hermann
Hi folks,

I wanted to provide you with the Beam Summit Status report from today's
meeting. If you would like to join the next public meeting on Wednesday,
July 22nd at 11:30 AM PST please let me know and I will send a calendar
invite over to you!

Also don't forget to register for the Summit!

https://docs.google.com/document/d/11PXOBUbeldgPqz6OlTswCal6SxyX76Bb_ZVKBdwsd7o/edit?usp=sharing

Have a great day!

-- 

Brittany Hermann

Open Source Program Manager (Provided by Adecco Staffing)

1190 Bordeaux Drive, Building 4, Sunnyvale, CA 94089



[no subject]

2020-07-08 Thread Emily Ye
Greetings, dev@beam! Just wanted to introduce myself - I'm a SWE at Google who 
will be contributing to Beam going forward. I'm pretty new to the data 
processing space but I'm excited to learn, and will probably be asking lots of 
questions here. Looking forward to getting to know the community! 

- Emily

 


Finer-grained test runs?

2020-07-08 Thread Kenneth Knowles
Hi all,

I wanted to start a discussion about getting finer grained test execution
more focused on particular artifacts/modules. In particular, I want to
gather the downsides and impossibilities. So I will make a proposal that
people can disagree with easily.

Context: job_PreCommit_Java is a monolithic job that...

 - takes 40-50 minutes
 - runs tests of maybe a bit under 100 modules
 - executes over 10k tests
 - runs on any change to model/, sdks/java/, runners/, examples/java/,
examples/kotlin/, release/ (only exception is SQL)
 - is pretty flaky (because it conflates so many independent test flakes,
mostly runners and IOs)

See a scan at https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest

Proposal: Eliminate monolithic job and break into finer-grained jobs that
operate on two principles:

1. Test run should be focused on validating one artifact or a specific
integration of other artifacts.
2. Test run should trigger only on things that could affect the validity of
that artifact.

For example, a starting point is to separate:

 - core SDK
 - runner helper libs
 - each runner
 - each extension
 - each IO

Benefits:

 - changing an IO or runner would not trigger the 20 minutes of core SDK
tests
 - changing a runner would not trigger the long IO local integration tests
 - changing the core SDK could potentially not run as many tests in
presubmit, but maybe it would and they would be separately reported results
with clear flakiness signal

There are 72 build.gradle files under sdks/java/ and 30 under runners/.
They don't all require a separate job. But there are still enough that it
is worth automating. Does anyone know what options we might have? It
does not even have to be in Jenkins. We could have one "test the things"
Jenkins job if the underlying tool (Gradle) could resolve what needs to be
run. Caching is not sufficient in my experience.

(there are other quick fix alternatives to shrinking this time, but I want
to focus on bigger picture)

Kenn


Re: Finer-grained test runs?

2020-07-08 Thread Brian Hulette
> We could have one "test the things" Jenkins job if the underlying tool
(Gradle) could resolve what needs to be run.

I think this would be much better. Otherwise it seems our Jenkins
definitions are just duplicating information that's already stored in the
build.gradle files which seems error-prone, especially for tests validating
combinations of artifacts. I did some quick searching and came across [1].
It doesn't look like the project has had a lot of recent activity, but it
claims to do what we need:

> The plugin will generate new tasks on the root project for each task
provided on the configuration with the following pattern
${taskName}ChangedModules.
> These generated tasks will run the changedModules task to get the list of
changed modules and for each one will call the given task.

Of course this would only really help us with java tests as gradle doesn't
know much about the structure of dependencies within the python (and go?)
SDK.

Brian

[1] https://github.com/ismaeldivita/change-tracker-plugin
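
For the record, wiring that plugin up might look roughly like the sketch
below. The extension and property names are guesses from the README alone,
not verified plugin API, so treat this as a shape rather than a recipe.

    // Hypothetical root build.gradle configuration; with 'test' configured,
    // the generated root task would be testChangedModules per the pattern
    // quoted above.
    changeTracker {
      tasks = ['test']     // one ${taskName}ChangedModules task per entry
      branch = 'master'    // assumed: the branch to diff against
    }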

On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:

> Hi all,
>
> I wanted to start a discussion about getting finer grained test execution
> more focused on particular artifacts/modules. In particular, I want to
> gather the downsides and impossibilities. So I will make a proposal that
> people can disagree with easily.
>
> Context: job_PreCommit_Java is a monolithic job that...
>
>  - takes 40-50 minutes
>  - runs tests of maybe a bit under 100 modules
>  - executes over 10k tests
>  - runs on any change to model/, sdks/java/, runners/, examples/java/,
> examples/kotlin/, release/ (only exception is SQL)
>  - is pretty flaky (because it conflates so many independent test flakes,
> mostly runners and IOs)
>
> See a scan at
> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>
> Proposal: Eliminate monolithic job and break into finer-grained jobs that
> operate on two principles:
>
> 1. Test run should be focused on validating one artifact or a specific
> integration of other artifacts.
> 2. Test run should trigger only on things that could affect the validity
> of that artifact.
>
> For example, a starting point is to separate:
>
>  - core SDK
>  - runner helper libs
>  - each runner
>  - each extension
>  - each IO
>
> Benefits:
>
>  - changing an IO or runner would not trigger the 20 minutes of core SDK
> tests
>  - changing a runner would not trigger the long IO local integration tests
>  - changing the core SDK could potentially run fewer tests in presubmit;
> or maybe it would run them all, but as separately reported results with a
> clear flakiness signal
>
> There are 72 build.gradle files under sdks/java/ and 30 under runners/.
> They don't all require a separate job, but there are still enough that
> automation is worth it. Does anyone know what options we might have? It
> does not even have to be in Jenkins. We could have one "test the things"
> Jenkins job if the underlying tool (Gradle) could resolve what needs to be
> run. Caching is not sufficient in my experience.
>
> (there are other quick fix alternatives to shrinking this time, but I want
> to focus on bigger picture)
>
> Kenn
>


Re: Finer-grained test runs?

2020-07-08 Thread Kenneth Knowles
That's a good start. It is new enough and with few enough commits that I'd
want to do some thorough experimentation. Our build is complex enough with
a lot of ad hoc coding that we might end up maintaining whatever we
choose...

In my ideal scenario the list of "what else to test" would be manually
editable, or even strictly opt-in. Automatically testing everything that
might be affected quickly runs into scaling problems too. It could make
sense in post-commit but less so in pre-commit.

Kenn

On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette  wrote:

> > We could have one "test the things" Jenkins job if the underlying tool
> (Gradle) could resolve what needs to be run.
>
> I think this would be much better. Otherwise it seems our Jenkins
> definitions are just duplicating information that's already stored in the
> build.gradle files which seems error-prone, especially for tests validating
> combinations of artifacts. I did some quick searching and came across [1].
> It doesn't look like the project has had a lot of recent activity, but it
> claims to do what we need:
>
> > The plugin will generate new tasks on the root project for each task
> provided on the configuration with the following pattern
> ${taskName}ChangedModules.
> > These generated tasks will run the changedModules task to get the list
> of changed modules and for each one will call the given task.
>
> Of course this would only really help us with java tests as gradle doesn't
> know much about the structure of dependencies within the python (and go?)
> SDK.
>
> Brian
>
> [1] https://github.com/ismaeldivita/change-tracker-plugin
>
> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> I wanted to start a discussion about getting finer grained test execution
>> more focused on particular artifacts/modules. In particular, I want to
>> gather the downsides and impossibilities. So I will make a proposal that
>> people can disagree with easily.
>>
>> Context: job_PreCommit_Java is a monolithic job that...
>>
>>  - takes 40-50 minutes
>>  - runs tests of maybe a bit under 100 modules
>>  - executes over 10k tests
>>  - runs on any change to model/, sdks/java/, runners/, examples/java/,
>> examples/kotlin/, release/ (only exception is SQL)
>>  - is pretty flaky (because it conflates so many independent test flakes,
>> mostly runners and IOs)
>>
>> See a scan at
>> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>>
>> Proposal: Eliminate monolithic job and break into finer-grained jobs that
>> operate on two principles:
>>
>> 1. Test run should be focused on validating one artifact or a specific
>> integration of other artifacts.
>> 2. Test run should trigger only on things that could affect the validity
>> of that artifact.
>>
>> For example, a starting point is to separate:
>>
>>  - core SDK
>>  - runner helper libs
>>  - each runner
>>  - each extension
>>  - each IO
>>
>> Benefits:
>>
>>  - changing an IO or runner would not trigger the 20 minutes of core SDK
>> tests
>>  - changing a runner would not trigger the long IO local integration tests
>>  - changing the core SDK could potentially run fewer tests in presubmit;
>> or maybe it would run them all, but as separately reported results with a
>> clear flakiness signal
>>
>> There are 72 build.gradle files under sdks/java/ and 30 under runners/.
>> They don't all require a separate job, but there are still enough that
>> automation is worth it. Does anyone know what options we might have? It
>> does not even have to be in Jenkins. We could have one "test the things"
>> Jenkins job if the underlying tool (Gradle) could resolve what needs to be
>> run. Caching is not sufficient in my experience.
>>
>> (there are other quick fix alternatives to shrinking this time, but I
>> want to focus on bigger picture)
>>
>> Kenn
>>
>


Re: Finer-grained test runs?

2020-07-08 Thread Luke Cwik
I'm not sure that breaking it up will be significantly faster since each
module needs to build its ancestors and run tests of itself and all of its
descendants, which isn't a trivial amount of work. We have only so many
executors and with the increased number of jobs, won't we just be waiting
for queued jobs to start? I agree that we would have better visibility
though in github and also in Jenkins.

Fixing flaky tests would help improve our test signal as well. Not many
willing people here, though, but it could be less work than building and
maintaining so many different jobs.


On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles  wrote:

> That's a good start. It is new enough and with few enough commits that I'd
> want to do some thorough experimentation. Our build is complex enough with
> a lot of ad hoc coding that we might end up maintaining whatever we
> choose...
>
> In my ideal scenario the list of "what else to test" would be manually
> editable, or even strictly opt-in. Automatically testing everything that
> might be affected quickly runs into scaling problems too. It could make
> sense in post-commit but less so in pre-commit.
>
> Kenn
>
> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette  wrote:
>
>> > We could have one "test the things" Jenkins job if the underlying tool
>> (Gradle) could resolve what needs to be run.
>>
>> I think this would be much better. Otherwise it seems our Jenkins
>> definitions are just duplicating information that's already stored in the
>> build.gradle files which seems error-prone, especially for tests validating
>> combinations of artifacts. I did some quick searching and came across [1].
>> It doesn't look like the project has had a lot of recent activity, but it
>> claims to do what we need:
>>
>> > The plugin will generate new tasks on the root project for each task
>> provided on the configuration with the following pattern
>> ${taskName}ChangedModules.
>> > These generated tasks will run the changedModules task to get the list
>> of changed modules and for each one will call the given task.
>>
>> Of course this would only really help us with java tests as gradle
>> doesn't know much about the structure of dependencies within the python
>> (and go?) SDK.
>>
>> Brian
>>
>> [1] https://github.com/ismaeldivita/change-tracker-plugin
>>
>> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> I wanted to start a discussion about getting finer grained test
>>> execution more focused on particular artifacts/modules. In particular, I
>>> want to gather the downsides and impossibilities. So I will make a proposal
>>> that people can disagree with easily.
>>>
>>> Context: job_PreCommit_Java is a monolithic job that...
>>>
>>>  - takes 40-50 minutes
>>>  - runs tests of maybe a bit under 100 modules
>>>  - executes over 10k tests
>>>  - runs on any change to model/, sdks/java/, runners/, examples/java/,
>>> examples/kotlin/, release/ (only exception is SQL)
>>>  - is pretty flaky (because it conflates so many independent test
>>> flakes, mostly runners and IOs)
>>>
>>> See a scan at
>>> https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest
>>>
>>> Proposal: Eliminate monolithic job and break into finer-grained jobs
>>> that operate on two principles:
>>>
>>> 1. Test run should be focused on validating one artifact or a specific
>>> integration of other artifacts.
>>> 2. Test run should trigger only on things that could affect the validity
>>> of that artifact.
>>>
>>> For example, a starting point is to separate:
>>>
>>>  - core SDK
>>>  - runner helper libs
>>>  - each runner
>>>  - each extension
>>>  - each IO
>>>
>>> Benefits:
>>>
>>>  - changing an IO or runner would not trigger the 20 minutes of core SDK
>>> tests
>>>  - changing a runner would not trigger the long IO local integration
>>> tests
>>>  - changing the core SDK could potentially run fewer tests in presubmit;
>>> or maybe it would run them all, but as separately reported results with a
>>> clear flakiness signal
>>>
>>> There are 72 build.gradle files under sdks/java/ and 30 under runners/.
>>> They don't all require a separate job, but there are still enough that
>>> automation is worth it. Does anyone know what options we might have? It
>>> does not even have to be in Jenkins. We could have one "test the things"
>>> Jenkins job if the underlying tool (Gradle) could resolve what needs to be
>>> run. Caching is not sufficient in my experience.
>>>
>>> (there are other quick fix alternatives to shrinking this time, but I
>>> want to focus on bigger picture)
>>>
>>> Kenn
>>>
>>


Re: Finer-grained test runs?

2020-07-08 Thread Robert Bradshaw
On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik  wrote:
>
> I'm not sure that breaking it up will be significantly faster since each 
> module needs to build its ancestors and run tests of itself and all of its 
> descendants, which isn't a trivial amount of work. We have only so many 
> executors and with the increased number of jobs, won't we just be waiting for 
> queued jobs to start?

I think that depends on how many fewer tests we could run (or rerun)
for the average PR. (It would also be nice if we could share build
artifacts across executors (is there something like ccache for
javac?), but maybe that's too far-fetched?)

> I agree that we would have better visibility though in github and also in 
> Jenkins.

I do have to say having to scroll through a huge number of github
checks is not always an improvement.

> Fixing flaky tests would help improve our test signal as well. Not many 
> willing people here, though, but it could be less work than building and 
> maintaining so many different jobs.

+1

> On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles  wrote:
>>
>> That's a good start. It is new enough and with few enough commits that I'd 
>> want to do some thorough experimentation. Our build is complex enough with a 
>> lot of ad hoc coding that we might end up maintaining whatever we choose...
>>
>> In my ideal scenario the list of "what else to test" would be manually 
>> editable, or even strictly opt-in. Automatically testing everything that 
>> might be affected quickly runs into scaling problems too. It could make 
>> sense in post-commit but less so in pre-commit.
>>
>> Kenn
>>
>> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette  wrote:
>>>
>>> > We could have one "test the things" Jenkins job if the underlying tool 
>>> > (Gradle) could resolve what needs to be run.
>>>
>>> I think this would be much better. Otherwise it seems our Jenkins 
>>> definitions are just duplicating information that's already stored in the 
>>> build.gradle files which seems error-prone, especially for tests validating 
>>> combinations of artifacts. I did some quick searching and came across [1]. 
>>> It doesn't look like the project has had a lot of recent activity, but it 
>>> claims to do what we need:
>>>
>>> > The plugin will generate new tasks on the root project for each task 
>>> > provided on the configuration with the following pattern 
>>> > ${taskName}ChangedModules.
>>> > These generated tasks will run the changedModules task to get the list of 
>>> > changed modules and for each one will call the given task.
>>>
>>> Of course this would only really help us with java tests as gradle doesn't 
>>> know much about the structure of dependencies within the python (and go?) 
>>> SDK.
>>>
>>> Brian
>>>
>>> [1] https://github.com/ismaeldivita/change-tracker-plugin
>>>
>>> On Wed, Jul 8, 2020 at 3:29 PM Kenneth Knowles  wrote:

 Hi all,

 I wanted to start a discussion about getting finer grained test execution 
 more focused on particular artifacts/modules. In particular, I want to 
 gather the downsides and impossibilities. So I will make a proposal that 
 people can disagree with easily.

 Context: job_PreCommit_Java is a monolithic job that...

  - takes 40-50 minutes
  - runs tests of maybe a bit under 100 modules
  - executes over 10k tests
  - runs on any change to model/, sdks/java/, runners/, examples/java/, 
 examples/kotlin/, release/ (only exception is SQL)
  - is pretty flaky (because it conflates so many independent test flakes, 
 mostly runners and IOs)

 See a scan at 
 https://scans.gradle.com/s/dnuo4o245d2fw/timeline?sort=longest

 Proposal: Eliminate monolithic job and break into finer-grained jobs that 
 operate on two principles:

 1. Test run should be focused on validating one artifact or a specific 
 integration of other artifacts.
 2. Test run should trigger only on things that could affect the validity 
 of that artifact.

 For example, a starting point is to separate:

  - core SDK
  - runner helper libs
  - each runner
  - each extension
  - each IO

 Benefits:

  - changing an IO or runner would not trigger the 20 minutes of core SDK 
 tests
  - changing a runner would not trigger the long IO local integration tests
  - changing the core SDK could potentially run fewer tests in presubmit; 
 or maybe it would run them all, but as separately reported results with a 
 clear flakiness signal

 There are 72 build.gradle files under sdks/java/ and 30 under runners/. 
 They don't all require a separate job, but there are still enough that 
 automation is worth it. Does anyone know what options we might have? It 
 does not even have to be in Jenkins. We could have one "test the things" 
 Jenkins job if the underlying tool (Gradle) could resolve what needs to be 
 run. Caching is not sufficient in 

Re: Request for Java PR review

2020-07-08 Thread Rui Wang
I haven't heard that this has changed, so I assume it's still only
committers who can see suggested reviewers; picking someone based on the
source code history is thus the feasible solution for non-committers.

One improvement, though: when picking someone from the history, you can
check [1] to pick a committer, so that committer can request reviewers on
the right-hand side.


[1]: https://projects.apache.org/committee.html?beam

-Rui

On Tue, Jul 7, 2020 at 2:42 PM Brian Hulette  wrote:

> One thing that's a bit annoying is non-committers can't see the
> suggestions on the right-hand side, and they also don't have permission to
> set a reviewer. At least that was the case before I was a committer, I
> can't confirm if it's still true.
>
> On Thu, Jul 2, 2020 at 11:04 AM Robert Bradshaw 
> wrote:
>
>> Thanks for your contributions. For future reference, picking a reviewer
>> by adding a comment R: @some-username (e.g. based on the source code
>> history of the files in question) or in the suggestions for reviewers on
>> the right hand side can help get things moving quicker.
>>
>> On Tue, Jun 23, 2020 at 5:44 PM Chamikara Jayalath 
>> wrote:
>>
>>> Thanks. I'm taking a look.
>>>
>>> On Tue, Jun 23, 2020 at 3:07 AM Niel Markwick  wrote:
>>>
 Hey devs...

 I have 3 PRs sitting waiting for a code review to fix potential bugs
 (and improve memory use) in SpannerIO. 2 small, and one quite large -- I
 would really like these to be in 2.23...

 https://github.com/apache/beam/pulls/nielm

 Would someone be willing to have a look?

 Thanks!

 --
 
 •  Niel Markwick
 •  Cloud Solutions Architect
 •  Google Belgium
 •  ni...@google.com
 •  +32 2 894 6771


 Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie. 
 RPR: 0878.065.378

 If you have received this communication by mistake, please don't
 forward it to anyone else (it may contain confidential or privileged
 information), please erase all copies of it, including all attachments, and
 please let the sender know it went to the wrong person. Thanks

>>>


Re: Request for Java PR review

2020-07-08 Thread Robert Bradshaw
Yeah, the fact that not everyone can see suggested reviewers is annoying.

Mostly I just wanted to call out that if you have a PR and haven't gotten
feedback on it, it's totally kosher to ask someone specifically to be a
reviewer, and this can often get the ball rolling quicker. (Pinging the
list is fine as well, especially when you're starting out.)

On Wed, Jul 8, 2020 at 4:59 PM Rui Wang  wrote:

> I haven't heard that this has changed, so I assume it's still only
> committers who can see suggested reviewers; picking someone based on the
> source code history is thus the feasible solution for non-committers.
>
> One improvement, though: when picking someone from the history, you can
> check [1] to pick a committer, so that committer can request reviewers on
> the right-hand side.
>
>
> [1]: https://projects.apache.org/committee.html?beam
>
> -Rui
>
> On Tue, Jul 7, 2020 at 2:42 PM Brian Hulette  wrote:
>
>> One thing that's a bit annoying is non-committers can't see the
>> suggestions on the right-hand side, and they also don't have permission to
>> set a reviewer. At least that was the case before I was a committer, I
>> can't confirm if it's still true.
>>
>> On Thu, Jul 2, 2020 at 11:04 AM Robert Bradshaw 
>> wrote:
>>
>>> Thanks for your contributions. For future reference, picking a reviewer
>>> by adding a comment R: @some-username (e.g. based on the source code
>>> history of the files in question) or in the suggestions for reviewers on
>>> the right hand side can help get things moving quicker.
>>>
>>> On Tue, Jun 23, 2020 at 5:44 PM Chamikara Jayalath 
>>> wrote:
>>>
 Thanks. I'm taking a look.

 On Tue, Jun 23, 2020 at 3:07 AM Niel Markwick  wrote:

> Hey devs...
>
> I have 3 PRs sitting waiting for a code review to fix potential bugs
> (and improve memory use) in SpannerIO. 2 small, and one quite large -- I
> would really like these to be in 2.23...
>
> https://github.com/apache/beam/pulls/nielm
>
> Would someone be willing to have a look?
>
> Thanks!
>
> --
> 
> •  Niel Markwick
> •  Cloud Solutions Architect
> •  Google Belgium
> •  ni...@google.com
> •  +32 2 894 6771
>
>
> Google Belgium NV/SA, Steenweg op Etterbeek 180, 1040 Brussel, Belgie. 
> RPR: 0878.065.378
>
> If you have received this communication by mistake, please don't
> forward it to anyone else (it may contain confidential or privileged
> information), please erase all copies of it, including all attachments, 
> and
> please let the sender know it went to the wrong person. Thanks
>



Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-08 Thread Robert Bradshaw
OK, I'm +0 on this change. Using the PTransform as an element is
probably better than duplicating the full API on another interface,
and I think it's worth getting this unblocked. This will require a Read2
if we have to add options in an upgrade-compatible way.
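
For readers skimming the thread, the shape being settled on looks roughly
like this sketch. The Kafka-flavored names and the ReadAll transform are
illustrative assumptions, not a settled API; the point is only that the
configured Read itself becomes the element, relying on its schema coder.

    // Sketch: build a Read, then feed it through a PCollection so sources
    // can be generated dynamically. Assumes the usual Beam imports and an
    // existing Pipeline named `pipeline`.
    KafkaIO.Read<Long, String> read =
        KafkaIO.<Long, String>read()
            .withBootstrapServers("broker:9092")  // hypothetical broker
            .withTopic("events");                 // hypothetical topic

    PCollection<KafkaIO.Read<Long, String>> reads =
        pipeline.apply(Create.of(read));          // encoded via the schema coder

    reads.apply(new ReadAll());                   // hypothetical ReadAll transform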

On Tue, Jul 7, 2020 at 3:19 PM Luke Cwik  wrote:
>
> Robert, you're correct in your understanding that the Read PTransform would 
> be encoded via the schema coder.
>
> Kenn, different serializers are ok as long as the output coder can 
> encode/decode the output type. Different watermark fns are also ok since it 
> is about computing the watermark for each individual source and won't impact 
> the watermark computed by other sources. Watermark advancement will still be 
> held back by the source that is furthest behind and still has the same 
> problems when a user chooses a watermark fn that was incompatible with the 
> windowing strategy for producing output (e.g. global window + default trigger 
> + streaming pipeline).
>
> Both are pretty close so if we started from scratch then it could go either 
> way but we aren't starting from scratch (I don't think a Beam 3.0 is likely 
> to happen in the next few years as there isn't enough stuff that we want to 
> remove vs the amount of stuff we would gain).
>
> On Tue, Jul 7, 2020 at 2:57 PM Kenneth Knowles  wrote:
>>
>> On Tue, Jul 7, 2020 at 2:24 PM Robert Bradshaw  wrote:
>>>
>>> On Tue, Jul 7, 2020 at 2:06 PM Luke Cwik  wrote:
>>> >
>>> > Robert, the intent is that the Read object would use a schema coder and 
>>> > for XLang purposes would be no different then a POJO.
>>>
>>> Just to clarify, you're saying that the Read PTransform would be
>>> encoded via the schema coder? That still feels a bit odd (and
>>> specificically if we were designing IO from scratch rather than
>>> adapting to what already exists would we choose to use PTransforms as
>>> elements?) but would solve the cross language issue.
>>
>>
>> I like this question. If we were designing from scratch, what would we do? 
>> Would we encourage users to feed Create.of(SourceDescriptor) into ReadAll? 
>> We would probably provide a friendly wrapper for reading one static thing, 
>> and call it Read. But it would probably have an API like 
>> Read.from(SourceDescriptor), thus eliminating duplicate documentation and 
>> boilerplate that Luke described while keeping the separation that Brian 
>> described and clarity around xlang environments. But I'm +0 on whatever has 
>> momentum. I think the main downside is the weirdness around 
>> serializers/watermarkFn/etc on Read. I am not sure how much this will cause 
>> users problems. It would be very ambitious of them to produce a 
>> PCollection where they had different fns per element...
>>
>> Kenn
>>
>>>
>>> > The issue of how to deal with closures applies to both equally and that 
>>> > is why I suggested to favor using data over closures. Once there is an 
>>> > implementation for how to deal with UDFs in an XLang world, this guidance 
>>> > can change.
>>> >
>>> > Kenn, I did mean specifying an enum that the XLang expansion service 
>>> > would return a serialized blob of code. The XLang expansion service is 
>>> > responsible for returning an environment that contains all the necessary 
>>> > dependencies to execute the transforms and the serialized blob of code 
>>> > and hence would be a non-issue for the caller.
>>> >
>>> > From reviewing the SDF Kafka PR, the reduction in maintenance is 
>>> > definitely there (100s of lines of duplicated boilerplate and 
>>> > documentation).
>>> >
>>> > What are the next steps to get a resolution on this?
>>> >
>>> > On Thu, Jul 2, 2020 at 10:38 AM Robert Bradshaw  
>>> > wrote:
>>> >>
>>> >> On Thu, Jul 2, 2020 at 10:26 AM Kenneth Knowles  wrote:
>>> >>>
>>> >>>
>>> >>> On Wed, Jul 1, 2020 at 4:17 PM Eugene Kirpichov  
>>> >>> wrote:
>>> 
>>>  Kenn - I don't mean an enum of common closures, I mean expressing 
>>>  closures in a restricted sub-language such as the language of SQL 
>>>  expressions.
>>> >>>
>>> >>>
>>> >>> My lack of clarity: enums was my phrasing of Luke's item 1). I 
>>> >>> understood what you meant. I think either a set of well-known closures 
>>> >>> or a tiny sublanguage could add value.
>>> >>>
>>> 
>>>  That would only work if there is a portable way to interpret SQL 
>>>  expressions, but if there isn't, maybe there should be - for the sake 
>>>  of, well, expressing closures portably. Of course these would be 
>>>  closures that only work with rows - but that seems powerful enough for 
>>>  many if not most purposes.
>>> >>>
>>> >>>
>>> >>> You can choose a SQL dialect or choose the tiniest subset just for this 
>>> >>> purpose and go with it. But when the data type going in or out of the 
>>> >>> lambda are e.g. some Java or Python object then what? One idea is to 
>>> >>> always require these to be rows. But if you can really get away with a 
>>> >>> dependency-free context-free lamb

Re: Finer-grained test runs?

2020-07-08 Thread Kenneth Knowles
I like your use of "ancestor" and "descendant". I will adopt it.

On Wed, Jul 8, 2020 at 4:53 PM Robert Bradshaw  wrote:

> On Wed, Jul 8, 2020 at 4:44 PM Luke Cwik  wrote:
> >
> > I'm not sure that breaking it up will be significantly faster since each
> module needs to build its ancestors and run tests of itself and all of its
> descendants, which isn't a trivial amount of work. We have only so many
> executors and with the increased number of jobs, won't we just be waiting
> for queued jobs to start?



> I think that depends on how many fewer tests we could run (or rerun)
> for the average PR. (It would also be nice if we could share build
> artifacts across executors (is there something like ccache for
> javac?), but maybe that's too far-fetched?)
>

Robert: The Gradle cache should remain valid across runs, I think... my
latest understanding was that it does robust up-to-date checks (i.e., not
like `make`). We may have messed this up, as I am not seeing as much caching as
I would expect nor as much as I see locally. We had to do some tweaking in
the maven days to put the .m2 directory outside of the realm wiped for each
new build. Maybe we are clobbering the Gradle cache too. That might
actually make most builds so fast we do not care about my proposal.
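
For reference, the mechanism in question is Gradle's build cache. A shared
remote cache, along the lines of this settings.gradle sketch, is the closest
analogue to ccache for javac; the endpoint and CI detection below are
illustrative assumptions.

    // settings.gradle sketch: local cache for dev machines plus a shared
    // remote HTTP cache so CI executors can reuse each other's outputs.
    boolean isCi = System.getenv('JENKINS_HOME') != null

    buildCache {
      local {
        enabled = true
      }
      remote(HttpBuildCache) {
        url = 'https://builds.example.org/gradle-cache/'  // hypothetical endpoint
        push = isCi  // only CI populates the cache; everyone else just reads
      }
    }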

Luke: I am not sure if you are replying to my email or to Brian's.

If Brian's: it does not result in redundant builds (if the plugin works),
since it would be one Gradle build process. But it does do a full build if you
touch something at the root of the ancestry tree like core SDK or model. I
would like to avoid automatically testing descendants if we can, since
things like Nexmark and most IOs are not sensitive to the vast majority of
model or core SDK changes. Runners are borderline.

If mine: you could assume my proposal is like Brian's but with fully
isolated Jenkins builds. This would be strictly worse, since it would add
redundant builds of ancestors. I am assuming that you always run a separate
Jenkins job for every descendant. Still, many modules have fewer
descendants. And they do not trigger all the way up to the root and down to
all descendants of the root.

From a community perspective, extensions and IOs are the most likely use
case for newcomers. For the person who comes to add or improve FooIO, it is
not a good experience to hit a flake in RabbitMqIO or JdbcIO or
DataflowRunner or FlinkRunner.

I think the plugin Brian mentioned is only a start. It would be even better
for each module to have an opt-in list of descendants to test on precommit.
This works well with a rollback-first strategy on post-commit. We can then
replay the PR while triggering the postcommits that failed.
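
That opt-in could be as small as a per-module property; the property and
variable names in this sketch are hypothetical.

    // In a module's build.gradle: the downstream suites worth running in
    // precommit when this module changes (strictly opt-in, not transitive).
    ext.precommitDependents = [':runners:direct-java', ':sdks:java:harness']

    // In the root build.gradle: expand the changed-module set using only
    // the explicit opt-ins. `changedModules` is assumed to be computed
    // elsewhere (e.g. from a git diff of changed files).
    def toTest = changedModules.collectMany { path ->
      def p = project(path)
      [path] + (p.ext.has('precommitDependents') ? p.ext.precommitDependents : [])
    }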

> I agree that we would have better visibility though in github and also in
> Jenkins.
>
> I do have to say having to scroll through a huge number of github
> checks is not always an improvement.
>

+1, but OTOH the Gradle scan is sometimes too fine-grained or associates
logs oddly (I almost always skip the Jenkins status page)


> > Fixing flaky tests would help improve our test signal as well. Not many
> willing people here, though, but it could be less work than building and
> maintaining so many different jobs.
>
> +1
>

I agree with fixing flakes, but I want to treat the occurrence and
resolution of flakiness as standard operations. Just as bug counts increase
continuously as a project grows, so will overall flakiness. Separating
flakiness signals will help to prioritize which flakes to address.

Kenn


> > On Wed, Jul 8, 2020 at 4:13 PM Kenneth Knowles  wrote:
> >>
> >> That's a good start. It is new enough and with few enough commits that
> I'd want to do some thorough experimentation. Our build is complex enough
> with a lot of ad hoc coding that we might end up maintaining whatever we
> choose...
> >>
> >> In my ideal scenario the list of "what else to test" would be manually
> editable, or even strictly opt-in. Automatically testing everything that
> might be affected quickly runs into scaling problems too. It could make
> sense in post-commit but less so in pre-commit.
> >>
> >> Kenn
> >>
> >> On Wed, Jul 8, 2020 at 3:50 PM Brian Hulette 
> wrote:
> >>>
> >>> > We could have one "test the things" Jenkins job if the underlying
> tool (Gradle) could resolve what needs to be run.
> >>>
> >>> I think this would be much better. Otherwise it seems our Jenkins
> definitions are just duplicating information that's already stored in the
> build.gradle files which seems error-prone, especially for tests validating
> combinations of artifacts. I did some quick searching and came across [1].
> It doesn't look like the project has had a lot of recent activity, but it
> claims to do what we need:
> >>>
> >>> > The plugin will generate new tasks on the root project for each task
> provided on the configuration with the following pattern
> ${taskName}ChangedModules.
> >>> > These generated tasks will run the changedModules task to get the
> list of changed modules and for each one