Re: SQS source

2018-07-18 Thread Lukasz Cwik
On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis 
wrote:

> I need an SQS source for my project that is using Beam. A brief search did
> not turn up any in-progress work in this area. Please point me to the right
> repo if I missed it.
>

To my knowledge there is none and nobody has marked it in progress on
https://beam.apache.org/documentation/io/built-in/. It would be good to
create a JIRA issue on https://issues.apache.org/ and send a PR to add SQS
to the in-progress list referencing your JIRA. I added you as a contributor
in JIRA so you should be able to assign yourself to any issues that you
create.


> Assuming there is no in-progress effort, I would like to contribute an
> Amazon SQS source. I have a few questions before I begin.
>

Great, note that this is a good starting point for authoring an IO
transform: https://beam.apache.org/documentation/io/authoring-overview/


>
> It seems that the current AWS code is split into two different modules:
> sdks/java/io/amazon-web-services, which contains the S3FileSystem,
> AwsOptions, etc., and sdks/java/io/kinesis, which contains an unbounded
> source based on a Kinesis stream. I'd like to add this source to the
> amazon-web-services module since I'd like to depend on AwsOptions. Does
> adding this source to the amazon-web-services module make sense?
>

Putting it inside of amazon-web-services makes a lot of sense. The Google
connectors all live within a single package, and there has been discussion
about consolidating all the AWS stuff under amazon-web-services.


> Also, the Kinesis source looks a touch more complex than other sources.
> Both the JMS and AMQP sources look like better examples to follow. Which
> existing source would be the best to model this contribution after?
>

Some of it has to do with how many ways a source can be read and how
complicated the watermark tracking is, but it would be best if the IO
authors comment on implementation details.
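
For a rough flavor of what the read path involves, here is a minimal sketch
of the polling loop an SQS reader would wrap, using the AWS SDK for Java v1
receive/delete calls; the class name and queue URL are placeholders, and any
eventual SqsIO API shape is hypothetical at this point:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsPollSketch {
  public static void main(String[] args) {
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";
    ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(10)
        .withWaitTimeSeconds(20); // long polling
    for (Message m : sqs.receiveMessage(request).getMessages()) {
      System.out.println(m.getBody());
      // An unbounded reader would emit the message downstream and delete it
      // only once the enclosing checkpoint is finalized.
      sqs.deleteMessage(queueUrl, m.getReceiptHandle());
    }
  }
}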


> If anyone has put some thoughts into this, or better yet some code, I'd
> appreciate hearing from you.
>
> Thanks!
>
>


SQS source

2018-07-18 Thread John Rudolf Lewis
I need an SQS source for my project that is using Beam. A brief search did
not turn up any in-progress work in this area. Please point me to the right
repo if I missed it.

Assuming there is no in-progress effort, I would like to contribute an
Amazon SQS source. I have a few questions before I begin.

It seems that the current AWS code is split into two different modules:
sdks/java/io/amazon-web-services, which contains the S3FileSystem,
AwsOptions, etc., and sdks/java/io/kinesis, which contains an unbounded
source based on a Kinesis stream. I'd like to add this source to the
amazon-web-services module since I'd like to depend on AwsOptions. Does
adding this source to the amazon-web-services module make sense?

Also, the Kinesis source looks a touch more complex than other sources.
Both the JMS and AMQP sources look like better examples to follow. Which
existing source would be the best to model this contribution after?

If anyone has put some thoughts into this, or better yet some code, I'd
appreciate hearing from you.

Thanks!


Re: Vendoring / Shading Protobuf and gRPC

2018-07-18 Thread Ankur Goenka
FYI: The Beam version just got updated today to 2.7-SNAPSHOT.
This requires re-adding the dependency jars for the 2.7-SNAPSHOT version to
the projects.

On Wed, Jul 18, 2018 at 1:22 PM Lukasz Cwik  wrote:

> Are you delegating your test runs to Gradle or using Intellij's test
> runner?
>
> I have been delegating my test runs to Gradle and that has been working
> well for me. The platform test runner fails due to the missing classes as
> you mentioned.
>
> I also spent two hours trying to muck around with the Gradle Idea plugin
> to get it to use the shaded jars instead of the output classes directly
> without much success. Reached out to the Gradle community forum[1] to see
> what they say.
>
> 1:
> https://discuss.gradle.org/t/how-to-get-intellij-to-use-module-output-jars-instead-of-output-classes/27794
>
> On Wed, Jul 18, 2018 at 11:06 AM Thomas Weise  wrote:
>
>> Thanks for fixing the duplicate root issue, PR is merged.
>>
>> I cannot run RemoteExecutionTest from the same commit (Intellij
>> v2018.1.6). Here are some of the class loading errors before I give up
>> adding dependencies manually :(
>>
>> java.lang.NoClassDefFoundError:
>> org/apache/beam/vendor/grpc/v1/io/grpc/BindableService
>> java.lang.NoClassDefFoundError: com/google/protobuf/ByteString
>> java.lang.NoClassDefFoundError:
>> org/apache/beam/vendor/sdk/v2/sdk/extensions/protobuf/ByteStringCoder
>> Caused by: java.lang.ClassNotFoundException: net.bytebuddy.NamingStrategy
>>
>> Our Intellij results are all over the board, possibly due to different
>> versions or other settings/plugins? It will be important to sort it out to
>> make the contributor experience smoother.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Jul 18, 2018 at 9:29 AM Lukasz Cwik  wrote:
>>
>>> Ismael, the SDK should perform the pipeline translation to proto because
>>> I expect the flow to be:
>>> User Code -> SDK -> Proto Translation -> Job API -> Runner
>>> I don't expect "runners" to live within the user's process anymore
>>> (excluding the direct runner). There will be one portable "runner" and it
>>> will be responsible for communicating with the job management APIs. It
>>> shouldn't be called a runner but for backwards compatibility it will behave
>>> like a runner does today. Flink/Spark/... will all live on the other side
>>> of the job management API.
>>>
>>> Thomas, I can run RemoteExecutionTest from commit
>>> ae2bebaf8b277e99840fa63f1b95d828f2093d16 without needing to modify the
>>> project/module structure in Intellij. Adding the jars manually only helps
>>> with code completion.
>>>
>>> https://github.com/apache/beam/pull/5977 works around the duplicate
>>> content root issue in Intellij. I have also run into the -Werror issue
>>> occasionally and don't know any fix or why it gets triggered as it doesn't
>>> happen to me all the time.
>>>
>>> On Tue, Jul 17, 2018 at 7:01 PM Thomas Weise  wrote:
>>>
 Thanks, the classpath order matters indeed.

 Still not able to run RemoteExecutionTest, but I was able to get the
 Flink portable test to work by adding the following to the *top* of
 the dependency list of *beam-runners-flink_2.11_test*


 vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
 model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar


 On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:

> Yes, I am able to run it.
>
> For tests, you also need to add dependencies to
> ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_*test*"
> module.
>
> Also, I only added
> :beam-model-job-management-2.6.0-SNAPSHOT.jar
> :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
> to the dependencies manually so not sure if you want to add
> io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1
> to the dependencies.
>
> Note, you need to move them up in the dependencies list.
>
>
> On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:
>
>> Are you able to
>> run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
>> within Intellij ?
>>
>> I can get the compile errors to disappear by adding
>> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
>> and com.google.protobuf:protobuf-java:3.5.1
>>
>> Running the test still fails since other dependencies are missing.
>>
>>
>> On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka 
>> wrote:
>>
>>> For reference:
>>> I was able to make IntelliJ work with the master branch by doing the
>>> following steps:
>>>
>>>1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
>>>intellij.
>>>2. Adding
>>>
>>> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>>and 
>>> 

Re: [BEAM-4814] Add client configuration to aws options #5983

2018-07-18 Thread Ismaël Mejía
Hello,

Since I have been changing options-related parts of S3 I can take a
look, so I will take the R. I will also probably ping Jacob Marble
(the author of S3FileSystem) for additional comments. Thanks for
bringing this contribution.


On Wed, Jul 18, 2018 at 10:43 PM John Rudolf Lewis  wrote:
>
> I just submitted a PR, my first for this project: 
> https://github.com/apache/beam/pull/5983
>
> It enables one to use the S3FileSystem from behind a firewall where you need
> an outbound proxy configured.
>
> I want to be able to TextIO.write().to("s3://mybucket/myfile.txt") but I 
> can't unless I configure a proxy.
>
> With this PR you can specify the proxy configuration either via command line:
>
> --clientConfiguration={"proxyHost":"hostname","proxyPort":1234,"proxyUsername":"username","proxyPassword":"password"}
>
> Or in code:
>
> PipelineOptions options = 
> PipelineOptionsFactory.fromArgs(args).withValidation().create();
> ClientConfiguration clientConfiguration = new ClientConfiguration();
> clientConfiguration.setProxyHost("hostname");
> clientConfiguration.setProxyPort(1234);
> clientConfiguration.setProxyUsername("username");
> clientConfiguration.setProxyPassword("password");
> options.as(AwsOptions.class).setClientConfiguration(clientConfiguration);
>
> The PR auto-selected jbonofre, lukecwik, and chamikaramj as reviewers when I
> created it. The contribution guide suggested that I ask here to see who else
> I should add as a reviewer.
>
> Please let me know who I should add as reviewers, or any other changes I 
> should make.
>
> Thanks!!


Re: Ordered PCollection

2018-07-18 Thread yifangchen
Thanks, Lukasz, that helps!

On 2018/07/18 19:44:50, Lukasz Cwik  wrote: 
> Apache Beam has no concept of an ordered PCollection. The most common
> solution is to use a combiner where you sort your values yourself using N
> dummy keys and then partition the output based upon the dummy key.
> Data -> PairWithNumberIn[0,N] -> Combine(sort values using custom combiner)
> -> PartitionByKey --> WriteForKey0
>                   \-> WriteForKey1
>                   ...
>                   \-> WriteForKeyN
> Note that if you want to write all the data to a single file, you'll have
> memory issues with your combiner and poor performance, since you'll have a
> single sorter and writer.
>
> There have been some previous discussions[1] about ordering with stricter
> constraints than general ordering that may apply to your use case and would
> be worthwhile to take a look at.
> 
> 1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering
> 
> On Wed, Jul 18, 2018 at 11:46 AM Allie Chen  wrote:
> 
> > Greetings!
> >
> > I have a quick question. Is there an OrderedPCollection concept in Python
> > SDK? Say I have a PCollection of objects that I am going to write to a
> > file, but I have to keep them in a certain order. Sorting them within one
> > worker is just too costly. Is there a more efficient way?
> >
> > Thanks for your help!
> >
> > Allie
> >
> 


[BEAM-4814] Add client configuration to aws options #5983

2018-07-18 Thread John Rudolf Lewis
I just submitted a PR, my first for this project:
https://github.com/apache/beam/pull/5983

It enables one to use the S3FileSystem from behind a firewall where you
need an outbound proxy configured.

I want to be able to TextIO.write().to("s3://mybucket/myfile.txt") but I
can't unless I configure a proxy.

With this PR you can specify the proxy configuration either via command
line:

--clientConfiguration={"proxyHost":"hostname","proxyPort":1234,"proxyUsername":"username","proxyPassword":"password"}

Or in code:

PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation().create();
ClientConfiguration clientConfiguration = new ClientConfiguration();
clientConfiguration.setProxyHost("hostname");
clientConfiguration.setProxyPort(1234);
clientConfiguration.setProxyUsername("username");
clientConfiguration.setProxyPassword("password");
options.as(AwsOptions.class).setClientConfiguration(clientConfiguration);
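
For context, a minimal sketch of a pipeline that would exercise the proxied
S3 filesystem once the options above are set, reusing the options variable
from the snippet above and assuming beam-sdks-java-io-amazon-web-services is
on the classpath (the bucket name is a placeholder):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;

Pipeline p = Pipeline.create(options);
p.apply(Create.of("hello", "world"))
    .apply(TextIO.write().to("s3://mybucket/myfile"));
p.run().waitUntilFinish();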

The PR auto-selected jbonofre, lukecwik, and chamikaramj as reviewers when
I created it. The contribution guide suggested that I ask here to see who
else I should add as a reviewer.

Please let me know who I should add as reviewers, or any other changes I
should make.

Thanks!!


Re: Vendoring / Shading Protobuf and gRPC

2018-07-18 Thread Lukasz Cwik
Are you delegating your test runs to Gradle or using Intellij's test runner?

I have been delegating my test runs to Gradle and that has been working
well for me. The platform test runner fails due to the missing classes as
you mentioned.
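
For anyone following along, delegating a single test class to Gradle from
the command line looks roughly like this (module and class names taken from
this thread; --tests is a standard Gradle test filter):

./gradlew :beam-runners-java-fn-execution:test --tests org.apache.beam.runners.fnexecution.control.RemoteExecutionTest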

I also spent two hours trying to muck around with the Gradle Idea plugin to
get it to use the shaded jars instead of the output classes directly
without much success. Reached out to the Gradle community forum[1] to see
what they say.

1:
https://discuss.gradle.org/t/how-to-get-intellij-to-use-module-output-jars-instead-of-output-classes/27794

On Wed, Jul 18, 2018 at 11:06 AM Thomas Weise  wrote:

> Thanks for fixing the duplicate root issue, PR is merged.
>
> I cannot run RemoteExecutionTest from the same commit (Intellij
> v2018.1.6). Here are some of the class loading errors before I give up
> adding dependencies manually :(
>
> java.lang.NoClassDefFoundError:
> org/apache/beam/vendor/grpc/v1/io/grpc/BindableService
> java.lang.NoClassDefFoundError: com/google/protobuf/ByteString
> java.lang.NoClassDefFoundError:
> org/apache/beam/vendor/sdk/v2/sdk/extensions/protobuf/ByteStringCoder
> Caused by: java.lang.ClassNotFoundException: net.bytebuddy.NamingStrategy
>
> Our Intellij results are all over the board, possibly due to different
> versions or other settings/plugins? It will be important to sort it out to
> make the contributor experience smoother.
>
> Thanks,
> Thomas
>
>
> On Wed, Jul 18, 2018 at 9:29 AM Lukasz Cwik  wrote:
>
>> Ismael, the SDK should perform the pipeline translation to proto because
>> I expect the flow to be:
>> User Code -> SDK -> Proto Translation -> Job API -> Runner
>> I don't expect "runners" to live within the user's process anymore
>> (excluding the direct runner). There will be one portable "runner" and it
>> will be responsible for communicating with the job management APIs. It
>> shouldn't be called a runner but for backwards compatibility it will behave
>> like a runner does today. Flink/Spark/... will all live on the other side
>> of the job management API.
>>
>> Thomas, I can run RemoteExecutionTest from commit
>> ae2bebaf8b277e99840fa63f1b95d828f2093d16 without needing to modify the
>> project/module structure in Intellij. Adding the jars manually only helps
>> with code completion.
>>
>> https://github.com/apache/beam/pull/5977 works around the duplicate
>> content root issue in Intellij. I have also run into the -Werror issue
>> occasionally and don't know any fix or why it gets triggered as it doesn't
>> happen to me all the time.
>>
>> On Tue, Jul 17, 2018 at 7:01 PM Thomas Weise  wrote:
>>
>>> Thanks, the classpath order matters indeed.
>>>
>>> Still not able to run RemoteExecutionTest, but I was able to get the
>>> Flink portable test to work by adding the following to the *top* of the
>>> dependency list of *beam-runners-flink_2.11_test*
>>>
>>>
>>> vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
>>> model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>>
>>>
>>> On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:
>>>
 Yes, I am able to run it.

 For tests, you also need to add dependencies to
 ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_*test*"
 module.

 Also, I only added
 :beam-model-job-management-2.6.0-SNAPSHOT.jar
 :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
 to the dependencies manually so not sure if you want to add
 io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1 to
 the dependencies.

 Note, you need to move them up in the dependencies list.


 On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:

> Are you able to
> run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
> within Intellij ?
>
> I can get the compile errors to disappear by adding
> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
> and com.google.protobuf:protobuf-java:3.5.1
>
> Running the test still fails since other dependencies are missing.
>
>
> On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka 
> wrote:
>
>> For reference:
>> I was able to make IntelliJ work with the master branch by doing the
>> following steps:
>>
>>1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
>>intellij.
>>2. Adding
>>
>> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>and 
>> :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
>>to the appropriate modules at the top of the dependency list.
>>
>>
>> On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:
>>
>>> Adding the external jar in Intellij (2018.1) currently fails due to
>>> a duplicate source directory 
>>> (sdks/java/extensions/protobuf/src/main/java).
>>>

Re: Ordered PCollection

2018-07-18 Thread Lukasz Cwik
Apache Beam has no concept of an ordered PCollection. The most common
solution is to use a combiner where you sort your values yourself using N
dummy keys and then partition the output based upon the dummy key.
Data -> PairWithNumberIn[0,N] -> Combine(sort values using custom combiner)
-> PartitionByKey --> WriteForKey0
                  \-> WriteForKey1
                  ...
                  \-> WriteForKeyN
Note that if you want to write all the data to a single file, you'll have
memory issues with your combiner and poor performance, since you'll have a
single sorter and writer.
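
To make the shape of that pipeline concrete, here is a compressed Java
sketch of the pattern, assuming a small fixed N and a placeholder key
function; a faithful version would pick range keys that preserve the global
order and Partition by key to write each range separately:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RangeSortSketch {
  public static void main(String[] args) {
    int n = 4; // N dummy keys, i.e. output shards
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(Create.of("pear", "apple", "kiwi", "fig"))
        // PairWithNumberIn[0,N]: route each element to one of N dummy keys.
        .apply(WithKeys.of((String s) -> Math.floorMod(s.hashCode(), n))
            .withKeyType(TypeDescriptors.integers()))
        .apply(GroupByKey.create())
        // Sort each key's values; this holds one key's values in memory.
        .apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            List<String> values = new ArrayList<>();
            c.element().getValue().forEach(values::add);
            Collections.sort(values);
            values.forEach(c::output);
          }
        }))
        .apply(TextIO.write().to("sorted").withNumShards(n));
    p.run().waitUntilFinish();
  }
}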

There have been some previous discussions[1] about ordering with stricter
constraints than general ordering that may apply to your use case and would
be worthwhile to take a look at.

1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering

On Wed, Jul 18, 2018 at 11:46 AM Allie Chen  wrote:

> Greetings!
>
> I have a quick question. Is there an OrderedPCollection concept in Python
> SDK? Say I have a PCollection of objects that I am going to write to a
> file, but I have to keep them in a certain order. Sorting them within one
> worker is just too costly. Is there a more efficient way?
>
> Thanks for your help!
>
> Allie
>


Jenkins does not catch failed tests in Master Branch

2018-07-18 Thread Rui Wang
Hi,

I encountered test failures when I ran "./gradlew
:beam-sdks-java-extensions-sql-jdbc:test" on the master branch on Mac. What
was consistently failing were the tests in the JdbcJarTest class. However,
it seems the Beam master branch is not troubled by this, and pending PRs are
getting green Java checks.

In the meantime, I noticed one of my open PRs (
https://github.com/apache/beam/pull/5969) actually made the unit tests in
JdbcJarTest run and showed the expected test failures. If you check the PR,
you can see that I added a few dependencies to the sql-jdbc module in the
Gradle file.

My guess now is that my change to the Gradle file invalidated the Gradle
cache and made Gradle run those unit tests. Those tests do not seem to run
in each "run java precommit". Does anyone know why this is happening?
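
One quick way to test the cache hypothesis, if someone wants to reproduce,
is to force Gradle to re-run the task regardless of its up-to-date checks
(--rerun-tasks is a standard Gradle flag):

./gradlew :beam-sdks-java-extensions-sql-jdbc:test --rerun-tasks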


-Rui


Ordered PCollection

2018-07-18 Thread Allie Chen
Greetings!

I have a quick question. Is there an OrderedPCollection concept in Python
SDK? Say I have a PCollection of objects that I am going to write to a
file, but I have to keep them in a certain order. Sorting them within one
worker is just too costly. Is there a more efficient way?

Thanks for your help!

Allie


Dependency Ownership Followup

2018-07-18 Thread Yifan Zou
Hi,

Thanks all for signing up on the spreadsheet and taking ownership of Beam
dependencies. We integrated the owner information into the Beam code base
via some yaml files (#5964). Those files will be used in our new Auto-JIRA
tool (in progress) for bug assignments.

Some dependencies still don't have owners. You are always encouraged to
sign up with your JIRA username to take ownership and help the community
maintain the dependencies. In addition, if new dependencies are introduced
into Beam, please update the corresponding dependency owners files.

Dependency owners file: https://github.com/apache/beam/tree/master/ownership

Thanks again.

Regards.
Yifan


Re: Vendoring / Shading Protobuf and gRPC

2018-07-18 Thread Thomas Weise
Thanks for fixing the duplicate root issue, PR is merged.

I cannot run RemoteExecutionTest from the same commit (Intellij v2018.1.6).
Here are some of the class loading errors before I give up adding
dependencies manually :(

java.lang.NoClassDefFoundError:
org/apache/beam/vendor/grpc/v1/io/grpc/BindableService
java.lang.NoClassDefFoundError: com/google/protobuf/ByteString
java.lang.NoClassDefFoundError:
org/apache/beam/vendor/sdk/v2/sdk/extensions/protobuf/ByteStringCoder
Caused by: java.lang.ClassNotFoundException: net.bytebuddy.NamingStrategy

Our Intellij results are all over the board, possibly due to different
versions or other settings/plugins? It will be important to sort it out to
make the contributor experience smoother.

Thanks,
Thomas


On Wed, Jul 18, 2018 at 9:29 AM Lukasz Cwik  wrote:

> Ismael, the SDK should perform the pipeline translation to proto because I
> expect the flow to be:
> User Code -> SDK -> Proto Translation -> Job API -> Runner
> I don't expect "runners" to live within the user's process anymore
> (excluding the direct runner). There will be one portable "runner" and it
> will be responsible for communicating with the job management APIs. It
> shouldn't be called a runner but for backwards compatibility it will behave
> like a runner does today. Flink/Spark/... will all live on the other side
> of the job management API.
>
> Thomas, I can run RemoteExecutionTest from commit
> ae2bebaf8b277e99840fa63f1b95d828f2093d16 without needing to modify the
> project/module structure in Intellij. Adding the jars manually only helps
> with code completion.
>
> https://github.com/apache/beam/pull/5977 works around the duplicate
> content root issue in Intellij. I have also run into the -Werror issue
> occasionally and don't know any fix or why it gets triggered as it doesn't
> happen to me all the time.
>
> On Tue, Jul 17, 2018 at 7:01 PM Thomas Weise  wrote:
>
>> Thanks, the classpath order matters indeed.
>>
>> Still not able to run RemoteExecutionTest, but I was able to get the
>> Flink portable test to work by adding the following to the *top* of the
>> dependency list of *beam-runners-flink_2.11_test*
>>
>>
>> vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
>> model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>
>>
>> On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:
>>
>>> Yes, I am able to run it.
>>>
>>> For tests, you also need to add dependencies to
>>> ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_*test*"
>>> module.
>>>
>>> Also, I only added
>>> :beam-model-job-management-2.6.0-SNAPSHOT.jar
>>> :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>> to the dependencies manually so not sure if you want to add
>>> io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1 to
>>> the dependencies.
>>>
>>> Note, you need to move them up in the dependencies list.
>>>
>>>
>>> On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:
>>>
 Are you able to
 run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
 within Intellij ?

 I can get the compile errors to disappear by adding
 beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
 and com.google.protobuf:protobuf-java:3.5.1

 Running the test still fails since other dependencies are missing.


 On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka  wrote:

> For reference:
> I was able to make IntelliJ work with the master branch by doing the
> following steps:
>
>1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
>intellij.
>2. Adding
>
> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>and 
> :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
>to the appropriate modules at the top of the dependency list.
>
>
> On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:
>
>> Adding the external jar in Intellij (2018.1) currently fails due to a
>> duplicate source directory (sdks/java/extensions/protobuf/src/main/java).
>>
>> The build as such also fails, with:  error: warnings found and
>> -Werror specified
>>
>> Ismaël found removing
>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
>> as workaround.
>>
>>
>> On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía 
>> wrote:
>>
>>> Seems reasonable, but why exactly might we need the model (or
>>> protobuf-related things) in the SDK in the future? Wasn't it supposed to
>>> be translated into the Pipeline proto representation via the runners (and
>>> in this case the dep resides on the runner side)?
>>> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik 
>>> wrote:
>>> >
>>> > Got a fix[1] for Andrew's issue 

Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-18 Thread Pablo Estrada
Hello all!
I've cut the release branch (release-2.6.0), with some help from Ahmet and
Boyuan. From now on, please cherry-pick 2.6.0 blockers into the branch.
Now we start stabilizing it.

Thanks!

-P.

On Tue, Jul 17, 2018 at 9:34 PM Jean-Baptiste Onofré 
wrote:

> Hi Pablo,
>
> I'm investigating this issue, but it's a bit of a long process.
>
> So, I propose you start with the release process, cutting the branch,
> and then, I will create a cherry-pick PR for this one.
>
> Regards
> JB
>
> On 17/07/2018 20:19, Pablo Estrada wrote:
> > Checking once more:
> > What does the community think we should do
> > about https://issues.apache.org/jira/browse/BEAM-4750? Should I bump it
> > to 2.7.0?
> > Best
> > -P.
> >
> > On Fri, Jul 13, 2018 at 5:15 PM Ahmet Altay wrote:
> >
> > Update:  https://issues.apache.org/jira/browse/BEAM-4784 is not a
> > release blocker, details in the JIRA issue.
> >
> > On Fri, Jul 13, 2018 at 11:12 AM, Thomas Weise wrote:
> >
> > Can one of our Python experts please take a look
> > at https://issues.apache.org/jira/browse/BEAM-4784 and advise if
> > this should be addressed for the release?
> >
> > Thanks,
> > Thomas
> >
> >
> > On Fri, Jul 13, 2018 at 11:02 AM Ahmet Altay wrote:
> >
> >
> >
> > On Fri, Jul 13, 2018 at 10:48 AM, Pablo Estrada wrote:
> >
> > Hi all,
> > I've triaged most issues marked for 2.6.0 release. I've
> > localized two that need a decision / attention:
> >
> > - https://issues.apache.org/jira/browse/BEAM-4417 -
> > Bigquery IO Numeric Datatype Support. Cham is not
> > available to fix this at the moment, but this is a
> > critical issue. Is anyone able to tackle this / should
> > we bump this to next release?
> >
> >
> > I bumped this to the next release. I think Cham will be the
> > best person to address it when he is back. And with the
> > regular release cadence, it would not be delayed by much.
> >
> >
> >
> > - https://issues.apache.org/jira/browse/BEAM-4750 -
> > Performance degradation due to some safeguards in
> > beam-sdks-java-core. JB, are you looking to fix this?
> > Should we bump? I had the impression that it was an easy
> > fix, but I'm not sure.
> >
> > If you're aware of any other issue that needs to be
> > included as a release blocker, please report it to me.
> > Best
> > -P.
> >
> > On Thu, Jul 12, 2018 at 2:15 AM Etienne Chauchot wrote:
> >
> > +1,
> >
> > Thanks for volunteering Pablo, thanks also to have
> > caught tickets that I forgot to close :)
> >
> > Etienne
> >
> > On Wednesday, July 11, 2018 at 12:55 -0700, Alan
> > Myrvold wrote:
> >> +1 Thanks for volunteering, Pablo
> >>
> >> On Wed, Jul 11, 2018 at 11:49 AM Jason Kuster wrote:
> >>> +1 sounds great
> >>>
> >>> On Wed, Jul 11, 2018 at 11:06 AM Thomas Weise wrote:
>  +1
> 
>  Thanks for volunteering, Pablo!
> 
>  On Mon, Jul 9, 2018 at 9:56 PM Jean-Baptiste Onofré wrote:
> > +1
> >
> > I planned to send the proposal as well ;)
> >
> > Regards
> > JB
> >
> > On 09/07/2018 23:16, Pablo Estrada wrote:
> > > Hello everyone!
> > >
> > > As per the previously agreed-upon schedule
> > for Beam releases, the
> > > process for the 2.6.0 Beam release should
> > start on July 17th.
> > >
> > > I volunteer to perform this release.
> > >
> > > Here is the schedule that I have in mind:
> > >
> > > - We start triaging JIRA issues this week.
> > > - I will cut a release branch on July 17.
> > > - After July 17, any blockers will need to be
> > 

Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-18 Thread Anton Kedin
These dashboards look great!

Can we publish the links to the dashboards somewhere, for better visibility?
E.g. on the Jenkins website / emails, or the wiki.

Regards,
Anton

On Wed, Jul 18, 2018 at 10:08 AM Andrew Pilloud  wrote:

> Hi Etienne,
>
> I've been asking around and it sounds like we should be able to get a
> dedicated Jenkins node for performance tests. Another thing that might help
> is making the runs a few times longer. They are currently running around 2
> seconds each, so the total time of the build probably exceeds testing.
> Internally at Google we are running them with 2000x as many events on
> Dataflow, but a job of that size won't even complete on the Direct Runner.
>
> I didn't see the query 3 issues, but now that you point it out it looks
> like a bug to me too.
>
> Andrew
>
> On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot 
> wrote:
>
>> Hi Andrew,
>>
>> Yes I saw that, except dedicating jenkins nodes to nexmark, I see no
>> other way.
>>
>> Also, did you see query 3 output size on direct runner? Should be a
>> straight line and it is not; I'm wondering if there is a problem with the
>> state and timers impl in the direct runner.
>>
>> Etienne
>>
>> On Tuesday, July 17, 2018 at 11:38 -0700, Andrew Pilloud wrote:
>>
>> I'm noticing the graphs are really noisy. It looks like we are running
>> these on shared Jenkins executors, so our perf tests are fighting with
>> other builds for CPU. I've opened an issue
>> https://issues.apache.org/jira/browse/BEAM-4804 and am wondering if
>> anyone knows an easy fix to isolate these jobs.
>>
>> Andrew
>>
>> On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  wrote:
>>
>> @Etienne: Nice to see the graphs! :)
>>
>> @Ismael: Good idea, there's no document yet. I think we could create a
>> small google doc with instructions on how to do this.
>>
>> On Fri, Jul 13, 2018 at 10:46, Etienne Chauchot
>> wrote:
>>
>> Hi,
>>
>> @Andrew, this is because I did not find a way to set 2 scales on the Y
>> axis on the perfkit graphs. Indeed, numResults varies from 1 to 100,000 and
>> runtimeSec is usually below 10s.
>>
>> Etienne
>>
>> On Thursday, July 12, 2018 at 12:04 -0700, Andrew Pilloud wrote:
>>
>> This is great, should make performance work much easier! I'm going to get
>> the Beam SQL Nexmark jobs publishing as well. (Opened
>> https://issues.apache.org/jira/browse/BEAM-4774 to track.) I might take
>> on the Dataflow runner as well if no one else volunteers.
>>
>> I am curious as to why you have two separate graphs for runtime and count
>> rather than graphing runtime/count to get the throughput rate for each run?
>> Or should that be a third graph? Looks like it would just be a small tweak
>> to the query in perfkit.
>>
>>
>>
>> Andrew
>>
>> On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada 
>> wrote:
>>
>> This is really cool Etienne : ) thanks for working on this.
>> Out of curiosity, do you know how often the tests run on each runner?
>>
>> Best
>> -P.
>>
>> On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
>> wrote:
>>
>> Awesome Etienne, this is really important for the (user) community to
>> have that visibility since it is one of the most important aspects of
>> Beam's quality, kudos!
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>>
>> On Thu, Jul 12, 2018 at 10:59, Jean-Baptiste Onofré
>> wrote:
>>
>> It's really great to have these dashboards and integration in Jenkins !
>>
>> Thanks Etienne for driving this !
>>
>> Regards
>> JB
>>
>> On 11/07/2018 15:13, Etienne Chauchot wrote:
>> >
>> > Hi guys,
>> >
>> > I'm glad to announce that the CI of Beam has much improved! Indeed
>> > Nexmark is now included in the perfkit dashboards.
>> >
>> > At each commit on master, nexmark suites are run and plots are created
>> > on the graphs.
>> >
>> > I've created 2 kinds of dashboards:
>> > - one for performances (run times of the queries)
>> > - one for the size of the output PCollection (which should be constant)
>> >
>> > There are dashboards for these runners:
>> > - spark
>> > - flink
>> > - direct runner
>> >
>> > Each dashboard contains:
>> > - graphs in batch mode
>> > - graphs in streaming mode
>> > - graphs for the 13 queries.
>> >
>> > That gives more than a hundred graphs (my right finger hurts after so
>> > many clicks on the mouse :) ). It is broken down this much so that anyone
>> > can focus on the area they are interested in.
>> > Feel free to also create new dashboards with more aggregated data.
>> >
>> > Thanks to Lukasz and Cham for reviewing my PRs and showing how to use
>> > perfkit dashboards.
>> >
>> > Dashboards are there:
>> >
>> >
>> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424
>> >
>> 

Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-18 Thread Andrew Pilloud
Hi Etienne,

I've been asking around and it sounds like we should be able to get a
dedicated Jenkins node for performance tests. Another thing that might help
is making the runs a few times longer. They are currently running around 2
seconds each, so the build overhead probably exceeds the time spent testing.
Internally at Google we are running them with 2000x as many events on
Dataflow, but a job of that size won't even complete on the Direct Runner.

I didn't see the query 3 issues, but now that you point it out it looks
like a bug to me too.

Andrew

On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot 
wrote:

> Hi Andrew,
>
> Yes I saw that, except dedicating jenkins nodes to nexmark, I see no other
> way.
>
> Also, did you see query 3 output size on direct runner? Should be a
> straight line and it is not; I'm wondering if there is a problem with the
> state and timers impl in the direct runner.
>
> Etienne
>
> On Tuesday, July 17, 2018 at 11:38 -0700, Andrew Pilloud wrote:
>
> I'm noticing the graphs are really noisy. It looks like we are running
> these on shared Jenkins executors, so our perf tests are fighting with
> other builds for CPU. I've opened an issue
> https://issues.apache.org/jira/browse/BEAM-4804 and am wondering if
> anyone knows an easy fix to isolate these jobs.
>
> Andrew
>
> On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  wrote:
>
> @Etienne: Nice to see the graphs! :)
>
> @Ismael: Good idea, there's no document yet. I think we could create a
> small google doc with instructions on how to do this.
>
> On Fri, Jul 13, 2018 at 10:46, Etienne Chauchot
> wrote:
>
> Hi,
>
> @Andrew, this is because I did not find a way to set 2 scales on the Y
> axis on the perfkit graphs. Indeed, numResults varies from 1 to 100,000 and
> runtimeSec is usually below 10s.
>
> Etienne
>
> On Thursday, July 12, 2018 at 12:04 -0700, Andrew Pilloud wrote:
>
> This is great, should make performance work much easier! I'm going to get
> the Beam SQL Nexmark jobs publishing as well. (Opened
> https://issues.apache.org/jira/browse/BEAM-4774 to track.) I might take
> on the Dataflow runner as well if no one else volunteers.
>
> I am curious as to why you have two separate graphs for runtime and count
> rather than graphing runtime/count to get the throughput rate for each run?
> Or should that be a third graph? Looks like it would just be a small tweak
> to the query in perfkit.
>
>
>
> Andrew
>
> On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada  wrote:
>
> This is really cool Etienne : ) thanks for working on this.
> Out of curiosity, do you know how often the tests run on each runner?
>
> Best
> -P.
>
> On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> wrote:
>
> Awesome Etienne, this is really important for the (user) community to have
> that visibility since it is one of the most important aspects of Beam's
> quality, kudos!
>
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
>
>
> On Thu, Jul 12, 2018 at 10:59, Jean-Baptiste Onofré
> wrote:
>
> It's really great to have these dashboards and integration in Jenkins !
>
> Thanks Etienne for driving this !
>
> Regards
> JB
>
> On 11/07/2018 15:13, Etienne Chauchot wrote:
> >
> > Hi guys,
> >
> > I'm glad to announce that the CI of Beam has much improved! Indeed
> > Nexmark is now included in the perfkit dashboards.
> >
> > At each commit on master, nexmark suites are run and plots are created
> > on the graphs.
> >
> > I've created 2 kinds of dashboards:
> > - one for performances (run times of the queries)
> > - one for the size of the output PCollection (which should be constant)
> >
> > There are dashboards for these runners:
> > - spark
> > - flink
> > - direct runner
> >
> > Each dashboard contains:
> > - graphs in batch mode
> > - graphs in streaming mode
> > - graphs for the 13 queries.
> >
> > That gives more than a hundred graphs (my right finger hurts after so
> > many clicks on the mouse :) ). It is broken down this much so that anyone
> > can focus on the area they are interested in.
> > Feel free to also create new dashboards with more aggregated data.
> >
> > Thanks to Lukasz and Cham for reviewing my PRs and showing how to use
> > perfkit dashboards.
> >
> > Dashboards are there:
> >
> >
> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424
> >
> https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384
> > <
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> >
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
> >
> >
> https://apache-beam-testing.appspot.com/explore?dashboard=5099379773931520
> >
> 

Re: Live coding & reviewing adventures

2018-07-18 Thread Holden Karau
OK, so as a follow-up I'll be doing part 2 today at noon Pacific -
https://www.youtube.com/watch?v=6krU3YWsgYQ . If you're at OSCON, come see
the talk with the demo (and other things) at 2:30 pm Pacific in Portland 251 -
https://conferences.oreilly.com/oscon/oscon-or/public/schedule/speaker/128567

As for the venv reqs:

absl-py==0.2.2
apache-beam==2.6.0.dev0
astor==0.7.1
avro==1.8.2
backports-abc==0.5
backports.shutil-get-terminal-size==1.0.0
backports.weakref==1.0.post1
bleach==2.1.3
cachetools==2.1.0
certifi==2018.4.16
chardet==3.0.4
-e git+
https://github.com/holdenk/model-analysis.git@2cee83428d4db58fbba987f3e268114f3ac0e694#egg=chicago_taxi_setup=examples/chicago_taxi
configparser==3.5.0
crcmod==1.7
decorator==4.3.0
dill==0.2.6
docopt==0.6.2
entrypoints==0.2.3
enum34==1.1.6
fastavro==0.19.7
fasteners==0.14.1
funcsigs==1.0.2
functools32==3.2.3.post2
future==0.16.0
futures==3.2.0
gapic-google-cloud-pubsub-v1==0.15.4
gast==0.2.0
google-api-core==1.2.1
google-apitools==0.5.20
google-auth==1.5.0
google-auth-httplib2==0.0.3
google-cloud-bigquery==0.25.0
google-cloud-core==0.25.0
google-cloud-pubsub==0.26.0
google-gax==0.15.16
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
googledatastore==7.0.1
grpc-google-iam-v1==0.11.1
grpcio==1.13.0
hdfs==2.1.0
html5lib==0.999
httplib2==0.9.2
idna==2.7
ipykernel==4.8.2
ipython==5.7.0
ipython-genutils==0.2.0
ipywidgets==7.2.1
Jinja2==2.10
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-core==4.4.0
Markdown==2.6.11
MarkupSafe==1.0
mistune==0.8.3
mock==2.0.0
monotonic==1.5
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.5.0
numpy==1.14.5
oauth2client==4.1.2
pandocfilters==1.4.2
pathlib2==2.3.2
pbr==4.1.0
pexpect==4.6.0
pickleshare==0.7.4
pkg-resources==0.0.0
ply==3.8
prompt-toolkit==1.0.15
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-pubsub-v1==0.15.4
protobuf==3.6.0
ptyprocess==0.6.0
pyasn1==0.4.3
pyasn1-modules==0.2.2
Pygments==2.2.0
python-dateutil==2.7.3
pytz==2018.4
PyVCF==0.6.8
PyYAML==3.13
pyzmq==17.0.0
qtconsole==4.3.1
requests==2.19.1
rsa==3.4.2
scandir==1.7
Send2Trash==1.5.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
tensorboard==1.6.0
tensorflow==1.6.0
tensorflow-model-analysis==0.6.0
tensorflow-serving-api==1.6.0
tensorflow-transform==0.6.0
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
tornado==5.0.2
traitlets==4.3.2
typing==3.6.4
urllib3==1.23
wcwidth==0.1.7
Werkzeug==0.14.1
widgetsnbextension==3.2.1
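
If someone wants to reproduce this environment, one way (assuming the list
above is saved as requirements.txt; the Python SDK was Python 2 only at this
point, and apache-beam==2.6.0.dev0 is a local dev build rather than a PyPI
release) is:

virtualenv -p python2.7 tft-demo-env
. tft-demo-env/bin/activate
pip install -r requirements.txt   # the dev apache-beam needs a local install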



On Wed, Jul 18, 2018 at 8:19 AM, Holden Karau  wrote:

> That’s a thing I’ve been thinking about but haven’t had the time to do
> yet. It’s a bit tricky because I don’t always know what I’m doing before I
> start and remembering to go back and tag things after a long stream is hard.
>
> On Tue, Jul 17, 2018 at 11:11 PM Ismaël Mejía  wrote:
>
>> Have you thought about creating some sort of index page for your past
>> live streams?
>> At least for the non-review ones it can provide great value given that
>> searching videos is not the easiest thing to do.
>> On Wed, Jul 18, 2018 at 12:51 AM Holden Karau 
>> wrote:
>> >
>> > Sure! I’ll respond with a pip freeze when I land.
>
> also oops, forgot to do that, I’ll do it today.
>
>>
>> >
>> > On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi 
>> wrote:
>> >>
>> >> Could you publish the Python transitive deps some place that has the
>> Beam-Flink runner working?
>> >>
>> >> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau 
>> wrote:
>> >>>
>> >>> And I've got an hour to kill @ SFO today, so at the suggestion of some
>> folks I'm going to do a more user-focused one, trying to get the TFT demo
>> to work with the portable Flink runner (hopefully) -
>> https://www.youtube.com/watch?v=wL9mvQeN36E
>> >>>
>> >>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>> 
>>  Hi folks! I've been doing some live coding in my other projects and
>> I figured I'd do some with Apache Beam as well.
>> 
>>  Today @ 3pm Pacific I'm going to be doing some impromptu exploration of
>> better review tooling possibilities (looking at forking spark-pr-dashboard
>> for other projects like beam and setting up mentionbot to work with ASF
>> infra) - https://www.youtube.com/watch?v=ff8_jbzC8JI
>> 
>>  Next week (Thursday the 19th at 2pm pacific) I'm going to be working
>> on trying to get easier dependency management for the Python portable
>> runner in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>> 
>>  If you're interested in seeing more of the development process, I hope
>> you will join me :)
>> 
>>  P.S.
>> 
>>  You can also follow on twitch which does a better job of
>> notifications https://www.twitch.tv/holdenkarau
>> 
>>  Also, one of the other things I do is "live reviews" of PRs, but they
>> are generally opt-in and I don't have enough opt-ins from the Beam
>> community to do live reviews in Beam; if you work on Beam and would be OK
>> with me doing a live-streamed review of your PRs, let me know (if your

Re: Vendoring / Shading Protobuf and gRPC

2018-07-18 Thread Lukasz Cwik
Ismael, the SDK should perform the pipeline translation to proto because I
expect the flow to be:
User Code -> SDK -> Proto Translation -> Job API -> Runner
I don't expect "runners" to live within the user's process anymore
(excluding the direct runner). There will be one portable "runner" and it
will be responsible for communicating with the job management APIs. It
shouldn't be called a runner but for backwards compatibility it will behave
like a runner does today. Flink/Spark/... will all live on the other side
of the job management API.
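
For the curious, the translation step in that flow is roughly this call,
sketched from memory against the 2018-era beam-runners-core-construction-java
API (pipeline being the org.apache.beam.sdk.Pipeline built by user code):

import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.core.construction.PipelineTranslation;

RunnerApi.Pipeline protoPipeline = PipelineTranslation.toProto(pipeline);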

Thomas, I can run RemoteExecutionTest from commit
ae2bebaf8b277e99840fa63f1b95d828f2093d16 without needing to modify the
project/module structure in Intellij. Adding the jars manually only helps
with code completion.

https://github.com/apache/beam/pull/5977 works around the duplicate content
root issue in Intellij. I have also run into the -Werror issue occasionally
and don't know any fix or why it gets triggered as it doesn't happen to me
all the time.

On Tue, Jul 17, 2018 at 7:01 PM Thomas Weise  wrote:

> Thanks, the classpath order matters indeed.
>
> Still not able to run RemoteExecutionTest, but I was able to get the Flink
> portable test to work by adding the following to the *top* of the
> dependency list of *beam-runners-flink_2.11_test*
>
>
> vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
> model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>
>
> On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:
>
>> Yes, I am able to run it.
>>
>> For tests, you also need to add dependencies to
>> ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_*test*"
>> module.
>>
>> Also, I only added
>> :beam-model-job-management-2.6.0-SNAPSHOT.jar
>> :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>> to the dependencies manually so not sure if you want to add
>> io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1 to
>> the dependencies.
>>
>> Note, you need to move them up in the dependencies list.
>>
>>
>> On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:
>>
>>> Are you able to
>>> run org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from
>>> within Intellij ?
>>>
>>> I can get the compile errors to disappear by adding
>>> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0
>>> and com.google.protobuf:protobuf-java:3.5.1
>>>
>>> Running the test still fails since other dependencies are missing.
>>>
>>>
>>> On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka  wrote:
>>>
 For reference:
 I was able to make IntelliJ work with the master branch by doing the
 following steps:

1. Remove module :beam:vendor-sdks-java-extensions-protobuf from
intellij.
2. Adding

 :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
and 
 :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
to the appropriate modules at the top of the dependency list.


 On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:

> Adding the external jar in Intellij (2018.1) currently fails due to a
> duplicate source directory (sdks/java/extensions/protobuf/src/main/java).
>
> The build as such also fails, with:  error: warnings found and -Werror
> specified
>
> Ismaël found removing
> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
> as workaround.
>
>
> On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía 
> wrote:
>
>> Seems reasonable, but why exactly might we need the model (or
>> protobuf-related things) in the SDK in the future? Wasn't it supposed to be
>> translated into the Pipeline proto representation via the runners (and
>> in this case the dep resides on the runner side)?
>> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
>> >
>> > Got a fix[1] for Andrew's issue which turned out to be a release
>> blocker since it broke performing the release. Also fixed several minor
>> things like javadoc that were wrong with the release. Solving it allowed 
>> me
>> to do the publishing in parallel and cut the release time from 20+ mins 
>> to
>> 8 mins on my machine.
>> >
>> > 1: https://github.com/apache/beam/pull/5936
>> >
>> > On Wed, Jul 11, 2018 at 3:51 PM Andrew Pilloud 
>> wrote:
>> >>
>> >> We discussed this in person, sounds like my issue is known and
>> will be fixed shortly. I'm running builds with '-Ppublishing' because I
>> need to generate release artifacts for bundling the Beam SQL shell with 
>> the
>> Google Cloud SDK. Hope to eventually just use the Beam release, but we 
>> are
>> currently cutting a release off master every week to quickly iterate on 
>> bug
>> fixes.
>> >>
>> >> Andrew
>> >>

Re: Live coding & reviewing adventures

2018-07-18 Thread Holden Karau
That’s a thing I’ve been thinking about but haven’t had the time to do yet.
It’s a bit tricky because I don’t always know what I’m doing before I start
and remembering to go back and tag things after a long stream is hard.

On Tue, Jul 17, 2018 at 11:11 PM Ismaël Mejía  wrote:

> Have you thought about creating some sort of index page for your past
> live streams?
> At least for the non-review ones it can provide great value given that
> searching videos is not the easiest thing to do.
> On Wed, Jul 18, 2018 at 12:51 AM Holden Karau 
> wrote:
> >
> > Sure! I’ll respond with a pip freeze when I land.

also oops, forgot to do that, I’ll do it today.

>
> >
> > On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi 
> wrote:
> >>
> >> Could you publish the Python transitive deps some place that has the
> Beam-Flink runner working?
> >>
> >> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau 
> wrote:
> >>>
> >>> And I've got an hour to kill @ SFO today, so at the suggestion of some
> folks I'm going to do a more user-focused one, trying to get the TFT demo
> to work with the portable Flink runner (hopefully) -
> https://www.youtube.com/watch?v=wL9mvQeN36E
> >>>
> >>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
> wrote:
> 
>  Hi folks! I've been doing some live coding in my other projects and I
> figured I'd do some with Apache Beam as well.
> 
>  Today @ 3pm Pacific I'm going to be doing some impromptu exploration of
> better review tooling possibilities (looking at forking spark-pr-dashboard
> for other projects like beam and setting up mentionbot to work with ASF
> infra) - https://www.youtube.com/watch?v=ff8_jbzC8JI
> 
>  Next week (Thursday the 19th at 2pm pacific) I'm going to be working
> on trying to get easier dependency management for the Python portable
> runner in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
> 
>  If you're interested in seeing more of the development process, I hope
> you will join me :)
> 
>  P.S.
> 
>  You can also follow on twitch which does a better job of
> notifications https://www.twitch.tv/holdenkarau
> 
>  Also, one of the other things I do is "live reviews" of PRs, but they
> are generally opt-in and I don't have enough opt-ins from the Beam
> community to do live reviews in Beam; if you work on Beam and would be OK
> with me doing a live-streamed review of your PRs, let me know (if you're
> curious as to what they look like you can see some of them here in Spark land).
> 
>  --
>  Twitter: https://twitter.com/holdenkarau
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Twitter: https://twitter.com/holdenkarau
> >>
> >>
> > --
> > Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [portability] metrics interrogations

2018-07-18 Thread Lukasz Cwik
On Wed, Jul 18, 2018 at 7:01 AM Etienne Chauchot 
wrote:

> Hi,
> Luke, Alex, I have some questions about portable metrics; can you confirm
> them?
>
> 1 - As it is the SDK harness that will run the code of the UDFs, if a UDF
> defines a metric, then the SDK harness will give updates through gRPC calls
> to the runner so that the runner can update its metrics cells, right?
>

Yes.


>
> 2 - Alex, you mentioned in the proto and design doc that there will be no
> aggregation of metrics. But some runners (Spark/Flink) rely on
> accumulators, and when they are merged, it triggers the merging of the
> whole chain to the metric cells. I know that Dataflow does not do the same;
> it uses non-aggregated metrics and sends them to an aggregation service.
> Will there be a change of paradigm with portability for runners that merge
> themselves?
>

There will be local aggregation of metrics scoped to a bundle; after the
bundle is finished processing they are discarded. This will require some
kind of global aggregation support from a runner; whether that runner does
it via accumulators or via an aggregation service is up to the runner.

> 3 - Please confirm that the distinction between attempted and committed
> metrics is not the business of portable metrics. Indeed, it does not
> involve communication between the runner harness and the SDK harness as it
> is a runner-only matter. I mean, when a runner commits a bundle it just
> updates its committed metrics and does not need to inform the SDK harness.
> But, of course, when the user requests committed metrics through the SDK,
> then the SDK harness will ask the runner harness to give them.
>
>
You are correct in saying that during execution, the SDK does not
differentiate between attempted and committed metrics; only the runner
does. We still lack an API definition and contract for how an SDK would
query for metrics from a runner, but you're right in saying that an SDK
could request committed metrics and the runner would supply them somehow.
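
As a concrete reference point, this is what a user-defined metric looks like
in the Java SDK today; under portability, the SDK harness aggregates these
updates within a bundle and reports them to the runner over the Fn API
(pipeline wiring omitted):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class CountingFn extends DoFn<String, String> {
  private final Counter matched = Metrics.counter(CountingFn.class, "matched");

  @ProcessElement
  public void processElement(ProcessContext c) {
    if (c.element().contains("beam")) {
      matched.inc(); // reported as a metric update for the current bundle
    }
    c.output(c.element());
  }
}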


> Thanks
>
> Best
> Etienne
>
>
>
>


[portability] metrics interrogations

2018-07-18 Thread Etienne Chauchot
Hi,
Luke, Alex, I have some questions about portable metrics; can you confirm them?

1 - As it is the SDK harness that will run the code of the UDFs, if a UDF
defines a metric, then the SDK harness will give updates through gRPC calls
to the runner so that the runner can update its metrics cells, right?

2 - Alex, you mentioned in the proto and design doc that there will be no
aggregation of metrics. But some runners (Spark/Flink) rely on accumulators,
and when they are merged, it triggers the merging of the whole chain to the
metric cells. I know that Dataflow does not do the same; it uses
non-aggregated metrics and sends them to an aggregation service. Will there
be a change of paradigm with portability for runners that merge themselves?

3 - Please confirm that the distinction between attempted and committed
metrics is not the business of portable metrics. Indeed, it does not involve
communication between the runner harness and the SDK harness, as it is a
runner-only matter. I mean, when a runner commits a bundle it just updates
its committed metrics and does not need to inform the SDK harness. But, of
course, when the user requests committed metrics through the SDK, the SDK
harness will ask the runner harness to give them.

Thanks

Best
Etienne




Re: An update on Eugene

2018-07-18 Thread Ekrem Aksoy
Thank you for all of your contributions. Good luck with your new venture.

On Mon, Jul 16, 2018 at 10:17 PM Eugene Kirpichov 
wrote:

> Hi beamers,
>
> After 5.5 years working on data processing systems at Google, several of
> these years working on Dataflow and Beam, I am moving on to do something
> new (also at Google) in the area of programming models for machine
> learning. Anybody who worked with me closely knows how much I love building
> programming models, so I could not pass up on the opportunity to build a
> new one - I expect to have a lot of fun there!
>
> On the new team we very much plan to make things open-source when the time
> is right, and make use of Beam, just as TensorFlow does - so I will stay in
> touch with the community, and I expect that we will still work together on
> some things. However, Beam will no longer be the main focus of my work.
>
> I've made the decision a couple months ago and have spent the time since
> then getting things into a good state and handing over the community
> efforts in which I have played a particularly active role - they are in
> very capable hands:
> - Robert Bradshaw and Ankur Goenka on Google side are taking charge of
> Portable Runners (e.g. the Portable Flink runner).
> - Luke Cwik will be in charge of the future of Splittable DoFn. Ismael
> Mejia has also been involved in the effort and actively helping, and I
> believe he continues to do so.
> - The Beam IO ecosystem in general is in very good shape (perhaps the best
> in the industry) and does not need a lot of constant direction; and it has
> a great community (thanks JB, Ismael, Etienne and many others!) - however,
> on Google side, Chamikara Jayalath will take it over.
>
> It was a great pleasure working with you all. My last day formally on Beam
> will be this coming Friday, then I'll take a couple weeks of vacation and
> jump right in on the new team.
>
> Of course, if my involvement in something is necessary, I'm still
> available on all the same channels as always (email, Slack, Hangouts) -
> but, in general, please contact the folks mentioned above instead of me
> about the respective matters from now on.
>
> Thanks!
>


Re: Let's start getting rid of BoundedSource

2018-07-18 Thread Etienne Chauchot
On Tuesday, July 17, 2018 at 09:48 -0700, Eugene Kirpichov wrote:
> On Tue, Jul 17, 2018 at 2:49 AM Etienne Chauchot  wrote:
> > Hi Eugene
> > On Mon, Jul 16, 2018 at 07:52 -0700, Eugene Kirpichov wrote:
> > > Hi Etienne - thanks for catching this; indeed, I somehow missed that
> > > several runners actually do this same thing - it seemed to me like
> > > something that can be done in user code (because it involves combining
> > > estimated size + split in pretty much the same way),
> > 
> > When you say "user code", you mean IO writer code as opposed to runner
> > code, right?
> Correct: "user code" is what happens in the SDK or the user pipeline.
>  
> >  
> > 
> > 
> > > but I'm not so sure: even though many runners have a "desired
> > > parallelism" option or the like, not all of them do,
> > > so we can't use such an option universally.
> > 
> > Agree, cannot be universal
> > > Maybe then the right thing to do is to:
> > > - Use bounded SDFs for these
> > > - Change SDF @SplitRestriction API to take a desired number of splits as 
> > > a parameter, and introduce an API
> > > @EstimateOutputSizeBytes(element) valid only on bounded SDFs
> > Agree with the idea, but @EstimateOutputSizeBytes must return the size of
> > the dataset, not of an element.
> Please recall that the element here is e.g. a filename, or the name of a
> Bigtable table, or something like that - i.e. the element describes the
> dataset, and the restriction describes what part of the dataset.
> 
> If e.g. we have a PCollection of filenames and apply a ReadTextFn SDF 
> to it, and want the runner to know the
> total size of all files - the runner could insert some transforms to apply 
> EstimateOutputSize to each element and
> Sum.globally() them.

You're right, I misunderstood what you meant by element. The important thing is
that the runner can, at some point before calling @SplitRestriction, know the
size of the dataset, potentially with the Sum you mentioned.
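
A minimal sketch of that Sum-based size estimation using only today's Beam
Java API, given a Pipeline named pipeline (the file names and the
estimateSizeBytes helper are illustrative; the @EstimateOutputSizeBytes hook
itself does not exist yet):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Stand-in for the hypothetical @EstimateOutputSizeBytes(element) hook.
    static long estimateSizeBytes(String filename) {
      try {
        return Files.size(Paths.get(filename));
      } catch (IOException e) {
        return 0L; // unknown size; a real estimator would be smarter here
      }
    }

    // Per-element estimates summed into a single total the runner could
    // consult before deciding how many splits to ask for.
    PCollection<String> filenames = pipeline.apply(Create.of("a.txt", "b.txt"));
    PCollection<Long> totalSizeBytes = filenames
        .apply(MapElements.into(TypeDescriptors.longs())
            .via((String f) -> estimateSizeBytes(f)))
        .apply(Sum.longsGlobally());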

>  
> > On some runners, each worker is given a fixed amount of heap. Thus, it is
> > important that a runner can evaluate the size of the whole dataset to
> > determine the size of each split (so that a split fits in a worker's
> > memory) and thus tell the bounded SDF the number of desired splits.
> > > - Add some plumbing to the standard bounded SDF expansion so that 
> > > different runners can compute that parameter
> > > differently, the two standard ways being "split into given number of 
> > > splits" or "split based on the sub-linear
> > > formula of estimated size".
> > > 
> > > I think this would work, though this is somewhat more work than I 
> > > anticipated. Any alternative ideas?
> > +1 It will be very similar for an IO developer (@EstimateOutputSizeBytes 
> > will be similar to
> > source.getEstimatedSizeBytes(), and @SplitRestriction(desiredSplits) 
> > similar to source.split(desiredBundleSize))
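
To make the proposal concrete, here is a purely illustrative sketch of what
such a bounded SDF could look like - note that neither @EstimateOutputSizeBytes
nor a desiredSplits parameter on @SplitRestriction exists in Beam today:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;

    // Hypothetical API, for discussion only (@ProcessElement elided).
    class ReadFileFn extends DoFn<String, String> {

      // Proposed hook: estimated output size of the dataset that one
      // element (here, a filename) describes.
      @EstimateOutputSizeBytes
      public long estimateOutputSizeBytes(String filename) throws IOException {
        return Files.size(Paths.get(filename));
      }

      @GetInitialRestriction
      public OffsetRange getInitialRestriction(String filename) throws IOException {
        return new OffsetRange(0, Files.size(Paths.get(filename)));
      }

      // Proposed variant: the runner passes in how many splits it wants,
      // computed e.g. from estimated size / worker heap.
      @SplitRestriction
      public void splitRestriction(String filename, OffsetRange range,
          int desiredSplits, OutputReceiver<OffsetRange> out) {
        long chunk = Math.max(1, (range.getTo() - range.getFrom()) / desiredSplits);
        for (long start = range.getFrom(); start < range.getTo(); start += chunk) {
          out.output(new OffsetRange(start, Math.min(start + chunk, range.getTo())));
        }
      }
    }
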
> Yeah, I'm not sure it's actually a good thing that these APIs end up so
> similar to the old ones - I was hoping we could come up with something
> better - but it seems like there's no viable alternative at this point :)
> > Etienne
> > > On Mon, Jul 16, 2018 at 3:07 AM Etienne Chauchot  
> > > wrote:
> > > > Hi, 
> > > > thanks Eugene for analyzing and sharing that.
> > > > I have one comment inline
> > > > 
> > > > Etienne
> > > > 
> > > > On Sun, Jul 15, 2018 at 14:20 -0700, Eugene Kirpichov wrote:
> > > > > Hey beamers,
> > > > > I've always wondered whether the BoundedSource implementations in the
> > > > > Beam SDK are worth their complexity, or whether they could instead be
> > > > > converted to the much easier-to-code ParDo style, which is also more
> > > > > modular and allows you to very easily implement readAll().
> > > > > 
> > > > > There's a handful: file-based sources, BigQuery, Bigtable, HBase, 
> > > > > Elasticsearch, MongoDB, Solr and a couple
> > > > > more.
> > > > > 
> > > > > Curiously enough, BoundedSource vs. ParDo matters *only* on Dataflow, 
> > > > > because AFAICT Dataflow is the only
> > > > > runner that cares about the things that BoundedSource can do and 
> > > > > ParDo can't:
> > > > > - size estimation (used to choose an initial number of workers) [ok, 
> > > > > Flink calls the function to return
> > > > > statistics, but doesn't seem to do anything else with it]
> > > > 
> > > > => Spark uses size estimation to set desired bundle size with something 
> > > > like desiredBundleSize = estimatedSize /
> > > > nbOfWorkersConfigured (partitions)
> > > > See 
> > > > https://github.com/apache/beam/blob/a5634128d194161aebc8d03229fdaa1066cf7739/runners/spark/src/main/java/org
> > > > /apache/beam/runners/spark/io/SourceRDD.java#L101
> > > > 
> > > > 
> > > > > - splitting into bundles of given size (Dataflow chooses the number 
> > > > > of bundles to create based on a simple
> > > > > formula that's not entirely unlike K*sqrt(size))
> > > > > - liquid sharding (splitAtFraction())
> > > > > 
> > > > > 
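
For reference, a hedged sketch of the two bundle-sizing heuristics that came
up in this thread; the constant K is an assumption, while the Spark-style
formula follows the SourceRDD line referenced above:

    // Dataflow-style: bundle count grows sublinearly with estimated size,
    // roughly K*sqrt(size) as mentioned above (K = 0.05 is an assumption).
    static int bundlesBySqrt(long estimatedSizeBytes) {
      final double K = 0.05;
      return (int) Math.max(1, (long) (K * Math.sqrt(estimatedSizeBytes)));
    }

    // Spark-style, as in SourceRDD: desired bundle size is the estimated
    // total divided by the configured parallelism.
    static long desiredBundleSizeBytes(long estimatedSizeBytes, int numPartitions) {
      return Math.max(1L, estimatedSizeBytes / numPartitions);
    }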

Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #104

2018-07-18 Thread Apache Jenkins Server
See 


Changes:

[apilloud] [SQL] Default timezone is UTC

[aromanenko.dev] [BEAM-4622] Makes required to call Beam SQL expressions 
validation

[aromanenko.dev] Check number of arguments at first

[elliottb] Fix the expected encoding of BigQuery's NUMERIC type when reading 
from

[apilloud] [BEAM-4774] Add Nexmark SQL to postcommits

[apilloud] [SQL] Actually run JDBC Jar Test

[apilloud] [SQL] Wrap PipelineOptions with correct class loader

[boyuanz] Automate release branch cut process

[lcwik] [BEAM-4744] Fix runners/google-cloud-dataflow-java/examples-streaming

--
[...truncated 18.77 MB...]
:beam-sdks-java-maven-archetypes-starter:compileTestJava (Thread[Task worker 
for ':' Thread 3,5,main]) completed. Took 0.0 secs.
:beam-sdks-java-maven-archetypes-starter:processTestResources (Thread[Task 
worker for ':' Thread 3,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:processTestResources UP-TO-DATE
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:processTestResources' is 
f74f3200edf284b276c50da93794d928
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:processTestResources': Caching has 
not been enabled for the task
Skipping task ':beam-sdks-java-maven-archetypes-starter:processTestResources' 
as it is up-to-date.
:beam-sdks-java-maven-archetypes-starter:processTestResources (Thread[Task 
worker for ':' Thread 3,5,main]) completed. Took 0.002 secs.
:beam-sdks-java-maven-archetypes-starter:testClasses (Thread[Task worker for 
':' Thread 3,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:testClasses UP-TO-DATE
Skipping task ':beam-sdks-java-maven-archetypes-starter:testClasses' as it has 
no actions.
:beam-sdks-java-maven-archetypes-starter:testClasses (Thread[Task worker for 
':' Thread 3,5,main]) completed. Took 0.0 secs.
:beam-sdks-java-maven-archetypes-starter:shadowTestJar (Thread[Task worker for 
':' Thread 3,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:shadowTestJar
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:shadowTestJar' is 
f52e87e48f15c0156d6979015cff00d4
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:shadowTestJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:shadowTestJar' is not up-to-date 
because:
  No history is available.
***
GRADLE SHADOW STATS

Total Jars: 1 (includes project)
Total Time: 0.0s [0ms]
Average Time/Jar: 0.0s [0.0ms]
***
:beam-sdks-java-maven-archetypes-starter:shadowTestJar (Thread[Task worker for 
':' Thread 3,5,main]) completed. Took 0.008 secs.
:beam-sdks-java-maven-archetypes-starter:sourcesJar (Thread[Task worker for ':' 
Thread 3,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:sourcesJar
file or directory 
'
 not found
Build cache key for task ':beam-sdks-java-maven-archetypes-starter:sourcesJar' 
is a106f15937cacfee668e25636b705e03
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:sourcesJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:sourcesJar' is not up-to-date 
because:
  No history is available.
file or directory 
'
 not found
:beam-sdks-java-maven-archetypes-starter:sourcesJar (Thread[Task worker for ':' 
Thread 3,5,main]) completed. Took 0.004 secs.
:beam-sdks-java-maven-archetypes-starter:testSourcesJar (Thread[Task worker for 
':' Thread 3,5,main]) started.

> Task :beam-sdks-java-maven-archetypes-starter:testSourcesJar
file or directory 
'
 not found
Build cache key for task 
':beam-sdks-java-maven-archetypes-starter:testSourcesJar' is 
58715d6b8e221cace68f230ccfd69fd4
Caching disabled for task 
':beam-sdks-java-maven-archetypes-starter:testSourcesJar': Caching has not been 
enabled for the task
Task ':beam-sdks-java-maven-archetypes-starter:testSourcesJar' is not 
up-to-date because:
  No history is available.
file or directory 
'
 not found
:beam-sdks-java-maven-archetypes-starter:testSourcesJar (Thread[Task worker for 
':' Thread 3,5,main]) completed. Took 0.004 secs.
:beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication (Thread[Task 
worker for ':' Thread 10,5,main]) started.

> Task :beam-sdks-java-nexmark:generatePomFileForMavenJavaPublication
Build cache key for task 

Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-18 Thread Etienne Chauchot
Hi Andrew,
Yes, I saw that; apart from dedicating Jenkins nodes to Nexmark, I see no other way.
Also, did you see the query 3 output size on the direct runner? It should be a
straight line and it is not; I'm wondering if there is a problem with the state
and timers implementation in the direct runner.
Etienne
On Tue, Jul 17, 2018 at 11:38 -0700, Andrew Pilloud wrote:
> I'm noticing the graphs are really noisy. It looks like we are running these 
> on shared Jenkins executors, so our perf
> tests are fighting with other builds for CPU. I've opened an issue 
> https://issues.apache.org/jira/browse/BEAM-4804 and
> am wondering if anyone knows an easy fix to isolate these jobs.
> Andrew
> On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  wrote:
> > @Etienne: Nice to see the graphs! :)
> > 
> > @Ismael: Good idea, there's no document yet. I think we could create a 
> > small google doc with instructions on how to
> > do this.
> > 
> > On Fri, Jul 13, 2018 at 10:46, Etienne Chauchot wrote:
> > > Hi,
> > > @Andrew, this is because I did not find a way to set 2 scales on the Y
> > > axis on the perfkit graphs. Indeed, numResults varies from 1 to 100,000
> > > and runtimeSec is usually below 10s.
> > > Etienne
> > > On Thu, Jul 12, 2018 at 12:04 -0700, Andrew Pilloud wrote:
> > > > This is great, should make performance work much easier! I'm going to 
> > > > get the Beam SQL Nexmark jobs publishing
> > > > as well. (Opened https://issues.apache.org/jira/browse/BEAM-4774 to 
> > > > track.) I might take on the Dataflow runner
> > > > as well if no one else volunteers.
> > > > 
> > > > I am curious as to why you have two separate graphs for runtime and
> > > > count rather than graphing runtime/count to
> > > > get the throughput rate for each run? Or should that be a third graph?
> > > > Looks like it would just be a small tweak
> > > > to the query in perfkit.
> > > > Andrew
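
The derived metric Andrew suggests is just a per-run ratio; a hedged sketch
(the field names are illustrative, not the actual perfkit schema):

    // Throughput of one Nexmark run, in output elements per second.
    static double throughputPerSec(long numResults, double runtimeSec) {
      return runtimeSec > 0 ? numResults / runtimeSec : 0.0;
    }
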
> > > > On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada  
> > > > wrote:
> > > > > This is really cool, Etienne :) thanks for working on this. Out of
> > > > > curiosity, do you know how often the tests run on each runner?
> > > > > 
> > > > > Best
> > > > > -P.
> > > > > 
> > > > > On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> > > > >  wrote:
> > > > > > Awesome Etienne, it is really important for the (user) community
> > > > > > to have that visibility since it is one
> > > > > > of the most important aspects of Beam's quality, kudos!
> > > > > > 
> > > > > > 
> > > > > > Romain Manni-Bucau
> > > > > > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> > > > > > 
> > > > > > On Thu, Jul 12, 2018 at 10:59, Jean-Baptiste Onofré wrote:
> > > > > > > It's really great to have these dashboards and integration in
> > > > > > > Jenkins!
> > > > > > > 
> > > > > > > Thanks Etienne for driving this!
> > > > > > > 
> > > > > > > Regards
> > > > > > > JB
> > > > > > > 
> > > > > > > On 11/07/2018 15:13, Etienne Chauchot wrote:
> > > > > > > > Hi guys,
> > > > > > > > 
> > > > > > > > I'm glad to announce that the CI of Beam has much improved! Indeed,
> > > > > > > > Nexmark is now included in the perfkit dashboards.
> > > > > > > > 
> > > > > > > > At each commit on master, the Nexmark suites are run and plots are
> > > > > > > > created on the graphs.
> > > > > > > > 
> > > > > > > > I've created 2 kinds of dashboards:
> > > > > > > > - one for performance (run times of the queries)
> > > > > > > > - one for the size of the output PCollection (which should be constant)
> > > > > > > > 
> > > > > > > > There are dashboards for these runners:
> > > > > > > > - spark
> > > > > > > > - flink
> > > > > > > > - direct runner
> > > > > > > > 
> > > > > > > > Each dashboard contains:
> > > > > > > > - graphs in batch mode
> > > > > > > > - graphs in streaming mode
> > > > > > > > - graphs for the 13 queries.
> > > > > > > > 
> > > > > > > > That gives more than a hundred graphs (my right finger hurts after so
> > > > > > > > many clicks on the mouse :) ). It is that detailed so that anyone
> > > > > > > > can focus on the area they are interested in.
> > > > > > > > Feel free to also create new dashboards with more aggregated data.
> > > > > > > > 
> > > > > > > > Thanks to Lukasz and Cham for reviewing my PRs and showing how to use

Re: Live coding & reviewing adventures

2018-07-18 Thread Ismaël Mejía
Have you thought about creating some sort of index page for your past
live streams?
At least for the non-review ones it can provide great value given that
searching videos is not the easiest thing to do.
On Wed, Jul 18, 2018 at 12:51 AM Holden Karau  wrote:
>
> Sure! I’ll respond with a pip freeze when I land.
>
> On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi  wrote:
>>
>> Could you publish the Python transitive deps some place that has the
>> Beam-Flink runner working?
>>
>> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau  wrote:
>>>
>>> And I've got an hour to kill @ SFO today, so at the suggestion of some
>>> folks I'm going to do a more user-focused one, trying to get the TFT demo
>>> to work with the portable Flink runner (hopefully) -
>>> https://www.youtube.com/watch?v=wL9mvQeN36E
>>>
>>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau  wrote:

 Hi folks! I've been doing some live coding in my other projects and I 
 figured I'd do some with Apache Beam as well.

 Today @ 3pm Pacific I'm going to be doing some impromptu exploration of better
 review tooling possibilities (looking at forking spark-pr-dashboard for
 other projects like Beam and setting up mentionbot to work with ASF infra)
 - https://www.youtube.com/watch?v=ff8_jbzC8JI

 Next week (Thursday the 19th at 2pm Pacific) I'm going to be working on
 trying to get easier dependency management for the Python portable runner
 in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA

 If you're interested in seeing more of the development process, I hope you
 will join me :)

 P.S.

 You can also follow on twitch which does a better job of notifications 
 https://www.twitch.tv/holdenkarau

 Also, one of the other things I do is "live reviews" of PRs, but they are
 generally opt-in and I don't have enough opt-ins from the Beam community
 to do live reviews in Beam. If you work on Beam and would be OK with me
 doing a live-streamed review of your PRs, let me know (if you're curious
 what they look like, you can see some of them here in Spark land).

 --
 Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>
>>
> --
> Twitter: https://twitter.com/holdenkarau