Re: [PROPOSAL] CI improvement: be able to run the IT of the IOs from github pull request

2018-05-30 Thread Kenneth Knowles
This all seems extremely useful. Is there some action to be taken other
than advertising these related JIRAs?

Kenn

On Wed, May 30, 2018 at 5:45 AM Łukasz Gajowy 
wrote:

> +1 to generalizing IT. I think the tests you mentioned were developed
> earlier than the general idea of what the IOIT should look like emerged.
> AFAIK the same goes for the tests in io/google-cloud-platform module. I
> recently created some issues that address that [1], [2], [3]. If there's
> anyone willing to take those - feel free (I can help with this).
>
> [1] https://issues.apache.org/jira/browse/BEAM-4416
> [2] https://issues.apache.org/jira/browse/BEAM-4399
> [3] https://issues.apache.org/jira/browse/BEAM-4398
>
> 2018-05-30 14:00 GMT+02:00 Etienne Chauchot :
>
>> Hi Łukasz
>>
>> Thanks for the details.
>>
>> I was more thinking about generalizing IT test integration. For example
>> some IOs like Cassandra and Elasticsearch have IT but no groovy scripts.
>> Also I agree with your list
>> And thanks for the details about backend services automatic provisioning,
>> I did not know that.
>>
>> Etienne
>> Le mercredi 30 mai 2018 à 11:21 +0200, Łukasz Gajowy a écrit :
>>
>> Hi Etienne,
>>
>> it is already possible, provided that there is an appropriate Jenkins job
>> defined (see examples here: [1],[2]). Either the reviewer or the author can
>> run the seed job to load job definitions (by typing "Run seed job" in a
>> comment) and then run the test they are interested in (by specifying
>> the correct phrase in the GitHub comment, e.g. "Run Java JdbcIO Performance
>> Test"). The results are then available on Jenkins so those are public too.
>>
>> Regarding the infrastructure: currently, if a test requires any
>> Kubernetes infrastructure, it is set up by the PerfKitBenchmarker tool before
>> the test is actually run. After the test execution, all the infrastructure
>> is torn down. This is also done automatically, provided that all necessary
>> Kubernetes scripts are there.
>>
>> Despite the fact that it is possible, I must say that the whole
>> "Performance Testing Framework" needs improvement in the following areas
>> (so it should be considered a work in progress):
>>  - documentation and instructions for the community (this is getting more
>> urgent!)
>>  - support for other runners (currently only direct and Dataflow are
>> supported, as there were some issues when we tried to integrate it with
>> Spark and Flink)
>>  - support for other filesystems (currently only local and HDFS are
>> supported)
>>  - rename and reorganize IT jobs in Jenkins (see: [3])
>>
>> Also, I think it's worth looking at improvements to the job definitions
>> (seed jobs overwrite all jobs, so this can collide with other developers'
>> work). See the thread I started a while ago in [4] for further info.
>>
>> Best regards,
>> Łukasz Gajowy
>>
>>
>> [1]
>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
>> [2]
>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_FileBasedIO_IT.groovy
>> [3] https://issues.apache.org/jira/browse/BEAM-4298
>> [4]
>> https://lists.apache.org/thread.html/b1aaea2c7eadc7ca1d1326b94a8c4c3a67befc0753897fd7fa4a3a4e@%3Cdev.beam.apache.org%3E
>>
>> 2018-05-30 10:14 GMT+02:00 Etienne Chauchot :
>>
>> Hi guys
>> As part of the CI improvement work, I would suggest enabling running the
>> integration tests of the IOs from the GitHub PR.
>>
>> Indeed, when doing a review, either the reviewer or the author needs to
>> run the IT. The problem is that the results are private. It would be good
>> to be able to run IT using a phrase in github (like the validates runner
>> tests) to have the results public like any other test in the PR.
>> But it would require the backend IT infrastructures (Kubernetes/Docker
>> ...) to be always up and also to set their credentials/location in the
>> related Jenkins Groovy script.
>>
>> I opened:
>> https://issues.apache.org/jira/browse/BEAM-4427
>>
>> Thoughts?
>>
>> Best
>> Etienne
>>
>>
>>
>
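
The comment-phrase mechanism Łukasz describes in the quoted message above can be illustrated with a small sketch. This is only an illustration of the idea (a phrase in a PR comment selects a job), not how Jenkins or its GitHub integration is implemented. The two phrases are quoted from the thread; the name beam_PerformanceTests_JDBC is an assumption derived from the Groovy file name in [1], and the function name is hypothetical.

```python
import re

# Hypothetical mapping from trigger phrases to Jenkins job names.
# The phrases come from the thread; the second job name is an assumption
# derived from the Groovy file name and has not been checked against Jenkins.
TRIGGER_PHRASES = {
    "Run seed job": "beam_SeedJob",
    "Run Java JdbcIO Performance Test": "beam_PerformanceTests_JDBC",
}


def jobs_triggered_by(comment):
    """Return the jobs whose trigger phrase appears in a PR comment."""
    triggered = []
    for phrase, job in TRIGGER_PHRASES.items():
        # Match the whole phrase, ignoring case and repeated whitespace.
        words = [re.escape(word) for word in phrase.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        if re.search(pattern, comment, flags=re.IGNORECASE):
            triggered.append(job)
    return triggered
```

With this sketch, a comment containing "Run seed job" selects beam_SeedJob, and a comment without any known phrase selects nothing.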


Jenkins build is back to normal : beam_SeedJob #1828

2018-05-30 Thread Apache Jenkins Server
See 



Re: Java compiler OOMs on Jenkins/Gradle

2018-05-30 Thread Lukasz Cwik
Try running without a daemon (use the --no-daemon flag) to see whether the
Gradle daemon you have been using is overloaded.

On Wed, May 30, 2018 at 5:11 PM Ankur Goenka  wrote:

> I am facing OOM while locally building the project using Gradle. Here is
> the scan https://scans.gradle.com/s/t3n42rw5666us
> The issue is happening from :rat task.
> Is this issue related?
>
> On Tue, May 1, 2018 at 4:40 PM Scott Wegner  wrote:
>
>> Sorry about the instability. We need to get the Gradle jobs tuned for our
>> Jenkins machines, and there's no way to test my configuration changes
>> without affecting all jobs :-/
>>
>> The changes I'm making are here: https://github.com/apache/beam/pull/5218
>>
>>
>> It seems that they're still not quite right: the intent is to allocate
>> half the memory to each job, and then divide it up by worker. But for 16
>> workers each is getting ~1GB, even though the machines have 100GB total. I
>> suspect I'm just calling the wrong API.
>>
>> On Tue, May 1, 2018 at 1:41 PM Eugene Kirpichov 
>> wrote:
>>
>>> Thanks! FWIW seems that my other Jenkins build is about to fail with the
>>> same issue
>>> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4806/ -
>>> "Expiring Daemon because JVM Tenured space is exhausted"
>>>
>>> On Tue, May 1, 2018 at 1:36 PM Lukasz Cwik  wrote:
>>>
>>>> +sweg...@google.com who is currently messing around with tuning some
>>>> Gradle flags related to the JVM and its memory usage.
>>>>
>>>> On Tue, May 1, 2018 at 1:34 PM Eugene Kirpichov
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've seen the same issue twice in a row on PR
>>>>> https://github.com/apache/beam/pull/4264 : the Java precommit fails
>>>>> with messages like:
>>>>>
>>>>> > Task :beam-sdks-java-core:compileTestJava
>>>>> An exception has occurred in the compiler ((version info not
>>>>> available)). Please file a bug against the Java compiler via the Java bug
>>>>> reporting page (http://bugreport.java.com) after checking the Bug
>>>>> Database (http://bugs.java.com) for duplicates. Include your program
>>>>> and the following diagnostic in your report. Thank you.
>>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>>
>>>>> Full build link:
>>>>> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4803/consoleFull
>>>>>
>>>>> Anybody know what's up with that? I thought we got new powerful
>>>>> Jenkins executors and we shouldn't be running out of memory? However, I
>>>>> see that the build specifies -Dorg.gradle.jvmargs=-Xmx512m - this seems
>>>>> too small. Should we increase this?
>>>>>
>>>>> Thanks.
>>>>>
>>>> --
>>
>>
>> Got feedback? http://go/swegner-feedback
>>
>
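
The allocation Scott describes in the quoted message above (half of the machine's memory per job, divided evenly among workers) can be worked through as a quick check. This is a sketch of the stated intent only, not the code in the pull request; the function and parameter names are hypothetical, and "half the memory" is assumed to mean half of the 100GB total.

```python
def per_worker_heap_gb(machine_gb, job_share, workers):
    """Memory each Gradle worker would get under the stated intent."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    job_gb = machine_gb * job_share  # memory reserved for one job
    return job_gb / workers          # split evenly across the workers


# Figures from the thread: 100 GB machine, half per job, 16 workers.
intended = per_worker_heap_gb(100, 0.5, 16)
print(f"intended: {intended:.3f} GB per worker")
```

Under these assumptions each worker would get a little over 3GB, so the observed ~1GB per worker is roughly a third of what was intended, which is consistent with Scott's suspicion that the wrong API is being called.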


Build failed in Jenkins: beam_SeedJob #1827

2018-05-30 Thread Apache Jenkins Server
See 

--
GitHub pull request #5406 of commit 05939d2358b3ca778636ffdf58396fead4b38e49, 
no merge conflicts.
Setting status of 05939d2358b3ca778636ffdf58396fead4b38e49 to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/1827/ and message: 'Build started 
sha1 is merged.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam12 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/5406/*:refs/remotes/origin/pr/5406/*
 > git rev-parse refs/remotes/origin/pr/5406/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/5406/merge^{commit} # timeout=10
Checking out Revision 8b3c8cedd10152cd76de04b9cc0e3b0b6d1e4f22 
(refs/remotes/origin/pr/5406/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 8b3c8cedd10152cd76de04b9cc0e3b0b6d1e4f22
Commit message: "Merge 05939d2358b3ca778636ffdf58396fead4b38e49 into 
123c20e9ad754cc1142a0aed17edbbef637b6176"
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
ERROR: startup failed:
job_Dependency_Check.groovy: 61: unexpected token: SCRIPT @ line 61, column 21.
 content(${SCRIPT, template="groovy-text.template"})
   ^

1 error




Re: Java compiler OOMs on Jenkins/Gradle

2018-05-30 Thread Ankur Goenka
I am facing OOM while locally building the project using Gradle. Here is
the scan https://scans.gradle.com/s/t3n42rw5666us
The issue is happening from :rat task.
Is this issue related?

On Tue, May 1, 2018 at 4:40 PM Scott Wegner  wrote:

> Sorry about the instability. We need to get the Gradle jobs tuned for our
> Jenkins machines, and there's no way to test my configuration changes
> without affecting all jobs :-/
>
> The changes I'm making are here: https://github.com/apache/beam/pull/5218
>
> It seems that they're still not quite right: the intent is to allocate
> half the memory to each job, and then divide it up by worker. But for 16
> workers each is getting ~1GB, even though the machines have 100GB total. I
> suspect I'm just calling the wrong API.
>
> On Tue, May 1, 2018 at 1:41 PM Eugene Kirpichov 
> wrote:
>
>> Thanks! FWIW seems that my other Jenkins build is about to fail with the
>> same issue
>> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4806/ -
>> "Expiring Daemon because JVM Tenured space is exhausted"
>>
>> On Tue, May 1, 2018 at 1:36 PM Lukasz Cwik  wrote:
>>
>>> +sweg...@google.com who is currently messing around with tuning some
>>> Gradle flags related to the JVM and its memory usage.
>>>
>>> On Tue, May 1, 2018 at 1:34 PM Eugene Kirpichov 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've seen the same issue twice in a row on PR
>>>> https://github.com/apache/beam/pull/4264 : the Java precommit fails
>>>> with messages like:
>>>>
>>>> > Task :beam-sdks-java-core:compileTestJava
>>>> An exception has occurred in the compiler ((version info not
>>>> available)). Please file a bug against the Java compiler via the Java bug
>>>> reporting page (http://bugreport.java.com) after checking the Bug
>>>> Database (http://bugs.java.com) for duplicates. Include your program
>>>> and the following diagnostic in your report. Thank you.
>>>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>>
>>>> Full build link:
>>>> https://builds.apache.org/job/beam_PreCommit_Java_GradleBuild/4803/consoleFull
>>>>
>>>> Anybody know what's up with that? I thought we got new powerful Jenkins
>>>> executors and we shouldn't be running out of memory? However, I see that
>>>> the build specifies -Dorg.gradle.jvmargs=-Xmx512m - this seems too
>>>> small. Should we increase this?
>>>>
>>>> Thanks.

>>> --
>
>
> Got feedback? http://go/swegner-feedback
>


Re: parquet/beam

2018-05-30 Thread Chamikara Jayalath
On Wed, May 30, 2018 at 4:43 PM Lukasz Cwik  wrote:

> For Python Parquet support, hopefully we can have cross language pipelines
> solve this so we only need to implement it once. If it is really popular,
> having it implemented more than once may be worthwhile.
>

I'd say Parquet format is popular enough to warrant a Python implementation
:). Not sure if there are good candidate client libraries for Python though.


> Would the point of Arrow be to treat it as an IO connector similar to
> ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow
> integration is)?
>
> Every C library adds some difficulty for users to test out their pipelines
> locally unless the C library was cross compiled for several distributions.
> Using C libraries increases the need for using a container like Docker for
> execution.
>

Usually we've preferred libraries that can be directly installed from PyPI
over libraries that have more complicated deployment models (native
compilation, Conda etc). This will make the connector easily available for
various runner/user deployments.


>
>
> On Wed, May 30, 2018 at 1:56 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> I can see great use cases with s3/Parquet - so that's a great addition
>> (which JB is addressing, for Java)!
>>
>> It would be even more ideal for the use cases I find myself around for
>> there to be python parquet support, so for perhaps this next release:
>> Would it make sense to be exploring: https://arrow.apache.org ?  I'd be
>> happy to explore proper procedure for design/feature proposal and
>> documentation for Beam, how to scope and develop it.
>>
>> Also, from the little I've looked at actual implementation, it appears
>> that (py)arrow relies on underlying C binaries, which was listed as a
>> problem or at least a point against choice of package with the developing
>> python/kafka source.  How big an issue is that -- what else should I be
>> considering?  Guidance absolutely welcomed!
>>
>


Re: parquet/beam

2018-05-30 Thread Lukasz Cwik
For Python Parquet support, hopefully we can have cross language pipelines
solve this so we only need to implement it once. If it is really popular,
having it implemented more than once may be worthwhile.

Would the point of Arrow be to treat it as an IO connector similar to
ParquetIO or JdbcIO (I was wondering what the purpose of the Arrow
integration is)?

Every C library adds some difficulty for users to test out their pipelines
locally unless the C library was cross compiled for several distributions.
Using C libraries increases the need for using a container like Docker for
execution.


On Wed, May 30, 2018 at 1:56 PM Austin Bennett 
wrote:

> I can see great use cases with s3/Parquet - so that's a great addition
> (which JB is addressing, for Java)!
>
> It would be even more ideal for the use cases I find myself around for
> there to be python parquet support, so for perhaps this next release:
> Would it make sense to be exploring: https://arrow.apache.org ?  I'd be
> happy to explore proper procedure for design/feature proposal and
> documentation for Beam, how to scope and develop it.
>
> Also, from the little I've looked at actual implementation, it appears
> that (py)arrow relies on underlying C binaries, which was listed as a
> problem or at least a point against choice of package with the developing
> python/kafka source.  How big an issue is that -- what else should I be
> considering?  Guidance absolutely welcomed!
>


Jenkins build is back to stable : beam_SeedJob #1822

2018-05-30 Thread Apache Jenkins Server
See 



Jenkins build became unstable: beam_SeedJob #1821

2018-05-30 Thread Apache Jenkins Server
See 



Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-30 Thread Łukasz Gajowy
Regarding ParquetIO on S3: I am investigating the issue. It seems that it
never worked on s3 (I didn't expect that). Currently, I'm trying to
understand why it behaves differently than on other filesystems (HDFS,
local). Any help appreciated.

Regarding ParquetIO on HDFS: I was able to run it on my machine
successfully. I also created a PR with HDFS Performance test for Parquet
(and it is passing too): https://github.com/apache/beam/pull/5520. Hope
this will be helpful!

Best regards,
Łukasz



2018-05-31 0:41 GMT+02:00 Robert Bradshaw :

> On Wed, May 30, 2018 at 12:59 PM Ahmet Altay  wrote:
>
>> Thank you JB.
>>
>> For clarification, are you referring to the following items:
>> - RabbitMqIO - https://github.com/apache/beam/pull/1729
>> -  ParquetIO on HDFS/S3 - https://issues.apache.org/jira/browse/BEAM-4421
>>
>> If the above mapping is correct, could we separate addition of new
>> feature from addressing blocking issues? I would propose that we do not
>> block the release for the former one and fix the latter one before the
>> release.
>>
>> On Tue, May 29, 2018 at 10:26 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi,
>>>
>>> I would like to merge RabbitMqIO (we are doing the final touches) and we
>>> have an issue about ParquetIO on HDFS/S3 that I would like to
>>> investigate with the team.
>>>
>>
>> Do you know who is currently investigating the ParquetIO issue? Do you
>> need help with that?
>>
>
> Do we know if this is a regression, or has it never worked?
>
>
>> I plan to start the release process asap, hopefully later today.
>>>
>>
> That would be great. A lot has happened since the last release [1] and
> we've had a pretty good cadence so far in 2018 so it'd be nice to get this
> out in to the hands of our users. And thanks for volunteering to do the
> release!
>
> - Robert
>
>
> [1] https://github.com/apache/beam/compare/release-2.4.0...master
>
>
>
>
>>
>>> Regards
>>> JB
>>>
>>> On 29/05/2018 23:00, Ahmet Altay wrote:
>>> > Thank you JB for the update. Could we start the release process now? Is
>>> > there any way I could help with moving the release forward?
>>> >
>>> > On Fri, May 25, 2018 at 8:19 AM, Lukasz Cwik wrote:
>>> >
>>> > Thanks for the update JB.
>>> >
>>> > Kenn, we have the post commit integration tests which run against
>>> > shaded artifacts like validates runner. We also have the nightly
>>> > snapshot and its verification run which validates the nightly
>>> > snapshot with DirectRunner / Dataflow / Apex / Spark / Flink for
>>> > WordCount and DirectRunner / Dataflow for the mobile gaming
>>> examples.
>>> >
>>> > I'm not sure about the IOs and whether the perfkit benchmark work
>>> > adequately covers them.
>>> >
>>> >
>>> > On Fri, May 25, 2018 at 1:28 AM Jean-Baptiste Onofré
>>> > <j...@nanthrax.net> wrote:
>>> >
>>> > Hi Luke,
>>> >
>>> > I tested the following build:
>>> >
>>> > ./gradlew publishToMavenLocal -PisRelease --no-parallel
>>> >
>>> > The artifacts are present in my .m2/repository.
>>> >
>>> > For instance, I can see:
>>> >
>>> > .m2/repository/org/apache/beam/beam-sdks-java-core/2.5.0$ ls -l
>>> > total 16256
>>> >  beam-sdks-java-core-2.5.0.jar
>>> >  beam-sdks-java-core-2.5.0.jar.asc
>>> >  beam-sdks-java-core-2.5.0-javadoc.jar
>>> >  beam-sdks-java-core-2.5.0-javadoc.jar.asc
>>> >  beam-sdks-java-core-2.5.0.pom
>>> >  beam-sdks-java-core-2.5.0.pom.asc
>>> >  beam-sdks-java-core-2.5.0-sources.jar
>>> >  beam-sdks-java-core-2.5.0-sources.jar.asc
>>> >  beam-sdks-java-core-2.5.0-tests.jar
>>> >  beam-sdks-java-core-2.5.0-tests.jar.asc
>>> >  beam-sdks-java-core-2.5.0-test-sources.jar
>>> >  beam-sdks-java-core-2.5.0-test-sources.jar.asc
>>> >
>>> > 1. The signatures are OK:
>>> >
>>> > gpg --verify beam-sdks-java-core-2.5.0.jar.asc beam-sdks-java-core-2.5.0.jar
>>> > gpg: Signature made jeu. 24 mai 2018 16:55:11 CEST
>>> > gpg: using RSA key 1AA8CF92D409A73393D0B736BFF2EE42C8282E76
>>> > gpg: Good signature from "Jean-Baptiste Onofré
>>> > <jbono...@apache.org>" [unknown]
>>> >
>>> > 2. The pom looks correct to me but it's not optimal because
>>> >
>>> > 2.1. There's no parent definition, so each pom duplicates the same
>>> > configurations (like scm, license, etc)
>>> > 2.2. There's no Maven plugin configuration, even if it's not
>>> > used for
>>> > the build, other tools can parse and use plugin configuration
>>> > (like the
>>> > source/target version, etc).
>>> >
>>> > So, even if it's not optimal, the pom looks overall good.
>>> >
>>> > I think it makes sense to move forward on the release as it is
>>> > right now.
>>> >

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-30 Thread Robert Bradshaw
On Wed, May 30, 2018 at 12:59 PM Ahmet Altay  wrote:

> Thank you JB.
>
> For clarification, are you referring to the following items:
> - RabbitMqIO - https://github.com/apache/beam/pull/1729
> -  ParquetIO on HDFS/S3 - https://issues.apache.org/jira/browse/BEAM-4421
>
> If the above mapping is correct, could we separate addition of new feature
> from addressing blocking issues? I would propose that we do not block the
> release for the former one and fix the latter one before the release.
>
> On Tue, May 29, 2018 at 10:26 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> I would like to merge RabbitMqIO (we are doing the final touches) and we
>> have an issue about ParquetIO on HDFS/S3 that I would like to
>> investigate with the team.
>>
>
> Do you know who is currently investigating the ParquetIO issue? Do you
> need help with that?
>

Do we know if this is a regression, or has it never worked?


> I plan to start the release process asap, hopefully later today.
>>
>
That would be great. A lot has happened since the last release [1] and
we've had a pretty good cadence so far in 2018 so it'd be nice to get this
out in to the hands of our users. And thanks for volunteering to do the
release!

- Robert


[1] https://github.com/apache/beam/compare/release-2.4.0...master




>
>> Regards
>> JB
>>
>> On 29/05/2018 23:00, Ahmet Altay wrote:
>> > Thank you JB for the update. Could we start the release process now? Is
>> > there any way I could help with moving the release forward?
>> >
>> > On Fri, May 25, 2018 at 8:19 AM, Lukasz Cwik wrote:
>> >
>> > Thanks for the update JB.
>> >
>> > Kenn, we have the post commit integration tests which run against
>> > shaded artifacts like validates runner. We also have the nightly
>> > snapshot and its verification run which validates the nightly
>> > snapshot with DirectRunner / Dataflow / Apex / Spark / Flink for
>> > WordCount and DirectRunner / Dataflow for the mobile gaming
>> examples.
>> >
>> > I'm not sure about the IOs and whether the perfkit benchmark work
>> > adequately covers them.
>> >
>> >
>> > On Fri, May 25, 2018 at 1:28 AM Jean-Baptiste Onofré
>> > <j...@nanthrax.net> wrote:
>> >
>> > Hi Luke,
>> >
>> > I tested the following build:
>> >
>> > ./gradlew publishToMavenLocal -PisRelease --no-parallel
>> >
>> > The artifacts are present in my .m2/repository.
>> >
>> > For instance, I can see:
>> >
>> > .m2/repository/org/apache/beam/beam-sdks-java-core/2.5.0$ ls -l
>> > total 16256
>> >  beam-sdks-java-core-2.5.0.jar
>> >  beam-sdks-java-core-2.5.0.jar.asc
>> >  beam-sdks-java-core-2.5.0-javadoc.jar
>> >  beam-sdks-java-core-2.5.0-javadoc.jar.asc
>> >  beam-sdks-java-core-2.5.0.pom
>> >  beam-sdks-java-core-2.5.0.pom.asc
>> >  beam-sdks-java-core-2.5.0-sources.jar
>> >  beam-sdks-java-core-2.5.0-sources.jar.asc
>> >  beam-sdks-java-core-2.5.0-tests.jar
>> >  beam-sdks-java-core-2.5.0-tests.jar.asc
>> >  beam-sdks-java-core-2.5.0-test-sources.jar
>> >  beam-sdks-java-core-2.5.0-test-sources.jar.asc
>> >
>> > 1. The signatures are OK:
>> >
>> > gpg --verify beam-sdks-java-core-2.5.0.jar.asc beam-sdks-java-core-2.5.0.jar
>> > gpg: Signature made jeu. 24 mai 2018 16:55:11 CEST
>> > gpg: using RSA key 1AA8CF92D409A73393D0B736BFF2EE42C8282E76
>> > gpg: Good signature from "Jean-Baptiste Onofré
>> > <jbono...@apache.org>" [unknown]
>> >
>> > 2. The pom looks correct to me but it's not optimal because
>> >
>> > 2.1. There's no parent definition, so each pom duplicates the same
>> > configurations (like scm, license, etc)
>> > 2.2. There's no Maven plugin configuration, even if it's not
>> > used for
>> > the build, other tools can parse and use plugin configuration
>> > (like the
>> > source/target version, etc).
>> >
>> > So, even if it's not optimal, the pom looks overall good.
>> >
>> > I think it makes sense to move forward on the release as it is
>> > right now.
>> >
>> > If there's no objection, I will start the release process
>> during the
>> > week end.
>> >
>> > By the way, it would be good to verify that the Maven build is
>> still
>> > working. Ismaël and I fixed new issues on the Maven build.
>> > At some point, after the 2.5.0 release, we have to state to
>> > remove the
>> > Maven build (after a vote ;)).
>> >
>> > Thanks,
>> > Regards
>> > JB
>> >
>> >
>> > On 25/05/2018 01:34, Lukasz Cwik wrote:
>> > > The license inclusion issue that was brought up on the thread
>> > has been
>> > > resolved 

Re: Reducing Committer Load for Code Reviews

2018-05-30 Thread Udi Meiri
I thought this was the norm already? I have been the sole reviewer on a few
PRs by committers, and I'm only a contributor.

+1

On Wed, May 30, 2018 at 2:13 PM Kenneth Knowles  wrote:

> ++1
>
> This is good reasoning. If you trust someone with the committer
> responsibilities [1] you should trust them to find an appropriate reviewer.
>
> Also:
>
>  - adds a new way for non-committers and committers to bond
>  - makes committers seem less like gatekeepers because it goes both ways
>  - might help clear PR backlog, improving our community response latency
>  - encourages committers to code*
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/
>
> *With today's system, if a committer and a few non-committers are working
> together, then when the committer writes code it is harder to get it merged
> because it takes an extra committer. It is easier to have non-committers
> write all the code and the committer just does reviews. It is 1 committer
> vs 2 being involved. This used to be fine when almost everyone was a
> committer and all working on the core, but it is not fine any more.
>
> On Wed, May 30, 2018 at 12:50 PM Thomas Groh  wrote:
>
>> Hey all;
>>
>> I've been thinking recently about the process we have for committing
>> code, and our current process. I'd like to propose that we change our
>> current process to require at least one committer is present for each code
>> review, but remove the need to have a second committer review the code
>> prior to submission if the original contributor is a committer.
>>
>> Generally, if we trust someone with the ability to merge code that
>> someone else has written, I think it's sensible to also trust them to
>> choose a capable reviewer. We expect that all of the people that we have
>> recognized as committers will maintain the project's quality bar - and
>> that's true for both code they author and code they review. Given that, I
>> think it's sensible to expect a committer will choose a reviewer who is
>> versed in the component they are contributing to who can provide insight
>> and will also hold up the quality bar.
>>
>> Making this change will help spread the review load out among regular
>> contributors to the project, and reduce bottlenecks caused by committers
>> who have few other committers working on their same component. Obviously,
>> this requires that committers act with the best interests of the project
>> when they send out their code for reviews - but this is the behavior we
>> demand before someone is recognized as a committer, so I don't see why that
>> would be cause for concern.
>>
>> Yours,
>>
>> Thomas
>>
>


Re: Reducing Committer Load for Code Reviews

2018-05-30 Thread Kenneth Knowles
++1

This is good reasoning. If you trust someone with the committer
responsibilities [1] you should trust them to find an appropriate reviewer.

Also:

 - adds a new way for non-committers and committers to bond
 - makes committers seem less like gatekeepers because it goes both ways
 - might help clear PR backlog, improving our community response latency
 - encourages committers to code*

Kenn

[1] https://beam.apache.org/contribute/become-a-committer/

*With today's system, if a committer and a few non-committers are working
together, then when the committer writes code it is harder to get it merged
because it takes an extra committer. It is easier to have non-committers
write all the code and the committer just does reviews. It is 1 committer
vs 2 being involved. This used to be fine when almost everyone was a
committer and all working on the core, but it is not fine any more.

On Wed, May 30, 2018 at 12:50 PM Thomas Groh  wrote:

> Hey all;
>
> I've been thinking recently about the process we have for committing code,
> and our current process. I'd like to propose that we change our current
> process to require at least one committer is present for each code review,
> but remove the need to have a second committer review the code prior to
> submission if the original contributor is a committer.
>
> Generally, if we trust someone with the ability to merge code that someone
> else has written, I think it's sensible to also trust them to choose a
> capable reviewer. We expect that all of the people that we have recognized
> as committers will maintain the project's quality bar - and that's true for
> both code they author and code they review. Given that, I think it's
> sensible to expect a committer will choose a reviewer who is versed in the
> component they are contributing to who can provide insight and will also
> hold up the quality bar.
>
> Making this change will help spread the review load out among regular
> contributors to the project, and reduce bottlenecks caused by committers
> who have few other committers working on their same component. Obviously,
> this requires that committers act with the best interests of the project
> when they send out their code for reviews - but this is the behavior we
> demand before someone is recognized as a committer, so I don't see why that
> would be cause for concern.
>
> Yours,
>
> Thomas
>


parquet/beam

2018-05-30 Thread Austin Bennett
I can see great use cases with s3/Parquet - so that's a great addition
(which JB is addressing, for Java)!

It would be even more ideal for the use cases I find myself around for
there to be python parquet support, so for perhaps this next release:
Would it make sense to be exploring: https://arrow.apache.org ?  I'd be
happy to explore proper procedure for design/feature proposal and
documentation for Beam, how to scope and develop it.

Also, from the little I've looked at actual implementation, it appears that
(py)arrow relies on underlying C binaries, which was listed as a problem or
at least a point against choice of package with the developing python/kafka
source.  How big an issue is that -- what else should I be considering?
Guidance absolutely welcomed!


Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-30 Thread Ahmet Altay
Thank you JB.

For clarification, are you referring to the following items:
- RabbitMqIO - https://github.com/apache/beam/pull/1729
-  ParquetIO on HDFS/S3 - https://issues.apache.org/jira/browse/BEAM-4421

If the above mapping is correct, could we separate addition of new feature
from addressing blocking issues? I would propose that we do not block the
release for the former one and fix the latter one before the release.

On Tue, May 29, 2018 at 10:26 PM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> I would like to merge RabbitMqIO (we are doing the final touches) and we
> have an issue about ParquetIO on HDFS/S3 that I would like to
> investigate with the team.
>

Do you know who is currently investigating the ParquetIO issue? Do you need
help with that?


>
> I plan to start the release process asap, hopefully later today.
>
> Regards
> JB
>
> On 29/05/2018 23:00, Ahmet Altay wrote:
> > Thank you JB for the update. Could we start the release process now? Is
> > there any way I could help with moving the release forward?
> >
> > On Fri, May 25, 2018 at 8:19 AM, Lukasz Cwik wrote:
> >
> > Thanks for the update JB.
> >
> > Kenn, we have the post commit integration tests which run against
> > shaded artifacts like validates runner. We also have the nightly
> > snapshot and its verification run which validates the nightly
> > snapshot with DirectRunner / Dataflow / Apex / Spark / Flink for
> > WordCount and DirectRunner / Dataflow for the mobile gaming examples.
> >
> > I'm not sure about the IOs and whether the perfkit benchmark work
> > adequately covers them.
> >
> >
> > On Fri, May 25, 2018 at 1:28 AM Jean-Baptiste Onofré
> > <j...@nanthrax.net> wrote:
> >
> > Hi Luke,
> >
> > I tested the following build:
> >
> > ./gradlew publishToMavenLocal -PisRelease --no-parallel
> >
> > The artifacts are present in my .m2/repository.
> >
> > For instance, I can see:
> >
> > .m2/repository/org/apache/beam/beam-sdks-java-core/2.5.0$ ls -l
> > total 16256
> >  beam-sdks-java-core-2.5.0.jar
> >  beam-sdks-java-core-2.5.0.jar.asc
> >  beam-sdks-java-core-2.5.0-javadoc.jar
> >  beam-sdks-java-core-2.5.0-javadoc.jar.asc
> >  beam-sdks-java-core-2.5.0.pom
> >  beam-sdks-java-core-2.5.0.pom.asc
> >  beam-sdks-java-core-2.5.0-sources.jar
> >  beam-sdks-java-core-2.5.0-sources.jar.asc
> >  beam-sdks-java-core-2.5.0-tests.jar
> >  beam-sdks-java-core-2.5.0-tests.jar.asc
> >  beam-sdks-java-core-2.5.0-test-sources.jar
> >  beam-sdks-java-core-2.5.0-test-sources.jar.asc
> >
> > 1. The signatures are OK:
> >
> > gpg --verify beam-sdks-java-core-2.5.0.jar.asc beam-sdks-java-core-2.5.0.jar
> > gpg: Signature made jeu. 24 mai 2018 16:55:11 CEST
> > gpg:                using RSA key 1AA8CF92D409A73393D0B736BFF2EE42C8282E76
> > gpg: Good signature from "Jean-Baptiste Onofré <jbono...@apache.org>" [unknown]
> >
> > 2. The pom looks correct to me, but it's not optimal because:
> >
> > 2.1. There's no parent definition, so each pom duplicates the same
> > configuration (scm, license, etc.).
> > 2.2. There's no Maven plugin configuration. Even if it's not used for
> > the build, other tools can parse and use plugin configuration (such as
> > the source/target version).
> >
> > So, even if it's not optimal, the pom looks overall good.
> >
> > I think it makes sense to move forward on the release as it is
> > right now.
> >
> > If there's no objection, I will start the release process during the
> > weekend.
> >
> > By the way, it would be good to verify that the Maven build is still
> > working. Ismaël and I fixed new issues on the Maven build.
> > At some point, after the 2.5.0 release, we have to start removing the
> > Maven build (after a vote ;)).
> >
> > Thanks,
> > Regards
> > JB
> >
> >
> > On 25/05/2018 01:34, Lukasz Cwik wrote:
> > > > The license inclusion issue that was brought up on the thread has been
> > > > resolved: https://issues.apache.org/jira/browse/BEAM-4393
> > > >
> > > > JB, did you find any other release-related issues?
> > > >
> > > > On Fri, May 18, 2018 at 10:33 AM Lukasz Cwik wrote:
> > > >
> > > > I believe JB is referring to
> > > > https://issues.apache.org/jira/browse/BEAM-4060

Reducing Committer Load for Code Reviews

2018-05-30 Thread Thomas Groh
Hey all;

I've been thinking recently about our current process for committing code.
I'd like to propose that we change it to require that at least one
committer take part in each code review, but remove the need to have a
second committer review the code prior to submission if the original
contributor is a committer.

Generally, if we trust someone with the ability to merge code that someone
else has written, I think it's sensible to also trust them to choose a
capable reviewer. We expect that all of the people that we have recognized
as committers will maintain the project's quality bar - and that's true for
both code they author and code they review. Given that, I think it's
sensible to expect a committer will choose a reviewer who is versed in the
component they are contributing to who can provide insight and will also
hold up the quality bar.

Making this change will help spread the review load out among regular
contributors to the project, and reduce bottlenecks caused by committers
who have few other committers working on their same component. Obviously,
this requires that committers act with the best interests of the project
when they send out their code for reviews - but this is the behavior we
demand before someone is recognized as a committer, so I don't see why that
would be cause for concern.

Yours,

Thomas


Re: Hello Beam!

2018-05-30 Thread Lukasz Cwik
You'll want to take a look at JdbcIO, there is an example of how to use it
in the Javadoc:
https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/beam/sdk/io/jdbc/JdbcIO.html

On Wed, May 30, 2018 at 10:52 AM arun kumar  wrote:

> Thank you for response.
>
> Can you please share how we can connect to a Postgres database?
>
> Please share any examples you have.
>
> Thanks
> Arun
>
>
>
> On Wed, May 30, 2018, 8:09 PM Lukasz Cwik  wrote:
>
>> Arun, it would be best to check out one of the quickstarts (Java, Python,
>> Go) (https://beam.apache.org/get-started/beam-overview/) and when you
>> have questions ask them on u...@beam.apache.org
>>
>> On Wed, May 30, 2018 at 5:32 AM arun kumar  wrote:
>>
>>> Hi All,
>>>
>>> Thank you for adding me to the group. I am interested in Apache Beam
>>> with the Google Cloud runner.
>>>
>>> I need to get started with Apache Beam on the Google Cloud runner.
>>>
>>> Please tell me where I should start, and whether anyone has simple code
>>> for my requirement.
>>>
>>> Thanks
>>> Arunkumar
>>>
>>> On Wed, May 30, 2018, 10:55 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>>> Welcome !
>>>>
>>>> Looking forward to work and discuss with you ;)
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 29/05/2018 23:49, Rui Wang wrote:
>>>> > Hi there,
>>>> >
>>>> > I am Rui (pronounced as same as "Ray")!
>>>> >
>>>> > I recently joined Google Cloud. Beam is a very interesting project
>>>> > and I cannot wait to contribute to it!
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Rui
>>>>
>>>> --
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbono...@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>


Re: Hello Beam!

2018-05-30 Thread arun kumar
Thank you for response.

Can you please share how we can connect to a Postgres database?

Please share any examples you have.

Thanks
Arun



On Wed, May 30, 2018, 8:09 PM Lukasz Cwik  wrote:

> Arun, it would be best to check out one of the quickstarts (Java, Python,
> Go) (https://beam.apache.org/get-started/beam-overview/) and when you
> have questions ask them on u...@beam.apache.org
>
> On Wed, May 30, 2018 at 5:32 AM arun kumar  wrote:
>
>> Hi All,
>>
>> Thank you for adding me to the group. I am interested in Apache Beam with
>> the Google Cloud runner.
>>
>> I need to get started with Apache Beam on the Google Cloud runner.
>>
>> Please tell me where I should start, and whether anyone has simple code
>> for my requirement.
>>
>> Thanks
>> Arunkumar
>>
>> On Wed, May 30, 2018, 10:55 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Welcome !
>>>
>>> Looking forward to work and discuss with you ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 29/05/2018 23:49, Rui Wang wrote:
>>> > Hi there,
>>> >
>>> > I am Rui (pronounced as same as "Ray")!
>>> >
>>> > I recently joined Google Cloud. Beam is a very interesting project and
>>> I
>>> > cannot wait to contribute to it!
>>> >
>>> >
>>> > Thanks,
>>> > Rui
>>>
>>> --
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Jenkins build is back to normal : beam_SeedJob #1816

2018-05-30 Thread Apache Jenkins Server
See 



Build failed in Jenkins: beam_SeedJob #1815

2018-05-30 Thread Apache Jenkins Server
See 

--
GitHub pull request #5406 of commit c5af2747e7720891bfb421ea71db3233f8676edf, no merge conflicts.
Setting status of c5af2747e7720891bfb421ea71db3233f8676edf to PENDING with url https://builds.apache.org/job/beam_SeedJob/1815/ and message: 'Build started sha1 is merged.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam12 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git +refs/heads/*:refs/remotes/origin/* +refs/pull/5406/*:refs/remotes/origin/pr/5406/*
 > git rev-parse refs/remotes/origin/pr/5406/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/5406/merge^{commit} # timeout=10
Checking out Revision 04708c3aacfb6a0ec58056f37396549becf6555b (refs/remotes/origin/pr/5406/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 04708c3aacfb6a0ec58056f37396549becf6555b
Commit message: "Merge c5af2747e7720891bfb421ea71db3233f8676edf into 856d842e7700bd26b6292f9faf7db9ddd85a55fc"
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
ERROR: (job_Dependency_Check.groovy, line 45) No signature of method: javaposse.jobdsl.dsl.helpers.step.StepContext.shell() is applicable for argument types: (java.lang.Boolean) values: [true]
Possible solutions: shell(java.lang.String), xShell(groovy.lang.Closure), sleep(long), grep(), every(), dsl(groovy.lang.Closure)



Re: [DISCUSS] Remove findbugs from sdks/java

2018-05-30 Thread Kenneth Knowles
Awesome!

In the meantime I've tried out Gradle + Checker and unfortunately
compilation hung. It could be due to any subset of Gradle, Checker,
Errorprone. I would not expect a performance problem, since Checker is
"pluggable type systems" and type checking is a very fast sort of analysis.
Also I didn't immediately find the configuration to ignore generated code.
I didn't have a lot of time so I didn't dig further.

Kenn

On Wed, May 30, 2018 at 9:28 AM Pablo Estrada  wrote:

> Thank you guys : D
>
> On Wed, May 30, 2018 at 9:20 AM Scott Wegner  wrote:
>
>> Sorry to revive an old thread, but I wanted to give a shout-out and say
>> thank you to Ismaël and Tim who have been quickly chipping away at the
>> ErrorProne backlog. We started with 47 ErrorProne JIRA's [1], and in two
>> weeks we're down to just 17 [2]. Thanks!
>>
>> [1]
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20errorprone
>> [2]
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20labels%20%3D%20errorprone
>>
>>
>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>
>>> As part of the error-prone effort Tim has also been cleaning up other static
>>> analysis warnings as reported by IntelliJ's Inspect -> Analyze code. I
>>> think this is a good moment to grok some of those too, e.g. scoping,
>>> unused variables, redundancies, etc. So I hope the others taking part in
>>> this work try to tackle a chunk of those as well.
>>>
>>> Extra note: IntelliJ's suggestions should of course be weighed with some
>>> judgment; there are always false positives or undesirable changes.
>>>
>>> On Fri, May 18, 2018 at 7:56 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> > Thanks Tim.
>>>
>>> > I think we will be able to remove findbugs after some run/check using
>>> ErrorProne and see the gaps.
>>>
>>> > Regards
>>> > JB
>>> > Le 18 mai 2018, à 07:49, Tim Robertson  a
>>> écrit:
>>>
>>> >> Thank you all.
>>>
>>> >> I think this is clear.  Removing findbugs can happen at a future
>>> point.
>>>
>>> >> @Scott - I've been working through the java IO error prone issues
>>> (some
>>> already merged, some with open PRs now) so will take those IO Jiras. I
>>> will
>>> enable failOnWarning, address dependency issues for findbugs and tackle
>>> the
>>> error prone warnings.
>>>
>>>
>>> >> On Fri, May 18, 2018 at 1:07 AM, Scott Wegner 
>>> wrote:
>>>
>>> >>> +0.02173913
>>>
>>> >>> I'm happy to replace FindBugs with ErrorProne, but we need to first
>>> upgrade ErrorProne analyzer warnings to errors. Currently the codebase is
>>> full of warning spam, and there's no enforcement preventing future
>>> violations from being added.
>>>
>>> >>> I've done the work for enforcing ErrorProne analysis on java-sdk-core
>>> [1], and I've sharded out the rest of the Java components in JIRA issues
>>> [2] (45 total).  Fixing the issues is relatively straightforward, and
>>> I've
>>> tried to provide enough guidance to make them as starter tasks (example:
>>> [3]). Teng Peng has already started on Spark [4] (thanks!)
>>>
>>> >>> [1] https://github.com/apache/beam/pull/5319
>>> >>> [2]
>>>
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20errorprone
>>> >>> [3] https://issues.apache.org/jira/browse/BEAM-4347
>>> >>> [4] https://issues.apache.org/jira/browse/BEAM-4318
>>>
>>> >>> On Thu, May 17, 2018 at 2:00 PM Ismaël Mejía 
>>> wrote:
>>>
>>>  +0.7 also. Findbugs support for more recent versions of Java is
>>> lacking and
>>>  the maintenance seems frozen in time.
>>>
>>>  As a possible plan b can we identify the missing important
>>> validations
>>> to
>>>  identify how much we lose and if it is considerable, maybe we can
>>> create a
>>>  minimal configuration for those, and eventually migrate from
>>> findbugs
>>> to
>>>  spotbugs (https://github.com/spotbugs/spotbugs/) that seems at
>>> least
>>> to be
>>>  maintained and the most active findbugs fork.
>>>
>>>
>>>  On Thu, May 17, 2018 at 9:31 PM Kenneth Knowles 
>>> wrote:
>>>
>>>  > +0.7 I think we should work to remove findbugs. Errorprone covers
>>> most of
>>>  the same stuff but better and faster.
>>>
>>>  > The one thing I'm not sure about is nullness analysis. Findbugs
>>> has
>>> some
>>>  serious limitations there but it really improves code quality and
>>> prevents
>>>  blunders. I'm not sure errorprone covers that. I know the Checker
>>> analyzer
>>>  has a full solution that makes NPE impossible as in most modern
>>> languages.
>>>  Maybe that is easy to plug in. The core Java SDK is a good candidate
>>> for
>>>  the first place to do it since it affects everything else.
>>>
>>>  > On Thu, May 17, 2018 at 12:02 PM Tim Robertson <
>>> timrobertson...@gmail.com>
>>>  

Re: [DISCUSS] Remove findbugs from sdks/java

2018-05-30 Thread Pablo Estrada
Thank you guys : D

On Wed, May 30, 2018 at 9:20 AM Scott Wegner  wrote:

> Sorry to revive an old thread, but I wanted to give a shout-out and say
> thank you to Ismaël and Tim who have been quickly chipping away at the
> ErrorProne backlog. We started with 47 ErrorProne JIRA's [1], and in two
> weeks we're down to just 17 [2]. Thanks!
>
> [1]
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20errorprone
> [2]
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20labels%20%3D%20errorprone
>
>
> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>
>> As part of the error-prone effort Tim has also been cleaning up other static
>> analysis warnings as reported by IntelliJ's Inspect -> Analyze code. I
>> think this is a good moment to grok some of those too, e.g. scoping, unused
>> variables, redundancies, etc. So I hope the others taking part in this work
>> try to tackle a chunk of those as well.
>>
>> Extra note: IntelliJ's suggestions should of course be weighed with some
>> judgment; there are always false positives or undesirable changes.
>>
>> On Fri, May 18, 2018 at 7:56 AM Jean-Baptiste Onofré 
>> wrote:
>>
>> > Thanks Tim.
>>
>> > I think we will be able to remove findbugs after some run/check using
>> ErrorProne and see the gaps.
>>
>> > Regards
>> > JB
>> > Le 18 mai 2018, à 07:49, Tim Robertson  a
>> écrit:
>>
>> >> Thank you all.
>>
>> >> I think this is clear.  Removing findbugs can happen at a future point.
>>
>> >> @Scott - I've been working through the java IO error prone issues (some
>> already merged, some with open PRs now) so will take those IO Jiras. I
>> will
>> enable failOnWarning, address dependency issues for findbugs and tackle
>> the
>> error prone warnings.
>>
>>
>> >> On Fri, May 18, 2018 at 1:07 AM, Scott Wegner 
>> wrote:
>>
>> >>> +0.02173913
>>
>> >>> I'm happy to replace FindBugs with ErrorProne, but we need to first
>> upgrade ErrorProne analyzer warnings to errors. Currently the codebase is
>> full of warning spam, and there's no enforcement preventing future
>> violations from being added.
>>
>> >>> I've done the work for enforcing ErrorProne analysis on java-sdk-core
>> [1], and I've sharded out the rest of the Java components in JIRA issues
>> [2] (45 total).  Fixing the issues is relatively straightforward, and I've
>> tried to provide enough guidance to make them as starter tasks (example:
>> [3]). Teng Peng has already started on Spark [4] (thanks!)
>>
>> >>> [1] https://github.com/apache/beam/pull/5319
>> >>> [2]
>>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20errorprone
>> >>> [3] https://issues.apache.org/jira/browse/BEAM-4347
>> >>> [4] https://issues.apache.org/jira/browse/BEAM-4318
>>
>> >>> On Thu, May 17, 2018 at 2:00 PM Ismaël Mejía 
>> wrote:
>>
>>  +0.7 also. Findbugs support for more recent versions of Java is
>> lacking and
>>  the maintenance seems frozen in time.
>>
>>  As a possible plan b can we identify the missing important
>> validations
>> to
>>  identify how much we lose and if it is considerable, maybe we can
>> create a
>>  minimal configuration for those, and eventually migrate from findbugs
>> to
>>  spotbugs (https://github.com/spotbugs/spotbugs/) that seems at least
>> to be
>>  maintained and the most active findbugs fork.
>>
>>
>>  On Thu, May 17, 2018 at 9:31 PM Kenneth Knowles 
>> wrote:
>>
>>  > +0.7 I think we should work to remove findbugs. Errorprone covers
>> most of
>>  the same stuff but better and faster.
>>
>>  > The one thing I'm not sure about is nullness analysis. Findbugs has
>> some
>>  serious limitations there but it really improves code quality and
>> prevents
>>  blunders. I'm not sure errorprone covers that. I know the Checker
>> analyzer
>>  has a full solution that makes NPE impossible as in most modern
>> languages.
>>  Maybe that is easy to plug in. The core Java SDK is a good candidate
>> for
>>  the first place to do it since it affects everything else.
>>
>>  > On Thu, May 17, 2018 at 12:02 PM Tim Robertson <
>> timrobertson...@gmail.com>
>>  wrote:
>>
>>  >> Hi all,
>>  >> [bringing a side thread discussion from slack to here]
>>
>>  >> We're tackling error-prone warnings now and we aim to fail the
>> build on
>>  warnings raised [1].
>>
>>  >> Enabling failOnWarning also fails the build on findbugs warnings.
>>  Currently I see places where these  arise from missing a dependency
>> on
>>  findbugs_annotations and I asked on slack the best way to introduce
>> this
>>  globally in gradle.
>>
>>  >> In that discussion the idea was floated to consider removing
>> findbugs
>>  completely given it is older, has licensing 

Re: GroupByKey with sorted values within key

2018-05-30 Thread Lukasz Cwik
To add some more context, not all Runners have to be Java based and also
not all shuffle implementations want to execute untrusted Java code which
is why the contract is around using a secondary key with lexicographically
ordered bytes.
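
To make the "lexicographically ordered bytes" contract concrete, here is a
minimal plain-Java sketch. It does not use any Beam API: the class and method
names are hypothetical, and the sign-flipped big-endian encoding is just one
assumed way to make byte order agree with numeric order for ints. The point
is only that a shuffle which compares raw bytes can sort correctly without
running any user-supplied comparator.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Beam.
final class SecondaryKeyOrder {

  /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
  static int compareLexicographic(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // A shorter array that is a prefix of the longer one sorts first.
    return Integer.compare(a.length, b.length);
  }

  /** Encodes an int as 4 big-endian bytes with the sign bit flipped, so byte order matches numeric order. */
  static byte[] encodeOrdered(int value) {
    return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
  }

  /** Inverse of encodeOrdered. */
  static int decodeOrdered(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
  }

  /** Sorts ints by comparing only their encoded bytes, as a bytes-based shuffle would. */
  static List<Integer> sortViaBytes(List<Integer> values) {
    List<byte[]> encoded = new ArrayList<>();
    for (int v : values) {
      encoded.add(encodeOrdered(v));
    }
    encoded.sort(SecondaryKeyOrder::compareLexicographic);
    List<Integer> sorted = new ArrayList<>();
    for (byte[] e : encoded) {
      sorted.add(decodeOrdered(e));
    }
    return sorted;
  }

  public static void main(String[] args) {
    System.out.println(sortViaBytes(Arrays.asList(3, -1, 256, 0, -70000)));
  }
}
```

The trade-off is that the ordering has to be baked into the encoding of the
secondary key, which is what makes a custom value comparator awkward under
this contract.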

I don't remember the original concerns as to why SortValues was added as an
extension vs being part of SDK core. It may have had something to do with
scalability across runners, or lack of javadoc/examples/tests/... but most
likely lack of time for someone to polish it and include it in the core.
Regardless, I think it would be worthwhile to have that discussion again
and whether it should be merged into SDK core. If this thread doesn't get
enough visibility for this specific topic, it would be prudent to start a
specific thread about it linking back to this.

On Wed, May 30, 2018 at 9:09 AM Kenneth Knowles  wrote:

> The sorting by bytes is a deliberate limitation of this particular
> approach. It basically assumes you are using bytes-based shuffle under the
> hood, so invoking a language-specific comparator would be something new. I
> know +Ben had some ideas about this.
>
> Kenn
>
> On Wed, May 30, 2018 at 8:53 AM David Morávek 
> wrote:
>
>> Thanks for pointing us in the right direction. We'll try to prototype a custom
>> translation for the Spark runner within the next sprint. In order to do so, I
>> have a few questions:
>>
>> 1) Should we move the SortValues transform to beam-sdks-java-core or just add
>> it as a Spark runner dependency?
>> 2) I think we should try to make SortValues more flexible by letting the user
>> provide a custom value comparator; sorting lexicographically by secondary
>> key may be painful in some use cases. What do you think?
>>
>> Side note:
>> I agree that usually the top N values that fit in memory are sufficient and
>> we can combine them using a PQ, but in practice we still have pipelines that
>> need to do top-N selection over data that does not fit in memory for a single
>> key.
>>
>> D.
>>
>> On Wed, May 30, 2018 at 5:28 PM, Lukasz Cwik  wrote:
>>
>>> SortValues does not have a defined & documented URN yet. Once a Runner
>>> is providing such an override, it will happen. No runner publicly provides
>>> one to my knowledge.
>>>
>>> On Wed, May 30, 2018 at 8:08 AM Kenneth Knowles  wrote:
>>>
>>>> I can see a few usability issues here. Totally agree w/ Luke, just
>>>> noting:
>>>>
>>>>  - The naming is slightly misleading because SortValues is actually
>>>> already GBK+SortValues.
>>>>  - It also makes things look less supported when they are in the
>>>> extensions/ folder. I'd say we should have a better place to put such a
>>>> library if it is the official public implementation. The word "extensions"
>>>> doesn't seem particularly accurate or meaningful to me.
>>>>
>>>> Q: Does SortValues have a defined & documented URN yet?
>>>>
>>>> Kenn
>>>>
>>>> On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik  wrote:
>>>>
> Each runner can choose to override the SortValues PTransform with
> their own internal offering. For example Spark overrides global combine[1]
> during pipeline translation. If Spark detected the SortValues PTransform
> during translation, it could override the offering with something that 
> used
> repartitionAndSortWithinPartitions.
>
> GroupByKeyAndSortValuesOnly inside Dataflow exists to support a
> specific use case. Users should rely on SortValues as it is the public
> implementation for sorting.
>
> 1:
> https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
>
> As a side note, it's uncommon to need to sort all values;
> usually the top 100 suffices and can be implemented much more efficiently with
> a combiner when compared to sorting.
>
> On Wed, May 30, 2018 at 3:38 AM  wrote:
>
>> Hi,
>> I have a question. I am trying to translate dsl-euphoria's
>> “GroupByKey with sorted values within key” to Beam. I am aware of the Java
>> SDK extension SortValues, but it doesn’t provide a sufficient abstraction for
>> runners.
>>
>> I noticed that in DataflowRunner there is a translation of batch
>> GroupByKey to GroupByKeyAndSortValuesOnly. Has it been considered to have it
>> in Beam core, so that for example SparkRunner could translate “GroupByKey with
>> sorted values within key” using its internals such as
>> repartitionAndSortWithinPartitions?
>>
>> Thank you.
>> Marek Simunek
>>
>
>>


Re: [DISCUSS] Remove findbugs from sdks/java

2018-05-30 Thread Scott Wegner
Sorry to revive an old thread, but I wanted to give a shout-out and say
thank you to Ismaël and Tim who have been quickly chipping away at the
ErrorProne backlog. We started with 47 ErrorProne JIRA's [1], and in two
weeks we're down to just 17 [2]. Thanks!

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20errorprone
[2]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20labels%20%3D%20errorprone


On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:

> As part of the error-prone effort Tim has also been cleaning up other static
> analysis warnings as reported by IntelliJ's Inspect -> Analyze code. I
> think this is a good moment to grok some of those too, e.g. scoping, unused
> variables, redundancies, etc. So I hope the others taking part in this work
> try to tackle a chunk of those as well.
>
> Extra note: IntelliJ's suggestions should of course be weighed with some
> judgment; there are always false positives or undesirable changes.
>
> On Fri, May 18, 2018 at 7:56 AM Jean-Baptiste Onofré 
> wrote:
>
> > Thanks Tim.
>
> > I think we will be able to remove findbugs after some run/check using
> ErrorProne and see the gaps.
>
> > Regards
> > JB
> > Le 18 mai 2018, à 07:49, Tim Robertson  a
> écrit:
>
> >> Thank you all.
>
> >> I think this is clear.  Removing findbugs can happen at a future point.
>
> >> @Scott - I've been working through the java IO error prone issues (some
> already merged, some with open PRs now) so will take those IO Jiras. I will
> enable failOnWarning, address dependency issues for findbugs and tackle the
> error prone warnings.
>
>
> >> On Fri, May 18, 2018 at 1:07 AM, Scott Wegner 
> wrote:
>
> >>> +0.02173913
>
> >>> I'm happy to replace FindBugs with ErrorProne, but we need to first
> upgrade ErrorProne analyzer warnings to errors. Currently the codebase is
> full of warning spam, and there's no enforcement preventing future
> violations from being added.
>
> >>> I've done the work for enforcing ErrorProne analysis on java-sdk-core
> [1], and I've sharded out the rest of the Java components in JIRA issues
> [2] (45 total).  Fixing the issues is relatively straightforward, and I've
> tried to provide enough guidance to make them as starter tasks (example:
> [3]). Teng Peng has already started on Spark [4] (thanks!)
>
> >>> [1] https://github.com/apache/beam/pull/5319
> >>> [2]
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20errorprone
> >>> [3] https://issues.apache.org/jira/browse/BEAM-4347
> >>> [4] https://issues.apache.org/jira/browse/BEAM-4318
>
> >>> On Thu, May 17, 2018 at 2:00 PM Ismaël Mejía 
> wrote:
>
>  +0.7 also. Findbugs support for more recent versions of Java is
> lacking and
>  the maintenance seems frozen in time.
>
>  As a possible plan b can we identify the missing important validations
> to
>  identify how much we lose and if it is considerable, maybe we can
> create a
>  minimal configuration for those, and eventually migrate from findbugs
> to
>  spotbugs (https://github.com/spotbugs/spotbugs/) that seems at least
> to be
>  maintained and the most active findbugs fork.
>
>
>  On Thu, May 17, 2018 at 9:31 PM Kenneth Knowles 
> wrote:
>
>  > +0.7 I think we should work to remove findbugs. Errorprone covers
> most of
>  the same stuff but better and faster.
>
>  > The one thing I'm not sure about is nullness analysis. Findbugs has
> some
>  serious limitations there but it really improves code quality and
> prevents
>  blunders. I'm not sure errorprone covers that. I know the Checker
> analyzer
>  has a full solution that makes NPE impossible as in most modern
> languages.
>  Maybe that is easy to plug in. The core Java SDK is a good candidate
> for
>  the first place to do it since it affects everything else.
>
>  > On Thu, May 17, 2018 at 12:02 PM Tim Robertson <
> timrobertson...@gmail.com>
>  wrote:
>
>  >> Hi all,
>  >> [bringing a side thread discussion from slack to here]
>
>  >> We're tackling error-prone warnings now and we aim to fail the
> build on
>  warnings raised [1].
>
>  >> Enabling failOnWarning also fails the build on findbugs warnings.
>  Currently I see places where these  arise from missing a dependency on
>  findbugs_annotations and I asked on slack the best way to introduce
> this
>  globally in gradle.
>
>  >> In that discussion the idea was floated to consider removing
> findbugs
>  completely given it is older, has licensing considerations and is not
>  released regularly.
>
>  >> What do people think about this idea please?
>
>  >> Thanks,
>  >> Tim
>  >> [1]
>
>
> 

Re: [SQL] Unsupported features

2018-05-30 Thread Kenneth Knowles
This is extremely useful. Thanks for putting so much information together!

Kenn

On Wed, May 30, 2018 at 8:19 AM Kai Jiang  wrote:

> Hi all,
>
> Based on pull/5481, I manually ran a coverage test with TPC-DS queries
> (65%) and TPC-H queries (100%) to see which features Beam SQL currently
> does not support. The test ran on DirectRunner.
>
> I want to share the result: TPC-DS queries on Beam
>
> TL;DR:
>
>1. aggregation function (stddev) missing or calculation of aggregation
>functions combination.
>2. nested beamjoinrel(condition=[true], joinType=[inner]) / cross join
>error
>3. date type casting/ calculation and other types casting.
>4. LIKE operator in String / alias for substring function
>5. order by w/o limit clause.
>6. OR operator is supported in join condition
>7. Syntax: exist/ not exist (errors) .rank() over (partition by) /
>view (unsupported)
>
>
> Best,
> Kai
>


Re: GroupByKey with sorted values within key

2018-05-30 Thread Kenneth Knowles
The sorting by bytes is a deliberate limitation of this particular
approach. It basically assumes you are using bytes-based shuffle under the
hood, so invoking a language-specific comparator would be something new. I
know +Ben had some ideas about this.

Kenn

On Wed, May 30, 2018 at 8:53 AM David Morávek 
wrote:

> Thanks for pointing us in the right direction. We'll try to prototype a custom
> translation for the Spark runner within the next sprint. In order to do so, I
> have a few questions:
>
> 1) Should we move the SortValues transform to beam-sdks-java-core or just add
> it as a Spark runner dependency?
> 2) I think we should try to make SortValues more flexible by letting the user
> provide a custom value comparator; sorting lexicographically by secondary
> key may be painful in some use cases. What do you think?
>
> Side note:
> I agree that usually the top N values that fit in memory are sufficient and
> we can combine them using a PQ, but in practice we still have pipelines that
> need to do top-N selection over data that does not fit in memory for a single
> key.
>
> D.
>
> On Wed, May 30, 2018 at 5:28 PM, Lukasz Cwik  wrote:
>
>> SortValues does not have a defined & documented URN yet. Once a Runner is
>> providing such an override, it will happen. No runner publicly provides one
>> to my knowledge.
>>
>> On Wed, May 30, 2018 at 8:08 AM Kenneth Knowles  wrote:
>>
>>> I can see a few usability issues here. Totally agree w/ Luke, just
>>> noting:
>>>
>>>  - The naming is slightly misleading because SortValues is actually
>>> already GBK+SortValues.
>>>  - It also makes things look less supported when they are in the
>>> extensions/ folder. I'd say we should have a better place to put such a
>>> library if it is the official public implementation. The word "extensions"
>>> doesn't seem particularly accurate or meaningful to me.
>>>
>>> Q: Does SortValues have a defined & documented URN yet?
>>>
>>> Kenn
>>>
>>> On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik  wrote:
>>>
>>>> Each runner can choose to override the SortValues PTransform with their
>>>> own internal offering. For example Spark overrides global combine[1] during
>>>> pipeline translation. If Spark detected the SortValues PTransform during
>>>> translation, it could override the offering with something that used
>>>> repartitionAndSortWithinPartitions.
>>>>
>>>> GroupByKeyAndSortValuesOnly inside Dataflow exists to support a
>>>> specific use case. Users should rely on SortValues as it is the public
>>>> implementation for sorting.
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
>>>>
>>>> As a side note, it's uncommon to need to sort all values; usually the
>>>> top 100 suffices and can be implemented much more efficiently with a
>>>> combiner when compared to sorting.
>>>>
>>>> On Wed, May 30, 2018 at 3:38 AM  wrote:

> Hi,
> I have a question. I am trying to translate dsl-euphoria's
> “GroupByKey with sorted values within key” to Beam. I am aware of the Java
> SDK extension SortValues, but it doesn’t provide a sufficient abstraction for
> runners.
>
> I noticed that in DataflowRunner there is a translation of batch
> GroupByKey to GroupByKeyAndSortValuesOnly. Has it been considered to have it
> in Beam core, so that for example SparkRunner could translate “GroupByKey with
> sorted values within key” using its internals such as
> repartitionAndSortWithinPartitions?
>
> Thank you.
> Marek Simunek
>

>


[SQL] Unsupported features

2018-05-30 Thread Kai Jiang
Hi all,

Based on pull/5481, I manually ran a coverage test with TPC-DS queries (65%)
and TPC-H queries (100%) to see which features Beam SQL does not currently
support. The test ran on DirectRunner.

I want to share the result:
TPC-DS queries on Beam

TL;DR:

   1. Missing aggregation function (stddev), or calculations that combine
   aggregation functions.
   2. Nested beamjoinrel(condition=[true], joinType=[inner]) / cross join
   errors.
   3. Date type casting/calculation and casting of other types.
   4. LIKE operator on strings / alias for the substring function.
   5. ORDER BY without a LIMIT clause.
   6. OR operator is supported in join condition
   7. Syntax: EXISTS / NOT EXISTS (errors); RANK() OVER (PARTITION BY) /
   views (unsupported).


Best,
Kai


Re: GroupByKey with sorted values within key

2018-05-30 Thread Kenneth Knowles
I can see a few usability issues here. Totally agree w/ Luke, just noting:

 - The naming is slightly misleading because SortValues is actually already
GBK+SortValues.
 - It also makes things look less supported when they are in the
extensions/ folder. I'd say we should have a better place to put such a
library if it is the official public implementation. The word "extensions"
doesn't seem particularly accurate or meaningful to me.

Q: Does SortValues have a defined & documented URN yet?

Kenn

On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik  wrote:

> Each runner can choose to override the SortValues PTransform with their
> own internal offering. For example Spark overrides global combine[1] during
> pipeline translation. If Spark detected the SortValues PTransform during
> translation, it could override the offering with something that used
> repartitionAndSortWithinPartitions.
>
> GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific
> use case. Users should rely on SortValues as it is the public
> implementation for sorting.
>
> 1:
> https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
>
> As a side note, it's uncommon to need to sort all values; usually the
> top 100 suffices and can be implemented much more efficiently with a
> combiner than with sorting.
>
> On Wed, May 30, 2018 at 3:38 AM  wrote:
>
>> Hi,
>> I have a question. In dsl-euphoria I am trying to translate “GroupByKey
>> with sorted values within key” to Beam. I am aware of the Java SDK
>> extension SortValues, but it doesn’t provide a sufficient abstraction for
>> runners.
>>
>> I noticed that DataflowRunner translates batch GroupByKey to
>> GroupByKeyAndSortValuesOnly. Is there any plan to have this in Beam core,
>> so that, for example, SparkRunner could translate “GroupByKey with sorted
>> values within key” using its own internals such as
>> repartitionAndSortWithinPartitions?
>>
>> Thank you.
>> Marek Simunek
>>
>


Re: GroupByKey with sorted values within key

2018-05-30 Thread Lukasz Cwik
Each runner can choose to override the SortValues PTransform with their own
internal offering. For example Spark overrides global combine[1] during
pipeline translation. If Spark detected the SortValues PTransform during
translation, it could override the offering with something that used
repartitionAndSortWithinPartitions.
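
To make the override idea concrete, here is a minimal sketch in plain Python (not Beam code). The `Runner` class, its `overrides` mapping, and both sort functions are hypothetical names used only for illustration; a real runner would do this inside its pipeline translator.

```python
def default_sort_values(grouped):
    """Portable fallback: sort each key's values in memory."""
    return {key: sorted(values) for key, values in grouped.items()}


def native_sort_values(grouped):
    """Stand-in for a runner-native sort (for example one done in the shuffle).

    It produces the same result as the fallback; only the execution
    strategy would differ in a real runner.
    """
    return {key: sorted(values) for key, values in grouped.items()}


class Runner:
    def __init__(self, overrides=None):
        # Maps a transform name to the runner's own implementation.
        self.overrides = overrides or {}

    def translate(self, name, default_impl):
        # Use the runner-specific implementation when one is registered,
        # otherwise keep the portable one shipped with the SDK.
        return self.overrides.get(name, default_impl)
```

A runner with no entry for "SortValues" keeps the portable implementation, so user code that applies SortValues stays the same either way.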

GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific
use case. Users should rely on SortValues as it is the public
implementation for sorting.

1:
https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200

As a side note, it's uncommon to need to sort all values; usually the top
100 suffices and can be implemented much more efficiently with a combiner
than with sorting.
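
A rough sketch of why the combiner approach is cheaper, in plain Python rather than Beam: each accumulator holds at most n values, so partial results from different workers stay small and can be merged. The method names mirror those of Beam's CombineFn, but `TopN` and `top_n_sharded` are illustrative names, not Beam API.

```python
import heapq


class TopN:
    """Combiner-style top-N: each accumulator keeps at most n values (n >= 1)."""

    def __init__(self, n):
        self.n = n

    def create_accumulator(self):
        return []  # min-heap holding the n largest values seen so far

    def add_input(self, acc, value):
        if len(acc) < self.n:
            heapq.heappush(acc, value)
        elif value > acc[0]:
            heapq.heapreplace(acc, value)  # evict the current smallest
        return acc

    def merge_accumulators(self, accs):
        merged = self.create_accumulator()
        for acc in accs:
            for value in acc:
                self.add_input(merged, value)
        return merged

    def extract_output(self, acc):
        return sorted(acc, reverse=True)


def top_n_sharded(values, n, shards=4):
    """Simulates partial combining on several workers, then a final merge."""
    combiner = TopN(n)
    accs = [combiner.create_accumulator() for _ in range(shards)]
    for i, value in enumerate(values):
        combiner.add_input(accs[i % shards], value)
    return combiner.extract_output(combiner.merge_accumulators(accs))
```

Memory per key is bounded by n times the number of partial accumulators, whereas a full sort needs every value for the key.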

On Wed, May 30, 2018 at 3:38 AM  wrote:

> Hi,
> I have a question. In dsl-euphoria I am trying to translate “GroupByKey
> with sorted values within key” to Beam. I am aware of the Java SDK
> extension SortValues, but it doesn’t provide a sufficient abstraction for
> runners.
>
> I noticed that DataflowRunner translates batch GroupByKey to
> GroupByKeyAndSortValuesOnly. Is there any plan to have this in Beam core,
> so that, for example, SparkRunner could translate “GroupByKey with sorted
> values within key” using its own internals such as
> repartitionAndSortWithinPartitions?
>
> Thank you.
> Marek Simunek
>


Re: Hello Beam!

2018-05-30 Thread Lukasz Cwik
Arun, it would be best to check out one of the quickstarts (Java, Python,
Go) (https://beam.apache.org/get-started/beam-overview/) and, when you have
questions, ask them on u...@beam.apache.org

On Wed, May 30, 2018 at 5:32 AM arun kumar  wrote:

> Hi All,
>
> Thank you for adding me to the group. I am interested in Apache Beam with
> the Google Cloud runner.
>
> I need to get started on Apache Beam with the Google Cloud runner.
>
> Please help me with where I should start, and let me know if anyone has
> simple code for my requirement.
>
> Thanks
> Arunkumar
>
> On Wed, May 30, 2018, 10:55 AM Jean-Baptiste Onofré 
> wrote:
>
>> Welcome !
>>
>> Looking forward to work and discuss with you ;)
>>
>> Regards
>> JB
>>
>> On 29/05/2018 23:49, Rui Wang wrote:
>> > Hi there,
>> >
>> > I am Rui (pronounced as same as "Ray")!
>> >
>> > I recently joined Google Cloud. Beam is a very interesting project and I
>> > cannot wait to contribute to it!
>> >
>> >
>> > Thanks,
>> > Rui
>>
>> --
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: [PROPOSAL] CI improvement: be able to run the IT of the IOs from github pull request

2018-05-30 Thread Łukasz Gajowy
+1 to generalizing IT. I think the tests you mentioned were developed
before the general idea of what an IOIT should look like emerged.
AFAIK the same goes for the tests in the io/google-cloud-platform module. I
recently created some issues that address that [1], [2], [3]. If anyone is
willing to take those, feel free (I can help with this).

[1] https://issues.apache.org/jira/browse/BEAM-4416
[2] https://issues.apache.org/jira/browse/BEAM-4399
[3] https://issues.apache.org/jira/browse/BEAM-4398

2018-05-30 14:00 GMT+02:00 Etienne Chauchot :

> Hi Łukasz
>
> Thanks for the details.
>
> I was thinking more about generalizing IT integration. For example,
> some IOs like Cassandra and Elasticsearch have ITs but no Groovy scripts.
> Also, I agree with your list.
> And thanks for the details about the automatic provisioning of backend
> services; I did not know that.
>
> Etienne
> On Wednesday, May 30, 2018 at 11:21 +0200, Łukasz Gajowy wrote:
>
> Hi Etienne,
>
> it is already possible, provided that there is an appropriate Jenkins job
> defined (see examples here: [1],[2]). Either the reviewer or the author can
> run the seed job to load job definitions (by typing "Run seed job" in a
> comment) and then run the test they are interested in (by specifying
> the correct phrase in the GitHub comment, e.g. "Run Java JdbcIO Performance
> Test"). The results are then available on Jenkins, so they are public too.
>
> Regarding the infrastructure: currently, if a test requires any
> Kubernetes infrastructure, it is set up by the PerfKitBenchmarker tool
> before the test is actually run. After the test execution, all the
> infrastructure is torn down. This also happens automatically, provided
> that all necessary Kubernetes scripts are there.
>
> Although this is possible, I must say that the whole "Performance
> Testing Framework" needs improvement in the following areas (so it should
> be considered a work in progress):
>  - documentation and instructions for the community (this is getting more
> urgent!)
>  - support for other runners (currently only direct and Dataflow are
> supported, as there were some issues when we tried to integrate it with
> Spark and Flink)
>  - support for other filesystems (currently only local and HDFS are
> supported)
>  - rename and reorganize IT jobs in Jenkins (see: [3])
>
> Also, I think it's worth looking at improving the job definitions
> (seed jobs overwrite all jobs, so this can collide with other developers'
> work). See the thread I started a while ago in [4] for further info.
>
> Best regards,
> Łukasz Gajowy
>
>
> [1] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
> [2] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_FileBasedIO_IT.groovy
> [3] https://issues.apache.org/jira/browse/BEAM-4298
> [4] https://lists.apache.org/thread.html/b1aaea2c7eadc7ca1d1326b94a8c4c3a67befc0753897fd7fa4a3a4e@%3Cdev.beam.apache.org%3E
>
> 2018-05-30 10:14 GMT+02:00 Etienne Chauchot :
>
> Hi guys
> As part of the CI improvement work, I suggest making it possible to run
> the integration tests (IT) of the IOs from the GitHub PR.
>
> Indeed, when doing a review, either the reviewer or the author needs to
> run the IT. The problem is that the results are private. It would be good
> to be able to run the IT using a phrase in GitHub (like the validates
> runner tests) so that the results are public like any other test in the PR.
> But it would require the backend IT infrastructure (Kubernetes/Docker
> ...) to be always up, and its credentials/location to be set in the
> related Jenkins Groovy script.
>
> I opened:
> https://issues.apache.org/jira/browse/BEAM-4427
>
> Thoughts?
>
> Best
> Etienne
>
>
>


Re: Hello Beam!

2018-05-30 Thread arun kumar
Hi All,

Thank you for adding me to the group. I am interested in Apache Beam with
the Google Cloud runner.

I need to get started on Apache Beam with the Google Cloud runner.

Please help me with where I should start, and let me know if anyone has
simple code for my requirement.

Thanks
Arunkumar

On Wed, May 30, 2018, 10:55 AM Jean-Baptiste Onofré  wrote:

> Welcome !
>
> Looking forward to work and discuss with you ;)
>
> Regards
> JB
>
> On 29/05/2018 23:49, Rui Wang wrote:
> > Hi there,
> >
> > I am Rui (pronounced as same as "Ray")!
> >
> > I recently joined Google Cloud. Beam is a very interesting project and I
> > cannot wait to contribute to it!
> >
> >
> > Thanks,
> > Rui
>
> --
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Performance testing documentation - suggestions request

2018-05-30 Thread Łukasz Gajowy
Hi,

the Performance Testing Framework has been an ongoing effort for some time
now. As I have noticed (and heard from the community), it is getting more
popular (this is good), but apart from changing the commands from mvn to
gradle ones, its documentation has not been updated for a very long time
(this is not good at all!). It's about time to document it properly. I
created a JIRA for this [1].

Before we resolve the JIRA, I thought it would be good to ask you about the
biggest pain points in the current documentation [2] (or in the Performance
Testing Framework itself): what do you find most lacking? Maybe there
are some sections you'd like to add (e.g. "Future plans of development")?
Such feedback would be very helpful in providing fine-tuned documentation,
so it is much appreciated. Feel free to comment here or in the comments
section of the ticket. Thanks in advance!

[1]: https://issues.apache.org/jira/browse/BEAM-4430
[2]:
https://beam.apache.org/documentation/io/testing/#i-o-transform-integration-tests

Best regards,
Łukasz


Re: [PROPOSAL] CI improvement: be able to run the IT of the IOs from github pull request

2018-05-30 Thread Etienne Chauchot
Hi Łukasz

Thanks for the details.

I was thinking more about generalizing IT integration. For example,
some IOs like Cassandra and Elasticsearch have ITs but no Groovy scripts.
Also, I agree with your list.
And thanks for the details about the automatic provisioning of backend
services; I did not know that.

Etienne

On Wednesday, May 30, 2018 at 11:21 +0200, Łukasz Gajowy wrote:
> Hi Etienne,
>
> it is already possible, provided that there is an appropriate Jenkins job
> defined (see examples here: [1],[2]). Either the reviewer or the author can
> run the seed job to load job definitions (by typing "Run seed job" in a
> comment) and then run the test they are interested in (by specifying
> the correct phrase in the GitHub comment, e.g. "Run Java JdbcIO Performance
> Test"). The results are then available on Jenkins, so they are public too.
>
> Regarding the infrastructure: currently, if a test requires any
> Kubernetes infrastructure, it is set up by the PerfKitBenchmarker tool
> before the test is actually run. After the test execution, all the
> infrastructure is torn down. This also happens automatically, provided
> that all necessary Kubernetes scripts are there.
>
> Although this is possible, I must say that the whole "Performance
> Testing Framework" needs improvement in the following areas (so it should
> be considered a work in progress):
>  - documentation and instructions for the community (this is getting more
> urgent!)
>  - support for other runners (currently only direct and Dataflow are
> supported, as there were some issues when we tried to integrate it with
> Spark and Flink)
>  - support for other filesystems (currently only local and HDFS are
> supported)
>  - rename and reorganize IT jobs in Jenkins (see: [3])
>
> Also, I think it's worth looking at improving the job definitions
> (seed jobs overwrite all jobs, so this can collide with other developers'
> work). See the thread I started a while ago in [4] for further info.
>
> Best regards,
> Łukasz Gajowy
>
>
> [1] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
> [2] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_FileBasedIO_IT.groovy
> [3] https://issues.apache.org/jira/browse/BEAM-4298
> [4] https://lists.apache.org/thread.html/b1aaea2c7eadc7ca1d1326b94a8c4c3a67befc0753897fd7fa4a3a4e@%3Cdev.beam.apache.org%3E
>
> 2018-05-30 10:14 GMT+02:00 Etienne Chauchot :
> > Hi guys
> > As part of the CI improvement work, I suggest making it possible to run
> > the integration tests (IT) of the IOs from the GitHub PR.
> >
> > Indeed, when doing a review, either the reviewer or the author needs to
> > run the IT. The problem is that the results are private. It would be good
> > to be able to run the IT using a phrase in GitHub (like the validates
> > runner tests) so that the results are public like any other test in the
> > PR.
> > But it would require the backend IT infrastructure (Kubernetes/Docker
> > ...) to be always up, and its credentials/location to be set in the
> > related Jenkins Groovy script.
> > 
> > I opened:
> > https://issues.apache.org/jira/browse/BEAM-4427
> > 
> > Thoughts?
> > 
> > Best
> > Etienne

Re: The full list of proposals / prototype documents

2018-05-30 Thread Łukasz Gajowy
Hi,

I just wanted to add those two (sorry for being kinda late with this):

https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit?usp=sharing
https://docs.google.com/document/d/1Cb7XVmqe__nA_WCrriAifL-3WCzbZzV4Am5W_SkQLeA/edit?usp=sharing

Thanks,
Łukasz

2018-05-29 22:42 GMT+02:00 Lukasz Cwik :

> Providing ownership to the PMC account allows others to take over
> ownership of the document once a contributor stops being active. This
> allows docs to be updated (even if just to point to a newer doc).
>
> On Tue, May 29, 2018 at 1:20 PM Kenneth Knowles  wrote:
>
>> My position on ownership is design docs are really documents "of the
>> moment" and authored by a particular individual or group. Experience shows
>> that even if you try, keeping it fresh is not likely to happen. Anything
>> that needs freshness (like end-user docs) should be in a different medium. I
>> would just date the gdoc so readers know how to interpret it (the automated
>> "last edit" date is not sufficient for understanding how stale something
>> is).
>>
>> So it seems like it makes little difference if the project or PMC has
>> ownership or even write access. Of course I have no objections if someone
>> wants to transfer ownership, but is there a reason to encourage it?
>>
>> Kenn
>>
>> On Tue, May 29, 2018 at 1:11 PM Lukasz Cwik  wrote:
>>
>>> I transferred ownership of the docs that I owned to the
>>> apacheb...@gmail.com PMC account and put the ones that I owned into the
>>> drive folder.
>>>
>>> Would it be a good idea for others to follow suit?
>>>
>>> Instructions on how to transfer ownership are here:
>>> http://support.it.mtu.edu/Accounts/E-Mail/75946047/How-do-I-transfer-ownership-of-a-Google-Doc.htm
>>>
>>>
>>>
>>> On Tue, May 29, 2018 at 11:23 AM Lukasz Cwik  wrote:
>>>
 I created a PR for the beam-site to link to the design docs and
 template from the contribution guide:
 https://github.com/apache/beam-site/pull/454

 On Fri, May 25, 2018 at 10:23 AM Lukasz Cwik  wrote:

> Here are some more links related to portability efforts:
>
>
> https://s.apache.org/beam-fn-api
>
> https://s.apache.org/beam-fn-api-processing-a-bundle
>
> https://s.apache.org/beam-fn-api-send-and-receive-data
>
> https://s.apache.org/beam-fn-state-api-and-bundle-processing
>
> https://s.apache.org/beam-fn-api-progress-reporting
>
> https://s.apache.org/beam-fn-api-container-contract
>
> https://s.apache.org/beam-breaking-fusion
>
> https://s.apache.org/beam-runner-api-combine-model
>
> https://s.apache.org/beam-fn-api-metrics
>
>
>
> On Thu, May 24, 2018 at 2:11 PM Scott Wegner 
> wrote:
>
>> Thanks for sharing these. I also put together a design doc template
>> based on common styling / sections I saw in the docs listed above. Others
>> are free to use it as they'd like.
>>
>> https://docs.google.com/document/d/1kVePqjt2daZd0bQHGUwghlcLbhvrny7VpflAzk9sjUg/edit?usp=sharing
>>
>> On Thu, May 24, 2018 at 6:23 AM Kenneth Knowles 
>> wrote:
>>
>>> OK, I will also put a list here of those I know off the top of my
>>> head. Some are redundant with Etienne's but short links that I can 
>>> think of:
>>>
>>> https://s.apache.org/a-new-dofn
>>> https://s.apache.org/beam-triggers
>>> https://s.apache.org/beam-sink-triggers
>>> https://s.apache.org/beam-runner-composites
>>> https://s.apache.org/beam-lateness
>>> https://s.apache.org/beam-runner-api
>>> https://s.apache.org/beam-state
>>> https://s.apache.org/beam-side-inputs-1-pager
>>>
>>> Kenn
>>>
>>> On Thu, May 24, 2018 at 6:08 AM Etienne Chauchot <
>>> echauc...@apache.org> wrote:
>>>
 Great that you are taking this action, Alexey!
 Here are the links I have; there are duplicates of the ones you
 already received, and maybe old docs as well:

 https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
 https://docs.google.com/document/d/1wR56Jef3XIPwj4DFzQKznuGPM3JDfRDVkxzeDlbdVSQ/edit
 https://docs.google.com/document/d/1Fl_LM918j7ZxAmCSkm43GBjV8knsZAIA1tRhvJ4DneM/edit#heading=h.1lcfuwfvxg2
 https://docs.google.com/document/d/1zcF4ZGtq8pxzLZxgD_JMWAouSszIf9LnFANWHKBsZlg/edit
 https://docs.google.com/document/d/1AQmx-T9XjSi1PNoEp5_L-lT0j7BkgTbmQnc6uFEMI4c/edit#heading=h.dtl8cwoybr2y
 https://docs.google.com/document/d/1u-4o_0uj8uKa2SVNPBNxIKfvcJ4t66ecCoU1M2yVoDA/edit#heading=h.c1deqkr0bp31
 https://docs.google.com/document/d/17H2sBEtnoTSxjzlrz7rmKtX5E3F0mW1NpFQzWzSYOpY/edit#heading=h.1lcfuwfvxg2
 https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf
 

Re: Survey: what is everyone working on that you want to share?

2018-05-30 Thread Łukasz Gajowy
Let's add a Performance Testing section too:
https://github.com/apache/beam-site/pull/455

Thanks,
Łukasz



2018-05-29 23:59 GMT+02:00 Carlos Alonso :

> My two cents:
>
> https://github.com/apache/beam/pull/5341
> https://issues.apache.org/jira/browse/BEAM-4257
>
> On Tue, May 29, 2018 at 7:50 PM Kenneth Knowles  wrote:
>
>> Fair point!
>>
>> It is already on master and released so I'll come up with one or more
>> concrete endeavors where someone could jump in and help.
>>
>> On Tue, May 29, 2018 at 10:46 AM Robert Bradshaw 
>> wrote:
>>
>>> Surprised there's not an item about SQL support, Kenn :).
>>>
>>> On Mon, May 28, 2018 at 5:16 AM David Morávek 
>>> wrote:
>>>
 here you go: https://github.com/apache/beam-site/pull/453

 Thanks,
 D.

 On Tue, May 15, 2018 at 11:51 PM, Kenneth Knowles 
 wrote:

> I wanted to bring back to this thread that, yes, I would most like
> pull requests :-)
>
> And please don't feel like you have to follow the pattern already
> there. It would be better if each project had a sentence describing it (be
> concise, still) and the whole content fit right in the page.
>
> On Tue, May 15, 2018 at 2:48 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Hi Kenn,
>>
>> I sent https://github.com/apache/beam-site/pull/441 to cover efforts
>> related to Python 3 support in Beam.
>>
>> Thanks,
>> Valentyn
>>
>> On Tue, May 15, 2018 at 10:27 AM, David Morávek <
>> david.mora...@gmail.com> wrote:
>>
>>> Hi Kenn,
>>>
>>> Java 8 DSL
>>>
>>> JIRA: dsl-euphoria / BEAM-3900
>>>
>>> Feature branch: dsl-euphoria
>>>
>>> Contact: David Moravek
>>>
>>> Description: Java 8 wrapper over the Beam Java SDK, based on the Euphoria
>>> API project.
>>>
>>>
>>> Thanks,
>>>
>>> David
>>>
>>>
>>>
>>> On Tue, May 15, 2018 at 7:27 AM, Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 TL;DR: for anyone who is willing & able to answer: What big-picture
 thing are you working on? I want to highlight it here:

 https://beam.apache.org/contribute/#works-in-progress

 (I want each of these to have a nice description, too! And a saved
 JIRA search for starter tasks would be clever...)

  context 

 Following feedback from the Beam summit and other discussions, I've
 been working with Melissa and looping in folks to revamp the website to
 have a more welcoming tone and to make it easier for newcomers to get
 started.

 Part of that is making the site and guide more concise and making
 the entry points more prominent and interesting. Starter tasks are OK,
 but to get lasting engagement the idea is for starter tasks to connect a
 newcomer with ongoing projects. Most importantly, they are more likely
 to have exciting interactions with experienced contributors, and to have
 follow-up work to the starter task.

 Kenn

>>>
>>>
>>



GroupByKey with sorted values within key

2018-05-30 Thread marek-simunek
Hi,
I have a question. In dsl-euphoria I am trying to translate “GroupByKey
with sorted values within key” to Beam. I am aware of the Java SDK extension
SortValues, but it doesn’t provide a sufficient abstraction for runners.

I noticed that DataflowRunner translates batch GroupByKey to
GroupByKeyAndSortValuesOnly. Is there any plan to have this in Beam core, so
that, for example, SparkRunner could translate “GroupByKey with sorted values
within key” using its own internals such as
repartitionAndSortWithinPartitions?


Thank you.
Marek Simunek
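
For readers new to the thread, the semantics being asked about can be sketched outside Beam in a few lines of plain Python. This is only an in-memory illustration; the function name and the (key, (secondary_key, value)) input shape are assumptions made for the example, not the SortValues API.

```python
from collections import defaultdict


def group_by_key_sorted(pairs):
    """Groups (key, (secondary_key, value)) pairs by key.

    Returns {key: [(secondary_key, value), ...]} with each list ordered by
    secondary_key.
    """
    grouped = defaultdict(list)
    for key, (secondary, value) in pairs:
        grouped[key].append((secondary, value))
    return {
        key: sorted(items, key=lambda item: item[0])
        for key, items in grouped.items()
    }
```

Note that `sorted` here needs all values of one key in memory at once, which is the limitation a runner-native shuffle sort such as repartitionAndSortWithinPartitions would avoid.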

Re: [PROPOSAL] CI improvement: be able to run the IT of the IOs from github pull request

2018-05-30 Thread Łukasz Gajowy
Hi Etienne,

it is already possible, provided that there is an appropriate Jenkins job
defined (see examples here: [1],[2]). Either the reviewer or the author can
run the seed job to load job definitions (by typing "Run seed job" in a
comment) and then run the test they are interested in (by specifying
the correct phrase in the GitHub comment, e.g. "Run Java JdbcIO Performance
Test"). The results are then available on Jenkins, so they are public too.
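
As a simplified illustration of the comment-phrase mechanism (not how Jenkins actually implements it), the mapping from a PR comment to the jobs it starts could look like the sketch below. The two phrases are the ones quoted above; the job labels, the exact-line matching rule, and the function name are assumptions made for the example.

```python
# Trigger phrase -> job label. The labels are hypothetical.
TRIGGER_PHRASES = {
    "Run seed job": "seed job",
    "Run Java JdbcIO Performance Test": "JdbcIO performance test",
}


def jobs_for_comment(comment):
    """Returns the jobs whose trigger phrase appears as a full comment line."""
    lines = {line.strip() for line in comment.splitlines()}
    return [job for phrase, job in TRIGGER_PHRASES.items() if phrase in lines]
```

For example, a comment consisting of the single line "Run seed job" would select only the seed job.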

Regarding the infrastructure: currently, if a test requires any
Kubernetes infrastructure, it is set up by the PerfKitBenchmarker tool before
the test is actually run. After the test execution, all the infrastructure
is torn down. This also happens automatically, provided that all necessary
Kubernetes scripts are there.
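
The set-up / run / tear-down order described above can be sketched as follows. This is a generic pattern in plain Python, not PerfKitBenchmarker code; `provisioned`, `run_io_test`, and the log messages are hypothetical names for illustration.

```python
from contextlib import contextmanager


@contextmanager
def provisioned(service, log):
    """Hypothetical stand-in for the set-up and tear-down around one test run."""
    log.append(f"create {service}")
    try:
        yield service
    finally:
        # Runs even when the test body raises, so nothing is left behind.
        log.append(f"delete {service}")


def run_io_test(service, test, log):
    """Provisions the backend, runs the test against it, then tears it down."""
    with provisioned(service, log) as backend:
        log.append(f"run test against {backend}")
        return test(backend)
```

Putting the tear-down in a `finally` block is the design point: the backend is removed whether the test passes or fails, which is why it does not need to be always up.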

Although this is possible, I must say that the whole "Performance
Testing Framework" needs improvement in the following areas (so it should be
considered a work in progress):
 - documentation and instructions for the community (this is getting more
urgent!)
 - support for other runners (currently only direct and Dataflow are
supported, as there were some issues when we tried to integrate it with
Spark and Flink)
 - support for other filesystems (currently only local and HDFS are
supported)
 - rename and reorganize IT jobs in Jenkins (see: [3])

Also, I think it's worth looking at improving the job definitions
(seed jobs overwrite all jobs, so this can collide with other developers'
work). See the thread I started a while ago in [4] for further info.

Best regards,
Łukasz Gajowy


[1]
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
[2]
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_FileBasedIO_IT.groovy
[3] https://issues.apache.org/jira/browse/BEAM-4298
[4]
https://lists.apache.org/thread.html/b1aaea2c7eadc7ca1d1326b94a8c4c3a67befc0753897fd7fa4a3a4e@%3Cdev.beam.apache.org%3E

2018-05-30 10:14 GMT+02:00 Etienne Chauchot :

> Hi guys
> As part of the CI improvement work, I suggest making it possible to run
> the integration tests (IT) of the IOs from the GitHub PR.
>
> Indeed, when doing a review, either the reviewer or the author needs to
> run the IT. The problem is that the results are private. It would be good
> to be able to run the IT using a phrase in GitHub (like the validates
> runner tests) so that the results are public like any other test in the PR.
> But it would require the backend IT infrastructure (Kubernetes/Docker
> ...) to be always up, and its credentials/location to be set in the
> related Jenkins Groovy script.
>
> I opened:
> https://issues.apache.org/jira/browse/BEAM-4427
>
> Thoughts?
>
> Best
> Etienne
>


Re: Hello Beam!

2018-05-30 Thread Łukasz Gajowy
Welcome! :)

2018-05-30 7:25 GMT+02:00 Jean-Baptiste Onofré :

> Welcome !
>
> Looking forward to work and discuss with you ;)
>
> Regards
> JB
>
> On 29/05/2018 23:49, Rui Wang wrote:
> > Hi there,
> >
> > I am Rui (pronounced as same as "Ray")!
> >
> > I recently joined Google Cloud. Beam is a very interesting project and I
> > cannot wait to contribute to it!
> >
> >
> > Thanks,
> > Rui
>
> --
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


[PROPOSAL] CI improvement: be able to run the IT of the IOs from github pull request

2018-05-30 Thread Etienne Chauchot
Hi guys
As part of the CI improvement work, I suggest making it possible to run the
integration tests (IT) of the IOs from the GitHub PR.

Indeed, when doing a review, either the reviewer or the author needs to run
the IT. The problem is that the results are private. It would be good to be
able to run the IT using a phrase in GitHub (like the validates runner tests)
so that the results are public like any other test in the PR.
But it would require the backend IT infrastructure (Kubernetes/Docker ...) to
be always up, and its credentials/location to be set in the related Jenkins
Groovy script.

I opened:
https://issues.apache.org/jira/browse/BEAM-4427

Thoughts?

Best
Etienne