Build failed in Jenkins: beam_SeedJob #662

2017-11-21 Thread Apache Jenkins Server
See 

--
GitHub pull request #4160 of commit 133614493c179f73f1aac7fad272ba79754768a8, 
no merge conflicts.
Setting status of 133614493c179f73f1aac7fad272ba79754768a8 to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/662/ and message: 'Build started 
sha1 is merged.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam3 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git +refs/heads/*:refs/remotes/origin/* +refs/pull/4160/*:refs/remotes/origin/pr/4160/*
 > git rev-parse refs/remotes/origin/pr/4160/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/4160/merge^{commit} # timeout=10
Checking out Revision cc790bbfa8a36cf7ed729f39a02f2825bd63cd05 
(refs/remotes/origin/pr/4160/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f cc790bbfa8a36cf7ed729f39a02f2825bd63cd05
Commit message: "Merge 133614493c179f73f1aac7fad272ba79754768a8 into 
e9d746a34a82791571b002a78c62e9f46efd4253"
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_beam_Java_Build.groovy
Processing DSL script job_beam_Java_CodeHealth.groovy
Processing DSL script job_beam_Java_IntegrationTest.groovy
Processing DSL script job_beam_Java_UnitTest.groovy
Processing DSL script job_beam_PerformanceTests_Dataflow.groovy
Processing DSL script job_beam_PerformanceTests_JDBC.groovy
Processing DSL script job_beam_PerformanceTests_Python.groovy
Processing DSL script job_beam_PerformanceTests_Spark.groovy
Processing DSL script job_beam_PostCommit_Java_JDKVersionsTest.groovy
Processing DSL script job_beam_PostCommit_Java_MavenInstall.groovy
Processing DSL script job_beam_PostCommit_Java_MavenInstall_Windows.groovy
Processing DSL script job_beam_PostCommit_Java_ValidatesRunner_Apex.groovy
Processing DSL script job_beam_PostCommit_Java_ValidatesRunner_Dataflow.groovy
Processing DSL script job_beam_PostCommit_Java_ValidatesRunner_Flink.groovy
Processing DSL script job_beam_PostCommit_Java_ValidatesRunner_Gearpump.groovy
Processing DSL script job_beam_PostCommit_Java_ValidatesRunner_Spark.groovy
Processing DSL script job_beam_PostCommit_Python_ValidatesRunner_Dataflow.groovy
Processing DSL script job_beam_PostCommit_Python_Verify.groovy
Processing DSL script job_beam_PreCommit_Go_MavenInstall.groovy
Processing DSL script job_beam_PreCommit_Java_GradleBuild.groovy
ERROR: (job_beam_PreCommit_Java_GradleBuild.groovy, line 37) No signature of 
method: javaposse.jobdsl.dsl.helpers.publisher.PublisherContext.archiveJUnit() 
is applicable for argument types: (java.lang.String) values: 
[**/build/test-results/**/*.xml]
Possible solutions: archiveJunit(java.lang.String), 
archiveJunit(java.lang.String, groovy.lang.Closure), 
archiveXUnit(groovy.lang.Closure)
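
(The failure is a case-sensitive method-name mismatch: the Job DSL
publisher context exposes archiveJunit, not archiveJUnit, so line 37 of
job_beam_PreCommit_Java_GradleBuild.groovy presumably just needs
archiveJunit('**/build/test-results/**/*.xml'), per the suggested
signatures above.)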



Re: Issues processing 150K files with DataflowRunner

2017-11-21 Thread Chamikara Jayalath
I suspect that you might be hitting a Dataflow API message limit during
the initial splitting of the source. Some details are available under
"Total number of BoundedSource objects" at the page below (you should see
a similar message in the worker logs, though the exact error message might
be out of date).
https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline

The exact number of files you can support depends on the size of the
generated splits (usually about 400k for TextIO).

One solution is to develop a ReadAll() transform for VcfSource, similar to
the one available for TextIO:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L409
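
For illustration, here is a minimal sketch of that pattern using the
existing TextIO version; the bucket path is a placeholder, and a VcfSource
ReadAll would follow the same shape:

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

with beam.Pipeline() as p:
  # The file patterns are themselves a PCollection, so glob expansion and
  # source splitting happen inside the pipeline instead of in one large
  # (size-limited) API request at job submission time.
  patterns = p | beam.Create(['gs://my-bucket/vcf/*.vcf'])  # placeholder
  records = patterns | 'ReadAll' >> ReadAllFromText()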

Thanks,
Cham


On Tue, Nov 21, 2017 at 8:04 AM Asha Rostamianfar
 wrote:

> Hi,
>
> I'm wondering whether anyone has tried processing a large number (~150K) of
> files using DataflowRunner? We are seeing behavior where the Dataflow job
> starts but never attaches any workers. After 1h, the service cancels the
> job as "stuck". See logs here
> <
> https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1=1=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4
> >.
> It works fine for a smaller number of files (e.g. 1k).
>
> We have tried setting num_workers, max_num_workers, etc. Are there any
> other settings that we can try?
>
> Context: the pipeline is using the Python Apache Beam SDK and running the
> code at https://github.com/googlegenomics/gcp-variant-transforms. It's
> using the VcfSource, which is based on TextSource. See this thread
> <
> https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ
> >
> for
> more context.
>
> Thanks,
> Asha
>


Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-21 Thread Kenneth Knowles
On Mon, Nov 20, 2017 at 5:01 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> In the verification spreadsheet, I'm not sure I understand the difference
> between the "YARN" and "Standalone cluster/service" rows. Which is
> Dataproc? It definitely uses YARN, but it is also a standalone
> cluster/service. Does it count for both?
>

No, it doesn't. A number of runners have their own non-YARN cluster mode. I
would expect the launching experience to be different and the portable
container management to differ. If they are identical, experts in
those systems should feel free to coalesce the rows. Conversely, as other
platforms become supported, they could be added or not based on whether
they are substantively different from a user experience or QA point of view.

Kenn


> Seems now we're missing just Apex and Flink cluster verifications.
>
> *though the Spark runner took 6x longer to run UserScore, partly, I guess,
> because it didn't do autoscaling (the Dataflow runner ramped up to 5
> workers whereas the Spark runner used 2). For some reason the Spark runner
> chose not to split the 10GB input files into chunks.
>
> On Mon, Nov 20, 2017 at 3:46 PM Reuven Lax 
> wrote:
>
> > Done
> >
> > On Tue, Nov 21, 2017 at 3:08 AM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> > > Thanks. You need to re-sign as well.
> > >
> > > On Mon, Nov 20, 2017 at 12:14 AM, Reuven Lax  >
> > > wrote:
> > > > FYI these generated files have been removed from the source
> > distribution.
> > > >
> > > > On Sat, Nov 18, 2017 at 9:09 AM, Reuven Lax 
> wrote:
> > > >
> > > >> hmmm, I thought I removed those generated files from the zip file
> > before
> > > >> sending this email. Let me check again.
> > > >>
> > > >> Reuven
> > > >>
> > > >> On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw <
> > > >> rober...@google.com.invalid> wrote:
> > > >>
> > > >>> The source distribution contains a couple of files not on github
> > > >>> (e.g. folders that were added on master, Python generated files).
> > > >>> The pom files differed only by the missing -SNAPSHOT; other than
> > > >>> that, presumably the source release should just be "wget
> > > >>> https://github.com/apache/beam/archive/release-2.2.0.zip"?
> > > >>>
> > > >>> diff -rq apache-beam-2.2.0 beam/ | grep -v pom.xml
> > > >>>
> > > >>> # OK?
> > > >>>
> > > >>> Only in apache-beam-2.2.0: DEPENDENCIES
> > > >>>
> > > >>> # Expected.
> > > >>>
> > > >>> Only in beam/: .git
> > > >>> Only in beam/: .gitattributes
> > > >>> Only in beam/: .gitignore
> > > >>>
> > > >>> # These folders are probably from switching around between master
> > > >>> and git branches.
> > > >>>
> > > >>> Only in apache-beam-2.2.0: model
> > > >>> Only in apache-beam-2.2.0/runners/flink: examples
> > > >>> Only in apache-beam-2.2.0/runners/flink: runner
> > > >>> Only in apache-beam-2.2.0/runners/gearpump: jarstore
> > > >>> Only in apache-beam-2.2.0/sdks/java/extensions: gcp-core
> > > >>> Only in apache-beam-2.2.0/sdks/java/extensions: sketching
> > > >>> Only in apache-beam-2.2.0/sdks/java/io: file-based-io-tests
> > > >>> Only in apache-beam-2.2.0/sdks/java/io: hdfs
> > > >>> Only in apache-beam-2.2.0/sdks/java/maven-archetypes/examples/src/main/resources/archetype-resources: src
> > > >>> Only in apache-beam-2.2.0/sdks/java/maven-archetypes/examples-java8/src/main/resources/archetype-resources: src
> > > >>> Only in apache-beam-2.2.0/sdks/java: microbenchmarks
> > > >>>
> > > >>> # Here's the generated protos.
> > > >>>
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_artifact_api_pb2_grpc.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_artifact_api_pb2.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_fn_api_pb2_grpc.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_fn_api_pb2.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_job_api_pb2_grpc.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_job_api_pb2.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_provision_api_pb2_grpc.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_provision_api_pb2.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_runner_api_pb2_grpc.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> beam_runner_api_pb2.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> endpoints_pb2_grpc.py
> > > >>> Only in apache-beam-2.2.0/sdks/python/apache_beam/portability/api:
> > > >>> endpoints_pb2.py
> > > >>> Only in 

Issues processing 150K files with DataflowRunner

2017-11-21 Thread Asha Rostamianfar
Hi,

I'm wondering whether anyone has tried processing a large number (~150K) of
files using DataflowRunner? We are seeing behavior where the Dataflow job
starts but never attaches any workers. After 1h, the service cancels the
job as "stuck". See logs here
.
It works fine for a smaller number of files (e.g. 1k).

We have tried setting num_workers, max_num_workers, etc. Are there any
other settings that we can try?
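
For reference, a minimal sketch of how those flags reach the Python SDK
(aside from the runner name, all values below are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# num_workers sets the initial worker pool; max_num_workers caps
# autoscaling. Project and bucket names are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',
    '--num_workers=10',
    '--max_num_workers=100',
])
p = beam.Pipeline(options=options)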

Context: the pipeline is using the Python Apache Beam SDK and running the
code at https://github.com/googlegenomics/gcp-variant-transforms. It's
using the VcfSource, which is based on TextSource. See this thread

for
more context.

Thanks,
Asha