A few thoughts:

* The Jenkins job getting backed up
is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
Mikhail refactored the Jenkins jobs, it runs only when explicitly requested
via "Run Dataflow ValidatesRunner", and it has only 8 runs in total. So this
job is idle more often than backlogged.

* It's difficult to reason about our exact quota needs because Dataflow
jobs get launched from various Jenkins jobs that have different parallelism
configurations. If we have budget, we could enable concurrent execution of
this job and increase our quota enough to give some breathing room. If we
do this, I recommend limiting the max concurrency via
throttleConcurrentBuilds [2] to some reasonable limit.

* This test suite is meant to be an exhaustive post-commit validation of
the Dataflow runner, and it tests many different aspects of a runner. It
would be more efficient to run locally only the tests affected by your change.
Note that this requires having access to a GCP project with billing, but
most Dataflow developers probably have access to this already. The command
for this is:

./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
    -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot \
    --tests "org.apache.beam.MyTestClass"
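
For the throttleConcurrentBuilds [2] suggestion above, a minimal Job DSL
sketch might look like the following. The job name and the limit of 3 are
placeholders I made up for illustration, not values from our Jenkins
configuration, and this assumes the Throttle Concurrent Builds plugin is
installed on the Jenkins master:

```groovy
// Hypothetical job definition; the name and the limit are illustrative only.
job('example_ValidatesRunner_Dataflow') {
    // Allow more than one build of this job to run at the same time.
    concurrentBuild()

    // Cap concurrency so parallel runs cannot exhaust Dataflow/GCE quota.
    throttleConcurrentBuilds {
        maxTotal(3)  // assumed limit; choose it based on the quota we get
    }
}
```

The cap would probably need to be tuned together with the Gradle worker
limit, since each concurrent build can itself launch many Dataflow jobs.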

[1]
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
[2]
https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds


On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:

> The validates runner test parallelism is controlled here and is currently
> set to be "unlimited":
>
> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>
> Each test fork is run on a different gradle worker, so the number of
> parallel test runs is limited to the max number of workers configured which
> is controlled here:
>
> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
> It is currently configured to 3 * number of CPU cores.
>
> We are already running up to 48 Dataflow jobs in parallel.
>
>
> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rfern...@google.com>
> wrote:
>
>> - How many resources do ValidatesRunner tests use?
>> - Where are those settings?
>>
>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>
>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>> currently allow only one of these to run at a time, to control usage of
>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>> issue.
>>>
>>> I would like to see if it's possible to increase Dataflow quota so we
>>> can run more of these in parallel. It took me 8 hours end to end to run
>>> these tests (about 6 hours for the run to be scheduled). If there was a
>>> failure, I would have had to repeat the whole process. In the worst case,
>>> this process could have taken me days. While this is not as pressing as
>>> some other issues (as most people don't need to run the Dataflow tests on
>>> every PR), fixing it would make such changes much easier to manage.
>>>
>>> Reuven
>>>
>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <rfern...@google.com>
>>> wrote:
>>>
>>>> +Reuven Lax <re...@google.com> told me yesterday that he was waiting
>>>> for some test to be scheduled and run, and it took 6 hours or so. I would
>>>> like to help reduce these wait times by increasing parallelism. I need help
>>>> understanding the continuous minimum of what we use. It seems the following
>>>> is true:
>>>>
>>>>
>>>>    - There seem to always be 16 Jenkins machines on (16 CPUs each)
>>>>    - There seem to be three GKE machines always on (1 CPU each)
>>>>    - Most (if not all) unit tests run on 1 machine, and seem to run
>>>>    one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>
>>>> With current quotas, if we parallelize to 20 concurrent unit tests, we
>>>> still have room for 80 other concurrent dataflow jobs to execute, with 75%
>>>> of CPU capacity.
>>>>
>>>> Thoughts? Additional data?
>>>>
>>>> Thanks,
>>>> r
>>>>
>>>
