Summary for all folks following this story -- and many thanks for explaining configs to me and pointing me to files and such.
- Scott made changes to the config and we can now run 3 ValidatesRunner.Dataflow suites in parallel (each run is about 2 hours)
- With the latest quota changes, we peaked at ~70% capacity in concurrent Dataflow jobs when running those
- I've been keeping an eye on quota peaks for all resources today and have not seen any worrisome limits overall.
- Also note there are improvements planned to the ValidatesRunner.Dataflow test so various items get batched and the test itself runs faster -- I believe it's on Alan's radar

Cheers,
r

On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com> wrote:

> Done!
>
> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>
>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
>> you take a look? I've filed [BEAM-4722]:
>> https://issues.apache.org/jira/browse/BEAM-4722
>>
>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>
>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rfern...@google.com> wrote:
>>
>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>>> should not be a problem; if they are, please file a JIRA under gcp-quota.
>>>
>>> Cheers,
>>> r
>>>
>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> One thing that is nice when you do this is to be able to share your
>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>> don't have to insist on evidence.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>>
>>>>> A few thoughts:
>>>>>
>>>>> * The Jenkins job getting backed up is
>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total
>>>>> runs. So this job is idle more often than backlogged.
>>>>>
>>>>> * It's difficult to reason about our exact quota needs because
>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>> different parallelism configurations. If we have budget, we could
>>>>> enable concurrent execution of this job and increase our quota enough
>>>>> to give some breathing room. If we do this, I recommend limiting the
>>>>> max concurrency via throttleConcurrentBuilds [2] to some reasonable
>>>>> limit.
>>>>>
>>>>> * This test suite is meant to be an exhaustive post-commit validation
>>>>> of the Dataflow runner, and tests a lot of different aspects of a
>>>>> runner. It would be more efficient to run locally only the tests
>>>>> affected by your change. Note that this requires having access to a
>>>>> GCP project with billing, but most Dataflow developers probably have
>>>>> access to this already.
>>>>> The command for this is:
>>>>>
>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
>>>>>     -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot \
>>>>>     --tests "org.apache.beam.MyTestClass"
>>>>>
>>>>> [1] https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>> [2] https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>
>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> The validates runner test parallelism is controlled here and is
>>>>>> currently set to be "unlimited":
>>>>>>
>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>
>>>>>> Each test fork is run on a different gradle worker, so the number of
>>>>>> parallel test runs is limited to the max number of workers
>>>>>> configured, which is controlled here:
>>>>>>
>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>
>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>
>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>
>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>
>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>> - Where are those settings?
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>>>> currently allow only one of these to run at a time, to control
>>>>>>>> usage of Dataflow and of GCE quota. Other types of tests do not
>>>>>>>> suffer from this issue.
>>>>>>>>
>>>>>>>> I would like to see if it's possible to increase Dataflow quota so
>>>>>>>> we can run more of these in parallel. It took me 8 hours end to end
>>>>>>>> to run these tests (about 6 hours for the run to be scheduled). If
>>>>>>>> there was a failure, I would have had to repeat the whole process.
>>>>>>>> In the worst case, this process could have taken me days. While
>>>>>>>> this is not as pressing as some other issues (as most people don't
>>>>>>>> need to run the Dataflow tests on every PR), fixing it would make
>>>>>>>> such changes much easier to manage.
>>>>>>>>
>>>>>>>> Reuven
>>>>>>>>
>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>
>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he was
>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours
>>>>>>>>> or so. I would like to help reduce these wait times by increasing
>>>>>>>>> parallelism. I need help understanding the continuous minimum of
>>>>>>>>> what we use. It seems the following is true:
>>>>>>>>>
>>>>>>>>> - There seem to always be 16 Jenkins machines on (16 CPUs each)
>>>>>>>>> - There seem to be three GKE machines always on (1 CPU each)
>>>>>>>>> - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>>   run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>
>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>> tests, we still have room for 80 other concurrent Dataflow jobs to
>>>>>>>>> execute, with 75% of CPU capacity.
>>>>>>>>>
>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> r
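
For anyone who wants to sanity-check the parallelism numbers in this thread, here is a rough sketch of the arithmetic Lukasz describes: the Gradle worker cap is 3 * number of CPU cores, and each test fork runs on its own worker. The function names are hypothetical, and the sketch assumes each fork has at most one Dataflow job in flight at a time and that concurrent suite runs do not share workers; neither assumption comes from the Beam config itself, so treat the result as an upper bound, not a measurement.

```python
def max_parallel_forks(cpu_cores: int, workers_per_core: int = 3) -> int:
    """Upper bound on parallel test forks for one suite run.

    Mirrors the "3 * number of CPU cores" worker cap mentioned in the thread.
    """
    if cpu_cores < 1 or workers_per_core < 1:
        raise ValueError("cpu_cores and workers_per_core must be positive")
    return cpu_cores * workers_per_core


def peak_dataflow_jobs(concurrent_suite_runs: int, cpu_cores: int,
                       workers_per_core: int = 3) -> int:
    """Upper bound on simultaneous Dataflow jobs across concurrent suite runs.

    Assumption (not from the thread): one Dataflow job per fork at a time.
    """
    if concurrent_suite_runs < 1:
        raise ValueError("concurrent_suite_runs must be positive")
    return concurrent_suite_runs * max_parallel_forks(cpu_cores, workers_per_core)


if __name__ == "__main__":
    # One suite run on a 16-CPU Jenkins machine.
    print(max_parallel_forks(16))
    # Three suite runs in parallel, each on its own 16-CPU machine.
    print(peak_dataflow_jobs(3, 16))
```

With 16 CPUs per Jenkins machine, a single run is capped at 48 forks, which lines up with the "up to 48 Dataflow jobs in parallel" figure above. Three concurrent runs would be capped at 144 under the same assumptions; the actual peak depends on how many forks are busy at once.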