How much slower did the post-commits become after removing concurrency?

On Thu, Aug 2, 2018 at 2:32 PM Mikhail Gryzykhin <mig...@google.com> wrote:
> I've disabled concurrency for the auto-triggered post-commits job. That should
> reduce job scheduling considerably.
>
> I believe that this change should resolve the quota issue we have seen this
> time. I'll monitor whether the problem reappears.
>
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <pabl...@google.com> wrote:
>
>> It feels to me like a peak of 60 jobs per minute is pretty high. If I
>> understand correctly, we run up to 20 Dataflow jobs in parallel per test
>> suite? Or what's the number here?
>>
>> It is also true that most of our tests are simple NeedsRunner tests that
>> test a couple of elements, so the whole pipeline overhead is on startup.
>> This may be improved by lumping tests together (though might we lose
>> debuggability?). Our average number of jobs is, I hope, muuuch smaller
>> than 60 per minute...
>>
>> With all these considerations, I would lean more towards having a retry
>> policy as the immediate solution.
>> -P.
>>
>> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <apill...@google.com> wrote:
>>
>>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>>> user per Jenkins host?
>>>
>>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> There was also a proposal to lump multiple tests into a single Dataflow
>>>> job instead of spinning up a separate Dataflow job for each test.
>>>>
>>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>>>>
>>>>> I synced with Rafael. Below is a summary of the discussion.
>>>>>
>>>>> This quota is CreateRequestsPerMinutePerUser and it defaults to 60
>>>>> requests per minute per user.
>>>>>
>>>>> I've created Jira [BEAM-5053](https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>>>
>>>>> I see the following options we can utilize:
>>>>> 1. Add retry logic. Although this limits us to 1 Dataflow job start
>>>>> per second for the whole of Jenkins.
>>>>> In the long run this can also block one test
>>>>> job if other jobs take all the slots.
>>>>> 2. Utilize different users to spin up Dataflow jobs.
>>>>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>>>>> limits the value to 60 requests per minute.
>>>>> 4. Long-run generic suggestion: limit the number of Dataflow jobs we spin
>>>>> up and move tests to the form of unit or component tests.
>>>>>
>>>>> Please fill in any insights or ideas you have on this.
>>>>>
>>>>> Regards,
>>>>> --Mikhail
>>>>>
>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>
>>>>>
>>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> Seems that we hit the quota issue again:
>>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>>
>>>>>> Can someone share information on how this was triaged last time or
>>>>>> guide me on possible follow-up actions?
>>>>>>
>>>>>> Regards,
>>>>>> --Mikhail
>>>>>>
>>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>
>>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>>
>>>>>>> - Scott made changes to the config and we can now run 3
>>>>>>> ValidatesRunner.Dataflow in parallel (each run is about 2 hours)
>>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>>> concurrent Dataflow jobs when running those
>>>>>>> - I've been keeping an eye on quota peaks for all resources today
>>>>>>> and have not seen any worrisome limits overall.
>>>>>>> - Also note there are improvements planned to the
>>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>>
>>>>>>> Cheers,
>>>>>>> r
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>
>>>>>>>> Done!
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota
>>>>>>>>> [1]. Can you take a look? I've filed [BEAM-4722]:
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>>>> Quotas should not be a problem, if they are, please file a JIRA under
>>>>>>>>>> gcp-quota.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> r
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>>> your results. Though if all you are sharing is "they passed" then I
>>>>>>>>>>> guess we don't have to insist on evidence.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>>
>>>>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>>>>>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total runs.
>>>>>>>>>>>> So this job is idle more often than backlogged.
>>>>>>>>>>>>
>>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>>>>> enable concurrent execution of this job and increase our quota
>>>>>>>>>>>> enough to give some breathing room. If we do this, I recommend
>>>>>>>>>>>> limiting the max concurrency via throttleConcurrentBuilds [2] to
>>>>>>>>>>>> some reasonable limit.
>>>>>>>>>>>>
>>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>>> validation of Dataflow runner, and tests a lot of different
>>>>>>>>>>>> aspects of a runner. It would be more efficient to run locally
>>>>>>>>>>>> only the tests affected by your change. Note that this requires
>>>>>>>>>>>> having access to a GCP project with billing, but most Dataflow
>>>>>>>>>>>> developers probably have access to this already.
>>>>>>>>>>>> The command for this is:
>>>>>>>>>>>>
>>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner
>>>>>>>>>>>> -PdataflowProject=myGcpProject
>>>>>>>>>>>> -PdataflowTempRoot=gs://myGcsTempRoot
>>>>>>>>>>>> --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>>> [2] https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The validates runner test parallelism is controlled here and
>>>>>>>>>>>>> is currently set to be "unlimited":
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>>
>>>>>>>>>>>>> Each test fork is run on a different gradle worker, so the
>>>>>>>>>>>>> number of parallel test runs is limited to the max number of
>>>>>>>>>>>>> workers configured, which is controlled here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time,
>>>>>>>>>>>>>>> to control usage of Dataflow and of GCE quota. Other types of
>>>>>>>>>>>>>>> tests do not suffer from this issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8
>>>>>>>>>>>>>>> hours end to end to run these tests (about 6 hours for the run
>>>>>>>>>>>>>>> to be scheduled). If there was a failure, I would have had to
>>>>>>>>>>>>>>> repeat the whole process. In the worst case, this process could
>>>>>>>>>>>>>>> have taken me days. While this is not as pressing as some other
>>>>>>>>>>>>>>> issues (as most people don't need to run the Dataflow tests on
>>>>>>>>>>>>>>> every PR), fixing it would make such changes much easier to
>>>>>>>>>>>>>>> manage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <rfern...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he
>>>>>>>>>>>>>>>> was waiting for some test to be scheduled and run, and it took
>>>>>>>>>>>>>>>> 6 hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>>>>> increasing parallelism. I need help understanding the
>>>>>>>>>>>>>>>> continuous minimum of what we use.
>>>>>>>>>>>>>>>> It seems the following is true:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - There seems to always be 16 Jenkins machines on (16 CPUs each)
>>>>>>>>>>>>>>>> - There seems to be three GKE machines always on (1 CPU each)
>>>>>>>>>>>>>>>> - Most (if not all) unit tests run on 1 machine, and
>>>>>>>>>>>>>>>> seem to run one-at-a-time <-- I think we can safely
>>>>>>>>>>>>>>>> parallelize this to 20.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent
>>>>>>>>>>>>>>>> unit tests, we still have room for 80 other concurrent
>>>>>>>>>>>>>>>> Dataflow jobs to execute, with 75% of CPU capacity.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>> Got feedback? go/pabloem-feedback
>> <https://goto.google.com/pabloem-feedback>
>>
>
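
For anyone picking up the "retry logic" option from the thread, here is a rough sketch of what retrying a job submission with exponential backoff could look like. It is only an illustration under assumptions: QuotaExceededError, retry_with_backoff and the submit callable are hypothetical names, not part of Beam, Jenkins or the Dataflow API, and the delay values are placeholders rather than tuned settings.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for a per-minute create-request quota rejection."""


def retry_with_backoff(submit, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, jitter=random.random):
    """Call submit() until it succeeds or max_attempts is reached.

    Between attempts, waits base_delay * 2**attempt seconds (capped at
    max_delay), scaled by a jitter factor in [0.5, 1.0) so that parallel
    jobs do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + 0.5 * jitter()))
```

The sleep and jitter parameters are injectable only so the behaviour can be checked without real waiting. Whether backoff alone is enough depends on how many jobs compete for the same per-user quota, which is the concern raised under option 1.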