How much slower did the post-commits become after removing concurrency?

On Thu, Aug 2, 2018 at 2:32 PM Mikhail Gryzykhin <mig...@google.com> wrote:

> I've disabled concurrency for the auto-triggered post-commit jobs. That
> should considerably reduce how many jobs get scheduled at once.
>
> I believe that this change should resolve the quota issue we have seen this
> time. I'll monitor whether the problem reappears.
>
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <pabl...@google.com> wrote:
>
>> It feels to me like a peak of 60 jobs per minute is pretty high. If I
>> understand correctly, we run up to 20 dataflow jobs in parallel per test
>> suite? Or what's the number here?
>>
>> It is also true that most of our tests are simple NeedsRunner tests that
>> test a couple of elements, so the whole pipeline overhead is on startup. This
>> may be improved by lumping tests together (though might we lose
>> debuggability?). Our average number of jobs is, I hope, muuuch smaller
>> than 60 per minute...
>>
>> With all these considerations, I would lean more towards having a retry
>> policy as the immediate solution.
>> -P.
>>
>> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <apill...@google.com>
>> wrote:
>>
>>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>>> user per Jenkins host?
>>>
>>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> There was also a proposal to lump multiple tests into a single Dataflow
>>>> job instead of spinning up a separate Dataflow job for each test.
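A rough sketch of the lumping idea, assuming each batch of tests would be submitted as one Dataflow job. The name batch_tests and the batch size are hypothetical; how the tests would actually be combined into a single pipeline is not covered here.

```python
def batch_tests(test_names, batch_size):
    """Split test names into consecutive batches of at most batch_size.

    Assumption: each returned batch would run as a single Dataflow job.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [test_names[i:i + batch_size]
            for i in range(0, len(test_names), batch_size)]


# Placeholder test names, for illustration only.
tests = ["TestA", "TestB", "TestC", "TestD", "TestE"]
batches = batch_tests(tests, batch_size=2)
print(len(tests), "jobs unbatched,", len(batches), "jobs batched")
```

Larger batches mean fewer job-creation requests, at the cost of coarser failure reports per job.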
>>>>
>>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <mig...@google.com>
>>>> wrote:
>>>>
>>>>> I synced with Rafael. Below is a summary of the discussion.
>>>>>
>>>>> The quota is CreateRequestsPerMinutePerUser, which defaults to 60
>>>>> requests per minute per user.
>>>>>
>>>>> I've created Jira BEAM-5053 for this:
>>>>> https://issues.apache.org/jira/browse/BEAM-5053
>>>>>
>>>>> I see the following options:
>>>>> 1. Add retry logic. This limits us to 1 Dataflow job start per second
>>>>> for the whole of Jenkins, though. In the long run it can also block one
>>>>> test job if other jobs take all the slots.
>>>>> 2. Use different users to spin up Dataflow jobs.
>>>>> 3. Find a way to raise the quota limit on Dataflow. By default the field
>>>>> limits the value to 60 requests per minute.
>>>>> 4. Long-term, generic suggestion: limit the number of Dataflow jobs we
>>>>> spin up and move tests to the form of unit or component tests.
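A minimal sketch of what option 1 could look like, assuming the job submission is wrapped in a callable and that a quota rejection surfaces as an exception. QuotaExceededError and submit_with_retry are hypothetical names, not part of Beam or of any Dataflow client; the delays are illustrative.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical error for a job-creation request rejected due to quota."""


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call submit() until it succeeds or max_attempts is exhausted.

    The wait before retry k is min(max_delay, base_delay * 2**k) seconds,
    scaled by a random factor in [0.5, 1.0) so that parallel test forks
    do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return submit()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the quota error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * (0.5 + rng() / 2))
```

The sleep and rng parameters exist only so the behaviour can be checked without real waiting. Retries alone do not raise the ceiling of 60 creations per minute; they only spread requests out.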
>>>>>
>>>>> Please share any insights or ideas you have on this.
>>>>>
>>>>> Regards,
>>>>> --Mikhail
>>>>>
>>>>>
>>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <mig...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> It seems that we hit the quota issue again:
>>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>>
>>>>>> Can someone share how this was triaged last time, or guide me on
>>>>>> possible follow-up actions?
>>>>>>
>>>>>> Regards,
>>>>>> --Mikhail
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <rfern...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>>
>>>>>>> - Scott made changes to the config and we can now run 3
>>>>>>> ValidatesRunner.Dataflow runs in parallel (each run is about 2 hours)
>>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>>> concurrent Dataflow jobs when running those
>>>>>>> - I've been keeping an eye on quota peaks for all resources today
>>>>>>> and have not seen any worrisome limits overall.
>>>>>>> - Also note there are improvements planned to the
>>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>>
>>>>>>> Cheers,
>>>>>>> r
>>>>>>>
>>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <rfern...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Done!
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <sc...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota
>>>>>>>>> [1]. Can you take a look? I've filed [BEAM-4722]:
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <
>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 .
>>>>>>>>>> Quotas should not be a problem; if they are, please file a JIRA
>>>>>>>>>> under gcp-quota.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> r
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <k...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>>> your results. Though if all you are sharing is "they passed" then I
>>>>>>>>>>> guess we don't have to insist on evidence.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <sc...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>>
>>>>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly
>>>>>>>>>>>> requested via "Run Dataflow ValidatesRunner", and only has 8 total
>>>>>>>>>>>> runs. So this job is idle more often than backlogged.
>>>>>>>>>>>>
>>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>>>>> enable concurrent execution of this job and increase our quota
>>>>>>>>>>>> enough to give some breathing room. If we do this, I recommend
>>>>>>>>>>>> limiting the max concurrency via throttleConcurrentBuilds [2] to
>>>>>>>>>>>> some reasonable limit.
>>>>>>>>>>>>
>>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>>> validation of Dataflow runner, and tests a lot of different
>>>>>>>>>>>> aspects of a runner. It would be more efficient to run locally
>>>>>>>>>>>> only the tests affected by your change. Note that this requires
>>>>>>>>>>>> having access to a GCP project with billing, but most Dataflow
>>>>>>>>>>>> developers probably have access to this already. The command for
>>>>>>>>>>>> this is:
>>>>>>>>>>>>
>>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
>>>>>>>>>>>>   -PdataflowProject=myGcpProject \
>>>>>>>>>>>>   -PdataflowTempRoot=gs://myGcsTempRoot \
>>>>>>>>>>>>   --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <lc...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The validates runner test parallelism is controlled here and
>>>>>>>>>>>>> is currently set to be "unlimited":
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>>
>>>>>>>>>>>>> Each test fork is run on a different Gradle worker, so the
>>>>>>>>>>>>> number of parallel test runs is limited to the max number of
>>>>>>>>>>>>> workers configured, which is controlled here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
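The 48 figure can be reproduced with a little arithmetic. This is only a sketch: the 16 cores per Jenkins machine and the 60 create requests per minute are numbers quoted elsewhere in this thread, and the way they are combined (one Dataflow job per test fork, all forks starting within the same minute) is an assumption. The function names are hypothetical.

```python
# Figures quoted in this thread; how they combine is an assumption.
CORES_PER_JENKINS_MACHINE = 16   # "16 CPUs each"
WORKERS_PER_CORE = 3             # "3 * number of CPU cores"
CREATE_REQUESTS_PER_MINUTE = 60  # CreateRequestsPerMinutePerUser default


def max_parallel_jobs(cores, workers_per_core=WORKERS_PER_CORE):
    """Upper bound on parallel test forks, assuming one Dataflow job per fork."""
    return cores * workers_per_core


def suites_within_quota(quota_per_minute, jobs_per_suite):
    """Suites that can start together if each submits all its jobs in one minute."""
    return quota_per_minute // jobs_per_suite


jobs = max_parallel_jobs(CORES_PER_JENKINS_MACHINE)
print(jobs)                                                   # 48
print(suites_within_quota(CREATE_REQUESTS_PER_MINUTE, jobs))  # 1
```

Under these assumptions a single suite at full parallelism already uses most of the per-minute creation budget, so a second suite starting in the same minute would exceed it.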
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <re...@google.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner
>>>>>>>>>>>>>>> tests. We currently allow only one of these to run at a time, to
>>>>>>>>>>>>>>> control usage of Dataflow and of GCE quota. Other types of tests
>>>>>>>>>>>>>>> do not suffer from this issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to see if it's possible to increase Dataflow
>>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8 hours
>>>>>>>>>>>>>>> end to end to run these tests (about 6 hours for the run to be
>>>>>>>>>>>>>>> scheduled). If there was a failure, I would have had to repeat
>>>>>>>>>>>>>>> the whole process. In the worst case, this process could have
>>>>>>>>>>>>>>> taken me days. While this is not as pressing as some other issues
>>>>>>>>>>>>>>> (as most people don't need to run the Dataflow tests on every
>>>>>>>>>>>>>>> PR), fixing it would make such changes much easier to manage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>>>>>>>> rfern...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +Reuven Lax <re...@google.com> told me yesterday that he
>>>>>>>>>>>>>>>> was waiting for some test to be scheduled and run, and it took 6
>>>>>>>>>>>>>>>> hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>>>>> increasing parallelism. I need help understanding the continuous
>>>>>>>>>>>>>>>> minimum of what we use. It seems the following is true:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - There seem to always be 16 Jenkins machines on (16
>>>>>>>>>>>>>>>>    CPUs each)
>>>>>>>>>>>>>>>>    - There seem to be three GKE machines always on (1 CPU
>>>>>>>>>>>>>>>>    each)
>>>>>>>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and
>>>>>>>>>>>>>>>>    seem to run one-at-a-time <-- I think we can safely
>>>>>>>>>>>>>>>>    parallelize this to 20.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent
>>>>>>>>>>>>>>>> unit tests, we still have room for 80 other concurrent Dataflow
>>>>>>>>>>>>>>>> jobs to execute, with 75% of CPU capacity.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>> Got feedback? go/pabloem-feedback
>> <https://goto.google.com/pabloem-feedback>
>>
>
