Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Robert Bradshaw
Yeah, that's useful. I was asking about getting stats at the Jenkins job
level. E.g., are our PostCommits taking up all the time, or our
Precommits?
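
(One rough way to get those job-level numbers, assuming the standard Jenkins
JSON remote API and the usual beam_PreCommit_* / beam_PostCommit_* job naming,
is to sum recent build durations per job-name prefix. This is only a sketch;
the host below is an assumption:)

    # Sketch, not vetted against our setup: aggregate recent build durations
    # per job-name prefix via the standard Jenkins JSON remote API.
    import json
    from collections import defaultdict
    from urllib.request import urlopen

    JENKINS = "https://builds.apache.org"  # assumed Jenkins host

    def fetch(url):
        with urlopen(url) as resp:
            return json.load(resp)

    hours = defaultdict(float)
    for job in fetch(JENKINS + "/api/json?tree=jobs[name,url]")["jobs"]:
        name = job["name"]
        if not name.startswith("beam_"):
            continue
        # "builds" lists the most recent builds; "duration" is in milliseconds.
        builds = fetch(job["url"] + "api/json?tree=builds[duration]")["builds"]
        prefix = "_".join(name.split("_")[:2])  # e.g. beam_PreCommit, beam_PostCommit
        hours[prefix] += sum(b["duration"] for b in builds) / 3.6e6

    for prefix, h in sorted(hours.items(), key=lambda kv: -kv[1]):
        print(f"{prefix:<20} {h:7.1f} executor-hours over recent builds")

Run periodically, something like that would show directly whether PostCommits
or Precommits dominate the executor time.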

On Tue, Sep 24, 2019 at 1:23 PM Lukasz Cwik  wrote:
>
> We can get the per-task Gradle profile with the --profile flag:
> https://jakewharton.com/static/files/trace/profile.html
> This information also appears within the build scans that are sent to Gradle.
>
> Integrating with either of these sources of information would allow us to 
> figure out whether it's new tasks or old tasks that are taking longer.
>
> On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw  wrote:
>>
>> Does anyone know how to gather stats on where the time is being spent?
>> Several times the idea of consolidating many of the (expensive)
>> ValidatesRunner integration tests into a single pipeline, and then
>> running things individually only if that fails, has come up. I think
>> that'd be a big win if indeed this is where our time is being spent.
>>
>> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira  
>> wrote:
>> >
>> > Those ideas all sound good. I especially agree with trying to reduce tests
>> > first; if we've done all we can there and latency is still too high, it
>> > means we need more workers. In addition to reducing the number of tests,
>> > we could also run less important tests less frequently, particularly the
>> > postcommits, since many of those are resource intensive. That would
>> > require people with good context on what our many postcommits are used for.
>> >
>> > Another idea I thought of is trying to move automated tests outside of
>> > peak coding times. Ideally, during the times when we get the greatest
>> > number of PRs (and therefore precommits) we shouldn't have any postcommits
>> > running. If we have both pre- and postcommits going at the same time
>> > during peak hours, our queue times will shoot up even if the total amount
>> > of work doesn't change much.
>> >
>> > Btw, you mentioned that this was a problem last year. Do you have any 
>> > links to discussions about that? It seems like it could be useful.
>> >
>> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin  
>> > wrote:
>> >>
>> >> Hi Daniel,
>> >>
>> >> Generally this looks feasible, since jobs wait for a new worker to become
>> >> available before starting.
>> >>
>> >> Over time we have added more tests and not deprecated enough, which
>> >> increases the load on the workers. I wonder if we can add a metric like
>> >> the total runtime of all running jobs? This would be a safeguard metric
>> >> showing the amount of time we actually spend running jobs. If it
>> >> increases with the same number of workers, that will show we are
>> >> overloading them (the inverse is not necessarily true).
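
(A rough sketch of that safeguard metric, i.e. the total elapsed runtime of
all builds currently occupying an executor, assuming the standard Jenkins
/computer JSON API; the host and the exact field names are assumptions, and
the number would need to be sampled periodically to see the trend:)

    # Sketch only: total elapsed runtime of builds currently occupying executors.
    import json
    import time
    from urllib.request import urlopen

    JENKINS = "https://builds.apache.org"  # assumed Jenkins host
    TREE = "computer[executors[currentExecutable[fullDisplayName,timestamp]]]"

    with urlopen(f"{JENKINS}/computer/api/json?tree={TREE}") as resp:
        computers = json.load(resp)["computer"]

    now_ms = time.time() * 1000
    running = [ex["currentExecutable"]
               for c in computers
               for ex in c.get("executors", [])
               if ex.get("currentExecutable")]
    # "timestamp" is the build's start time in epoch milliseconds.
    hours_in_flight = sum(now_ms - b["timestamp"] for b in running) / 3.6e6
    print(f"{len(running)} builds running, "
          f"{hours_in_flight:.1f} executor-hours in flight")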
>> >>
>> >> On addressing this, we can review the approaches we took last year and
>> >> see if any of them apply. Brainstorming, the following ideas come to
>> >> mind: add more workers, reduce the number of tests, do a better job of
>> >> filtering out irrelevant tests, and cancel irrelevant jobs (i.e. cancel
>> >> tests if the linter fails) and/or add an option for cancelling irrelevant
>> >> jobs. One more big item could be an effort on deflaking, but we seem to
>> >> be decent in this area.
>> >>
>> >> Regards,
>> >> Mikhail.
>> >>
>> >>
>> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira  
>> >> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> A little while ago I was taking a look at the Precommit Latency metrics
>> >>> on Grafana (link) and saw that the monthly 90th percentile has been
>> >>> rising steadily over the past few months, from around 10 minutes to
>> >>> around 30 minutes currently.
>> >>>
>> >>> After doing some light digging I was shown this page (beam load
>> >>> statistics), which seems to imply that queue times shoot up when all the
>> >>> test executors are occupied, and that this has been happening for longer
>> >>> and more often recently. I also took a look at the commit history for
>> >>> our Jenkins tests and saw that new tests have steadily been added.
>> >>>
>> >>> I wanted to bring this up with the dev@ to ask:
>> >>>
>> >>> 1. Is this accurate? Can anyone provide insight into the metrics? Does 
>> >>> anyone know how to double check my assumptions with more concrete 
>> >>> metrics?
>> >>>
>> >>> 2. Does anyone have ideas on how to address this?
>> >>>
>> >>> Thanks,
>> >>> Daniel Oliveira


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Lukasz Cwik
We can get the per-task Gradle profile with the --profile flag:
https://jakewharton.com/static/files/trace/profile.html
This information also appears within the build scans that are sent to
Gradle.

Integrating with either of these sources of information would allow us to
figure out whether it's new tasks or old tasks that are taking longer.
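
For example (just a sketch, with hypothetical task names and timings): once
per-task wall-clock times have been extracted from two runs, whether from the
--profile report Gradle writes under build/reports/profile/ or exported from
a build scan, the attribution itself is a simple diff:

    # Sketch: split a slowdown between "new tasks" and "existing tasks that got
    # slower", given per-task durations (seconds) from an older and a newer run.
    # Extracting the durations from the --profile report or a build scan is left
    # out; the task names and numbers below are purely illustrative.
    from typing import Dict, Tuple

    def attribute_slowdown(old: Dict[str, float],
                           new: Dict[str, float]) -> Tuple[float, float]:
        added = sum(t for task, t in new.items() if task not in old)
        regressed = sum(new[task] - old[task] for task in new
                        if task in old and new[task] > old[task])
        return added, regressed

    old_run = {":sdks:java:core:test": 300.0, ":runners:direct-java:test": 120.0}
    new_run = {":sdks:java:core:test": 420.0, ":runners:direct-java:test": 125.0,
               ":sdks:java:io:some-new-io:test": 200.0}
    added, regressed = attribute_slowdown(old_run, new_run)
    print(f"time from new tasks: +{added:.0f}s; "
          f"regressions in existing tasks: +{regressed:.0f}s")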

On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw 
wrote:

> Does anyone know how to gather stats on where the time is being spent?
> Several times the idea of consolidating many of the (expensive)
> ValidatesRunner integration tests into a single pipeline, and then
> running things individually only if that fails, has come up. I think
> that'd be a big win if indeed this is where our time is being spent.
>
> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira 
> wrote:
> >
> > Those ideas all sound good. I especially agree with trying to reduce
> tests first; if we've done all we can there and latency is still too high,
> it means we need more workers. In addition to reducing the number of tests,
> we could also run less important tests less frequently, particularly the
> postcommits, since many of those are resource intensive. That would require
> people with good context on what our many postcommits are used for.
> >
> > Another idea I thought of is trying to move automated tests outside of
> peak coding times. Ideally, during the times when we get the greatest number
> of PRs (and therefore precommits) we shouldn't have any postcommits running.
> If we have both pre- and postcommits going at the same time during peak
> hours, our queue times will shoot up even if the total amount of work
> doesn't change much.
> >
> > Btw, you mentioned that this was a problem last year. Do you have any
> links to discussions about that? It seems like it could be useful.
> >
> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin 
> wrote:
> >>
> >> Hi Daniel,
> >>
> >> Generally this looks feasible, since jobs wait for a new worker to become
> available before starting.
> >>
> >> Over time we have added more tests and not deprecated enough, which
> increases the load on the workers. I wonder if we can add a metric like the
> total runtime of all running jobs? This would be a safeguard metric showing
> the amount of time we actually spend running jobs. If it increases with the
> same number of workers, that will show we are overloading them (the inverse
> is not necessarily true).
> >>
> >> On addressing this, we can review the approaches we took last year and
> see if any of them apply. Brainstorming, the following ideas come to mind:
> add more workers, reduce the number of tests, do a better job of filtering
> out irrelevant tests, and cancel irrelevant jobs (i.e. cancel tests if the
> linter fails) and/or add an option for cancelling irrelevant jobs. One more
> big item could be an effort on deflaking, but we seem to be decent in this
> area.
> >>
> >> Regards,
> >> Mikhail.
> >>
> >>
> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <
> danolive...@google.com> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> A little while ago I was taking a look at the Precommit Latency metrics
> on Grafana (link) and saw that the monthly 90th percentile has been rising
> steadily over the past few months, from around 10 minutes to around 30
> minutes currently.
> >>>
> >>> After doing some light digging I was shown this page (beam load
> statistics), which seems to imply that queue times shoot up when all the
> test executors are occupied, and that this has been happening for longer and
> more often recently. I also took a look at the commit history for our
> Jenkins tests and saw that new tests have steadily been added.
> >>>
> >>> I wanted to bring this up with the dev@ to ask:
> >>>
> >>> 1. Is this accurate? Can anyone provide insight into the metrics? Does
> anyone know how to double check my assumptions with more concrete metrics?
> >>>
> >>> 2. Does anyone have ideas on how to address this?
> >>>
> >>> Thanks,
> >>> Daniel Oliveira
>


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Robert Bradshaw
Does anyone know how to gather stats on where the time is being spent?
Several times the idea of consolidating many of the (expensive)
ValidatesRunner integration tests into a single pipeline, and then
running things individually only if that fails, has come up. I think
that'd be a big win if indeed this is where our time is being spent.

On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira  wrote:
>
> Those ideas all sound good. I especially agree with trying to reduce tests
> first; if we've done all we can there and latency is still too high, it means
> we need more workers. In addition to reducing the number of tests, we could
> also run less important tests less frequently, particularly the postcommits,
> since many of those are resource intensive. That would require people with
> good context on what our many postcommits are used for.
>
> Another idea I thought of is trying to move automated tests outside of peak
> coding times. Ideally, during the times when we get the greatest number of
> PRs (and therefore precommits) we shouldn't have any postcommits running. If
> we have both pre- and postcommits going at the same time during peak hours,
> our queue times will shoot up even if the total amount of work doesn't
> change much.
>
> Btw, you mentioned that this was a problem last year. Do you have any links 
> to discussions about that? It seems like it could be useful.
>
> On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin  wrote:
>>
>> Hi Daniel,
>>
>> Generally this looks feasible, since jobs wait for a new worker to become
>> available before starting.
>>
>> Over time we have added more tests and not deprecated enough, which
>> increases the load on the workers. I wonder if we can add a metric like the
>> total runtime of all running jobs? This would be a safeguard metric showing
>> the amount of time we actually spend running jobs. If it increases with the
>> same number of workers, that will show we are overloading them (the inverse
>> is not necessarily true).
>>
>> On addressing this, we can review the approaches we took last year and see
>> if any of them apply. Brainstorming, the following ideas come to mind: add
>> more workers, reduce the number of tests, do a better job of filtering out
>> irrelevant tests, and cancel irrelevant jobs (i.e. cancel tests if the
>> linter fails) and/or add an option for cancelling irrelevant jobs. One more
>> big item could be an effort on deflaking, but we seem to be decent in this
>> area.
>>
>> Regards,
>> Mikhail.
>>
>>
>> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira  
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> A little while ago I was taking a look at the Precommit Latency metrics on
>>> Grafana (link) and saw that the monthly 90th percentile has been rising
>>> steadily over the past few months, from around 10 minutes to around 30
>>> minutes currently.
>>>
>>> After doing some light digging I was shown this page (beam load
>>> statistics), which seems to imply that queue times shoot up when all the
>>> test executors are occupied, and that this has been happening for longer
>>> and more often recently. I also took a look at the commit history for our
>>> Jenkins tests and saw that new tests have steadily been added.
>>>
>>> I wanted to bring this up with the dev@ to ask:
>>>
>>> 1. Is this accurate? Can anyone provide insight into the metrics? Does 
>>> anyone know how to double check my assumptions with more concrete metrics?
>>>
>>> 2. Does anyone have ideas on how to address this?
>>>
>>> Thanks,
>>> Daniel Oliveira


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Daniel Oliveira
Those ideas all sound good. I especially agree with trying to reduce tests
first; if we've done all we can there and latency is still too high, it means
we need more workers. In addition to reducing the number of tests, we could
also run less important tests less frequently, particularly the postcommits,
since many of those are resource intensive. That would require people with
good context on what our many postcommits are used for.

Another idea I thought of is trying to move automated tests outside of peak
coding times. Ideally, during the times when we get the greatest number of
PRs (and therefore precommits) we shouldn't have any postcommits running. If
we have both pre- and postcommits going at the same time during peak hours,
our queue times will shoot up even if the total amount of work doesn't change
much.

Btw, you mentioned that this was a problem last year. Do you have any links
to discussions about that? It seems like it could be useful.

On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin  wrote:

> Hi Daniel,
>
> Generally this looks feasible, since jobs wait for a new worker to become
> available before starting.
>
> Over time we have added more tests and not deprecated enough, which
> increases the load on the workers. I wonder if we can add a metric like the
> total runtime of all running jobs? This would be a safeguard metric showing
> the amount of time we actually spend running jobs. If it increases with the
> same number of workers, that will show we are overloading them (the inverse
> is not necessarily true).
>
> On addressing this, we can review the approaches we took last year and see
> if any of them apply. Brainstorming, the following ideas come to mind: add
> more workers, reduce the number of tests, do a better job of filtering out
> irrelevant tests, and cancel irrelevant jobs (i.e. cancel tests if the
> linter fails) and/or add an option for cancelling irrelevant jobs. One more
> big item could be an effort on deflaking, but we seem to be decent in this
> area.
>
> Regards,
> Mikhail.
>
>
> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira 
> wrote:
>
>> Hi everyone,
>>
>> A little while ago I was taking a look at the Precommit Latency metrics
>> on Grafana (link) and saw that the monthly 90th percentile has been rising
>> steadily over the past few months, from around 10 minutes to around 30
>> minutes currently.
>>
>> After doing some light digging I was shown this page (beam load
>> statistics), which seems to imply that queue times shoot up when all the
>> test executors are occupied, and that this has been happening for longer
>> and more often recently. I also took a look at the commit history for our
>> Jenkins tests and saw that new tests have steadily been added.
>>
>> I wanted to bring this up with the dev@ to ask:
>>
>> 1. Is this accurate? Can anyone provide insight into the metrics? Does
>> anyone know how to double check my assumptions with more concrete metrics?
>>
>> 2. Does anyone have ideas on how to address this?
>>
>> Thanks,
>> Daniel Oliveira
>>
>