Hi Daniel,

Generally this looks plausible, since jobs wait for a new worker to become
available before they start.

Over time we have added more tests and have not deprecated enough of them,
which increases the load on the workers. I wonder if we can add something
like the total runtime of all running jobs? That would be a safeguard
metric showing how much time we actually spend running jobs. If it
increases while the number of workers stays the same, that would prove we
are overloading them (though the inverse is not necessarily true). A rough
sketch of how such a metric could be collected is below.
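
This is only a sketch, assuming we poll the public Jenkins JSON API;
JENKINS_URL, BEAM_PREFIX, and the polling approach are illustrative
assumptions, not existing tooling:

    # Sum the elapsed runtime of all currently running Beam builds by
    # polling the Jenkins JSON API. Emitting this value periodically would
    # give the safeguard metric described above.
    import time
    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed endpoint
    BEAM_PREFIX = "beam_"                      # assumed job-name filter

    def total_running_seconds():
        resp = requests.get(
            f"{JENKINS_URL}/api/json",
            # Jenkins "tree" queries limit the response to the fields we
            # need; {0,20} caps how many recent builds we inspect per job.
            params={"tree": "jobs[name,builds[building,timestamp]{0,20}]"},
            timeout=30,
        )
        resp.raise_for_status()
        now_ms = time.time() * 1000
        total_ms = 0
        for job in resp.json().get("jobs", []):
            if not job["name"].startswith(BEAM_PREFIX):
                continue
            for build in job.get("builds", []):
                if build.get("building"):
                    # Jenkins reports build start time in ms since epoch.
                    total_ms += now_ms - build["timestamp"]
        return total_ms / 1000.0

    if __name__ == "__main__":
        print(f"Total runtime of running jobs: {total_running_seconds():.0f}s")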

On addressing this, we can review the approaches we took last year and see
if any of them still apply. Brainstorming a little, the following ideas
come to mind: add more workers, reduce the number of tests, do a better job
of filtering out irrelevant tests, cancel irrelevant jobs automatically
(e.g. cancel tests if the linter fails, as sketched below), and/or add an
option for cancelling irrelevant jobs manually. One more big item could be
an effort on deflaking, but we seem to be decent in this area.
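
To make the cancellation idea concrete, here is a rough sketch, again
against the Jenkins REST API. The job names and the ghprbPullId parameter
(from the GitHub PR builder plugin) are assumptions about our setup; the
/stop endpoint is the standard Jenkins build-abort call, though it may also
need a CSRF crumb depending on server config:

    # Abort still-running precommit builds for a pull request once its
    # lint job has failed, freeing the executors for other PRs.
    import requests

    JENKINS_URL = "https://builds.apache.org"  # assumed endpoint
    TEST_JOBS = ["beam_PreCommit_Java", "beam_PreCommit_Python"]  # hypothetical

    def cancel_running_builds(job_name, pr_number, auth):
        resp = requests.get(
            f"{JENKINS_URL}/job/{job_name}/api/json",
            params={"tree": "builds[number,building,"
                            "actions[parameters[name,value]]]"},
            timeout=30,
        )
        resp.raise_for_status()
        for build in resp.json().get("builds", []):
            if not build.get("building"):
                continue
            # Collect the build parameters so we can match on the PR number.
            params = {
                p.get("name"): p.get("value")
                for action in build.get("actions", [])
                for p in action.get("parameters", [])
            }
            if params.get("ghprbPullId") == str(pr_number):
                requests.post(
                    f"{JENKINS_URL}/job/{job_name}/{build['number']}/stop",
                    auth=auth,
                    timeout=30,
                ).raise_for_status()

    # Usage, after the lint job reports failure for PR #1234:
    #   for job in TEST_JOBS:
    #       cancel_running_builds(job, 1234, auth=("user", "api-token"))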

Regards,
Mikhail.


On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <[email protected]>
wrote:

> Hi everyone,
>
> A little while ago I was taking a look at the Precommit Latency metrics on
> Grafana (link
> <http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=now-90d&to=now>)
> and saw that the monthly 90th percentile has been steadily increasing over
> the past few months, from around 10 minutes to currently around 30 minutes.
>
> After doing some light digging I was shown this page (beam load statistics
> <https://builds.apache.org/label/beam/load-statistics?type=min>), which
> seems to imply that queue times shoot up when all the test executors are
> occupied, and that this has been happening for longer and more often
> recently. I also took a look at the commit history for our Jenkins tests
> <https://github.com/apache/beam/commits/master?after=864e2e0cac88ee317ca600dafe31ec4f527d5d5f+34&path%5B%5D=.test-infra&path%5B%5D=jenkins>
> and I see that new tests have steadily been added.
>
> I wanted to bring this up with the dev@ to ask:
>
> 1. Is this accurate? Can anyone provide insight into the metrics? Does
> anyone know how to double-check my assumptions with more concrete metrics?
>
> 2. Does anyone have ideas on how to address this?
>
> Thanks,
> Daniel Oliveira
>
