On Oct 9, 2017 09:49, "Mike Drob" <[email protected]> wrote:

Addressing your individual suggestions inline.

Another one that you missed (more long term) is splitting up the server
module into smaller modules. We have some work on this already (backup,
mapreduce), but there's still a long way to go...


That's fair. My unstated prerequisite is that I want something we can
implement like this week. ;)




On Mon, Oct 9, 2017 at 9:38 AM, Sean Busbey <[email protected]> wrote:

> Hi folks!
>
> Lately our precommit runs have had a large amount of noise around unit
> test failures due to timeouts, especially for the hbase-server module.
>
> I'd really like to get us back to a place where a precommit -1 doesn't
> just result in a reflexive "precommit is unreliable."
>
> When the hbase-server module is going to be run (which would include
> changes to that module and changes to the top-level of the project), I
> can think of a few ways to bring the noise down:
>
> * Do fewer parallel executions. We run 5 tests at once now, and the
> hbase-server module takes ~1.5 hours. We could tune down just the
> hbase-server module to run fewer.
>

1.5 hours is already past the threshold where I have to go do something
else while I wait for the tests to finish. Bumping this up to 3 hours
wouldn't affect my productivity, I don't think.



That's both good and sad to hear. I think anything that brings the time
down is going to have to involve more hardware.
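
For concreteness, the knob for that first option is surefire's fork
count. A per-module override would look something like the sketch below;
the value 2 is a placeholder, and in our poms the count is driven by
properties, so the exact incantation may differ:

    <!-- hbase-server/pom.xml (sketch only; check our real property names) -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- drop from the current 5 forked JVMs to 2 for this module -->
        <forkCount>2</forkCount>
      </configuration>
    </plugin>

For a one-off trial run the same thing works from the command line:
mvn test -pl hbase-server -DforkCount=2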




> * Do more test re-runs. We could let tests that fail retry more times.
> I think maybe we currently allow a single retry via surefire. We'd have
> to do it outside of surefire to account for the large number of
> time-out failures.
>

I like the idea of more retries, but I don't like going outside of
surefire. I don't want us maintaining more custom hacks and shims for
something that should be temporary; once we get the tests stabilized we
shouldn't need it, right?



I don't think this is true. Even if our current problems aren't just
"need more hardware", historically that's been where we hit bedrock. So
long as we don't have a source of steadily increasing hardware, either
our test time or our test stability has to pay the difference.

Unless there are surefire configs for retrying on timeout. If those
exist, then maybe a shorter timeout and more retries would get us what we
need at our current level of test parallelism.
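
For what it's worth, surefire does have a flaky-test rerun knob as of
2.18 (rerunFailingTestsCount), so something like this sketch might be
enough; I'm assuming our surefire and JUnit 4 versions are new enough,
and the retry count is a placeholder:

    <!-- pom.xml (sketch only; versions/values are placeholders) -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- rerun each failed test up to 2 more times before reporting it -->
        <rerunFailingTestsCount>2</rerunFailingTestsCount>
      </configuration>
    </plugin>

The catch is that the timeout has to surface as an ordinary JUnit
failure (e.g. a Timeout rule inside the test) for the rerun to see it; a
fork killed by surefire's forkedProcessTimeoutInSeconds doesn't get
rerun, which is exactly the "outside of surefire" problem above.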
