That's a lovely report, Busbey. Let me see if I can get a rough answer to your question on minicluster cores.
S

On Wed, Oct 11, 2017 at 6:43 AM, Sean Busbey <[email protected]> wrote:

> Currently our precommit build has a history of ~233 builds.
>
> Looking across[1] those builds for ones with unit test logs, and treating
> the string "timeout" as an indicator that things failed because of a
> timeout rather than a known bad answer, we have 80 builds that had one
> or more test timeout.
>
> Breaking this down by host:
>
> | Host | % timeout | Success | Timeout Failure | General Failure |
> | ---- | ---------:| -------:| ---------------:| ---------------:|
> | H0   |       42% |      10 |              15 |              11 |
> | H1   |       54% |       6 |              14 |               6 |
> | H2   |       45% |      18 |              35 |              24 |
> | H3   |      100% |       0 |               1 |               0 |
> | H4   |        0% |       1 |               0 |               2 |
> | H5   |       20% |       1 |               1 |               3 |
> | H6   |       44% |       4 |               4 |               1 |
> | H9   |       35% |       2 |               7 |              11 |
> | H10  |       26% |       4 |               8 |              19 |
> | H11  |        0% |       0 |               0 |               2 |
> | H12  |       43% |       1 |               3 |               3 |
> | H13  |       22% |       1 |               2 |               6 |
> | H26  |        0% |       0 |               0 |               1 |
>
> It's odd that we so strongly favor H2. But I don't see evidence that
> we have a bad host that we could just exclude.
>
> Scaling our concurrency by the number of CPU cores is something Surefire
> can do. Let me see what the H* hosts look like to figure out some
> example mappings. Do we have a rough bound on how many cores a single
> test using MiniCluster should need? 3?
>
> -busbey
>
> [1]: By "looking across" I mean using the python-jenkins library:
> https://gist.github.com/busbey/ff5f7ae3a292164cc110fdb934935c8c
>
> On Mon, Oct 9, 2017 at 4:40 PM, Stack <[email protected]> wrote:
> > On Mon, Oct 9, 2017 at 7:38 AM, Sean Busbey <[email protected]> wrote:
> >
> >> Hi folks!
> >>
> >> Lately our precommit runs have had a large amount of noise around unit
> >> test failures due to timeout, especially for the hbase-server module.
> >
> > I've not looked at why the timeouts. Anyone? Usually there is a cause.
> >
> > ...
> >
> >> I'd really like to get us back to a place where a precommit -1 doesn't
> >> just result in a reflexive "precommit is unreliable."
> >
> > This is the default. The exception is when one of us works on stabilizing
> > the test suite. It takes a while and a bunch of effort, but stabilization
> > has been doable in the past. Once stable, it stays that way a while
> > before the rot sets in.
> >
> >> * Do fewer parallel executions. We do 5 tests at once now and the
> >> hbase-server module takes ~1.5 hours. We could tune down just the
> >> hbase-server module to do fewer.
> >
> > Is it the loading that is the issue, or tests stamping on each other? If
> > the latter, I'd think we'd want to fix it. If the former, we'd want to
> > look at it too; I'd think our tests shouldn't be such that they fall
> > over if the context is other than 'perfect'.
> >
> > I've not looked at a machine while five concurrent hbase tests are
> > running. Is it even putting up a load? Over the extent of the full test
> > suite? Or is it just a few tests that, when run together, cause issues?
> > Could we stagger these, give them their own category, or have them burn
> > less brightly?
> >
> > If tests are failing because of contention for resources, we should fix
> > the test. If given a machine, we should burn it up rather than
> > pussy-foot it, I'd say (can we size the concurrency off a query of the
> > underlying OS so we step by CPUs, say?).
> >
> > Tests could do with an edit. Generally, tests are written once and then
> > never touched again. Meantime the system evolves. An edit could look for
> > redundancy.
> > An edit could look for cases where we start clusters (time-consuming)
> > when we don't have to, and use mocks or start standalone instances
> > instead. We also have some crazy tests that spin up lots of clusters
> > all inside a single JVM, though the context is the same as that of a
> > simple method evaluation.
> >
> > St.Ack
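For reference, the per-host tally Sean describes above can be gathered with a short python-jenkins script along the lines of the linked gist. The sketch below is only an approximation of that approach: the Jenkins URL and job name are placeholders, and it greps the console output for "timeout" where the real scan looks at the unit test logs.

```python
# Rough sketch of the per-host timeout tally; URL and job name are placeholders.
from collections import Counter
import jenkins

JENKINS_URL = "https://builds.apache.org"   # placeholder, not the actual master
JOB_NAME = "PreCommit-HBASE-Build"          # placeholder job name

server = jenkins.Jenkins(JENKINS_URL)
job = server.get_job_info(JOB_NAME, fetch_all_builds=True)

success, timeout_failure, general_failure = Counter(), Counter(), Counter()

for build in job["builds"]:
    info = server.get_build_info(JOB_NAME, build["number"])
    if info["building"]:
        continue                              # skip in-progress builds
    host = info.get("builtOn") or "unknown"   # node the build ran on
    if info["result"] == "SUCCESS":
        success[host] += 1
    else:
        # Crude stand-in for checking the unit test logs: treat "timeout"
        # anywhere in the console output as a timeout-driven failure.
        console = server.get_build_console_output(JOB_NAME, build["number"])
        if "timeout" in console.lower():
            timeout_failure[host] += 1
        else:
            general_failure[host] += 1

for host in sorted(set(success) | set(timeout_failure) | set(general_failure)):
    total = success[host] + timeout_failure[host] + general_failure[host]
    pct = round(100 * timeout_failure[host] / max(1, total))
    print(host, f"{pct}%", success[host], timeout_failure[host], general_failure[host])
```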
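On sizing concurrency off the underlying OS: Surefire's forkCount accepts a value suffixed with "C", which it multiplies by the number of available cores. A minimal sketch of what that might look like for the hbase-server module follows; the 0.5C figure (one fork per two cores) is only an illustrative value, not a recommendation, and the fork settings the HBase poms actually use may differ.

```xml
<!-- Illustrative Surefire configuration only; the forkCount value and the
     reuseForks setting here are example choices, not what HBase ships with. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- "0.5C" means half a forked JVM per detected CPU core, so a 16-core
         host would run up to 8 test JVMs in parallel. -->
    <forkCount>0.5C</forkCount>
    <reuseForks>false</reuseForks>
  </configuration>
</plugin>
```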
