Is anyone using the parallel test runner? I did another test of it today
and it triggered 278 failing tests. I noticed a lot of timeouts so I tried
bumping the default wait time from 15 seconds to 120 seconds. That brought
it down to 43 failures.

Taking a look at the remaining failures, it seems the runner is starting
too many parallel jobs on my system (it has 12 cores and 24 hyperthreads,
although I see 48 entries in /proc/cpuinfo):

[----------] 1 test from DiskQuotaTest
[ RUN      ] DiskQuotaTest.SlaveRecovery
/home/bmahler/git/mesos/build/src/mesos-containerizer: fork: retry:
Resource temporarily unavailable
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
../../src/tests/disk_quota_tests.cpp:666: Failure
Value of: status->state()
  Actual: TASK_FAILED
Expected: TASK_RUNNING
../../src/tests/disk_quota_tests.cpp:671: Failure
Value of: containers->size()
  Actual: 0
Expected: 1u
Which is: 1
[  FAILED  ] DiskQuotaTest.SlaveRecovery (1636 ms)
[----------] 1 test from DiskQuotaTest (1638 ms total)

[----------] 1 test from FetcherTest
[ RUN      ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries
../../src/tests/fetcher_tests.cpp:911: Failure
(fetch).failure(): Failed to execute mesos-fetcher: Failed to clone:
Resource temporarily unavailable
[  FAILED  ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries (8 ms)
[----------] 1 test from FetcherTest (8 ms total)

It seems we should constrain how wide it goes, as well as restrict the
number of worker threads libprocess uses in each instance.
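
As a rough sketch of the kind of cap meant here (this is not the runner's
actual logic): assume the default width is `#cores*1.5` computed from the
logical CPU count, and assume each test instance's libprocess worker pool
can be sized through the `LIBPROCESS_NUM_WORKER_THREADS` environment
variable. `plan_parallelism` and `default_job_count` are hypothetical
helper names used only for illustration.

```python
import os


def default_job_count(logical_cpus):
    # Width described in the announcement: `#cores*1.5` parallel jobs.
    # Assumption: "#cores" means the logical CPU count seen by the OS.
    return int(logical_cpus * 1.5)


def plan_parallelism(logical_cpus, physical_cores=None, thread_budget=None):
    """Return (jobs, worker_threads_per_job) for a capped test run.

    Hypothetical policy: one job per physical core when that number is
    known (otherwise one per logical CPU), with a total worker-thread
    budget split evenly across the jobs.
    """
    jobs = max(1, physical_cores or logical_cpus)
    budget = thread_budget or logical_cpus
    workers = max(1, budget // jobs)
    return jobs, workers


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    jobs, workers = plan_parallelism(cpus)
    print(f"default width: {default_job_count(cpus)} jobs")
    # Assumption: libprocess honors LIBPROCESS_NUM_WORKER_THREADS.
    print(f"suggested: -j {jobs}, LIBPROCESS_NUM_WORKER_THREADS={workers}")
```

With the 48 entries from /proc/cpuinfo, the default under these assumptions
would be 72 jobs, while `plan_parallelism(48, physical_cores=12)` suggests
12 jobs with 4 worker threads each. The job count could then be passed to
the runner via its `-j` flag.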

On Thu, Oct 13, 2016 at 3:51 PM, Michael Park <mp...@apache.org> wrote:

> Thanks for pushing this through Benjamin!
>
> I understand if you're unable to attend the community sync on the 20th,
> but would you be able to present this as a demo somehow? maybe via a
> screencast?
>
> MPark
>
> On Thu, Oct 13, 2016 at 6:33 PM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
> > Great to see this Benjamin!
> >
> > Looking forward to seeing the parallel test runner turn green, I'll help
> > file tickets under the epic (I see there are a lot of test failures for
> > me).
> >
> > Once we clear the issues and turn it green, shall we make this the
> > default? I would be in favor of that.
> >
> > Ben
> >
> > On Thu, Oct 13, 2016 at 2:28 PM, Benjamin Bannier <
> > benjamin.bann...@mesosphere.io> wrote:
> >
> > >
> > > Hi,
> > >
> > > Since most tests in the Mesos, libprocess, and stout test suites can
> > > be executed in parallel (the exception being some `ROOT` tests with
> > > global side effects in Mesos), we recently added a parallel test
> > > runner `support/mesos-gtest-runner.py`. This can potentially speed
> > > up test suite runs significantly.
> > >
> > > To enable automatic parallel execution of tests for test targets
> > > executed during `make check`, configure Mesos with the option
> > > `--enable-parallel-test-execution`. This will configure the test
> > > runner to run all tests but the `ROOT` tests in parallel; `ROOT`
> > > tests will be run in a separate, sequential step.
> > >
> > > * * *
> > >
> > > We use the environment variable `TEST_DRIVER` to drive parallel test
> > > execution. By setting this variable to an empty string you can
> > > temporarily disable configured parallel execution, e.g.,
> > >
> > >     % make check TEST_DRIVER=
> > >
> > > By setting this environment variable you have control over the test
> > > runner itself and its arguments, even without enabling parallel test
> > > execution at `./configure` time. Be aware that many `ROOT` tests
> > > cannot be run in parallel.
> > >
> > >
> > > The current settings oversubscribe the machine by running `#cores*1.5`
> > > parallel jobs. This was driven by the observation that currently our
> > > tests by and large do not make extended use of even a single core.
> > > > The number of parallel jobs can be controlled with the `-j` flag of
> > > the test runner.
> > >
> > > Since making more use of the machine will likely increase machine load
> > > during test execution, running tests in parallel might expose test
> > > > flakiness. Tests might also fail to run in parallel if test cases, e.g.,
> > > write data to hardcoded locations or use hardcoded ports. Please file
> > > JIRA tickets for such tests if they do not yet exist.
> > >
> > >
> > > There is still some work needed to improve reporting from parallel
> > > tests. We currently use a very silent mode if tests are running
> > > without failures, and just report the logs of failed jobs in case of
> > > failure. MESOS-6387 sketches out possible future improvements in this
> > > area.
> > >
> > >
> > > Happy testing,
> > >
> > > Benjamin with help from Kevin & Till
> > >
> > >
> >
>
