I think one thing contributing to this a lot is the general issue of the
tests consuming a large number of file descriptors (10k+ if I run them
on a standard Debian machine).
There are a few suites that contribute to this in particular, like
`org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few others,
appears to consume a lot of fds.
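
For what it's worth, a quick way to see which suites are responsible is to
log the JVM's open-fd count around each test -- a rough sketch (assuming a
HotSpot/OpenJDK JVM on Linux; `openFdCount` is just an illustrative helper,
not something in the Spark tree):

    import java.lang.management.ManagementFactory
    import com.sun.management.UnixOperatingSystemMXBean

    // Rough sketch: report the test JVM's open file descriptor count.
    // Assumes a HotSpot/OpenJDK JVM on Linux, where the OS MXBean is a
    // UnixOperatingSystemMXBean.
    def openFdCount(): Long =
      ManagementFactory.getOperatingSystemMXBean match {
        case unix: UnixOperatingSystemMXBean => unix.getOpenFileDescriptorCount
        case _ => -1L  // not available on this platform
      }

    // e.g. call this from a suite's beforeEach()/afterEach() to see
    // which tests leak descriptors:
    println(s"open fds: ${openFdCount()}")

Comparing the count before and after a suite like
ExecutorAllocationManagerSuite makes it much easier to tell which tests are
actually leaking.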

Wouldn't it make sense to open JIRAs about those and actively try to reduce
the resource consumption of these tests?
It seems to me this can cause a lot of unpredictable behavior (making the
cause of flaky tests hard to identify, especially when timeouts etc. are
involved), and it makes it prohibitively expensive for many people to test
locally, imo.

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <sxk1...@hotmail.com>
wrote:

> I was working on something to address this a while ago:
> https://issues.apache.org/jira/browse/SPARK-9487. The difficulty of
> testing locally made things a lot more complicated to fix for each of the
> unit tests. Should we resurface this JIRA again? I would wholeheartedly
> agree with the flakiness assessment of the unit tests.
>
>
>
> ------------------------------
> *From:* Kay Ousterhout <kayousterh...@gmail.com>
> *Sent:* Wednesday, February 15, 2017 12:10 PM
> *To:* dev@spark.apache.org
> *Subject:* File JIRAs for all flaky test failures
>
> Hi all,
>
> I've noticed the Spark tests getting increasingly flaky -- it seems more
> common than not now that the tests need to be re-run at least once on PRs
> before they pass.  This is both annoying and problematic because it makes
> it harder to tell when a PR is introducing new flakiness.
>
> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
> fails on a PR (for a reason unrelated to the PR).  Just provide a quick
> description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
> "Tests failed because 250m timeout expired", a link to the failed build,
> and include the "Tests" component.  If there's already a JIRA for the
> issue, just comment with a link to the latest failure.  I know folks don't
> always have time to track down why a test failed, but this is at least
> helpful to someone else who, later on, is trying to diagnose when the issue
> started to find the problematic code / test.
>
> If this seems like too high overhead, feel free to suggest alternative
> ways to make the tests less flaky!
>
> -Kay
>
