I think one thing contributing to this a lot is the general issue of the tests consuming a large number of file descriptors (10k+ if I run them on a standard Debian machine). A few suites contribute to this in particular, such as `org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few others, appears to consume a lot of fds.
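For reference, one way to reproduce this measurement locally is to read the process's fd table under `/proc` (Linux-only). The sketch below uses the shell's own pid as a stand-in; to measure the forked test JVM you would substitute its pid, e.g. as reported by `jps -l`:

```shell
# Count the file descriptors a process currently holds open (Linux /proc).
# $$ (this shell's pid) is a stand-in here; swap in the test JVM's pid.
pid=$$
fd_count=$(ls /proc/"$pid"/fd | wc -l)
echo "process $pid has $fd_count open file descriptors"

# Compare against the per-process limits; raising the soft limit toward the
# hard limit is a local workaround, not a fix for a leaky suite.
echo "soft limit: $(ulimit -Sn), hard limit: $(ulimit -Hn)"
```

If the count climbs steadily toward the soft limit while a single suite runs, that suite is a good candidate for a resource-consumption JIRA.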
Wouldn't it make sense to open JIRAs about those and actively try to reduce the resource consumption of these tests? It seems to me these can cause a lot of unpredictable behavior (making the cause of flaky tests hard to identify, especially when timeouts etc. are involved), and they make it prohibitively expensive for many people to test locally, imo.

On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <[email protected]> wrote:

> I was working on something to address this a while ago
> https://issues.apache.org/jira/browse/SPARK-9487 but the difficulty in
> testing locally made things a lot more complicated to fix for each of the
> unit tests. Should we resurface this JIRA again? I would wholeheartedly
> agree with the flakiness assessment of the unit tests.
> [SPARK-9487] Use the same num. worker threads in Scala ...
> <https://issues.apache.org/jira/browse/SPARK-9487>
> issues.apache.org
> In Python we use `local[4]` for unit tests, while in Scala/Java we use
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other
> components. If the ...
>
> ------------------------------
> *From:* Kay Ousterhout <[email protected]>
> *Sent:* Wednesday, February 15, 2017 12:10 PM
> *To:* [email protected]
> *Subject:* File JIRAs for all flaky test failures
>
> Hi all,
>
> I've noticed the Spark tests getting increasingly flaky -- it now seems
> more common than not that the tests need to be re-run at least once on PRs
> before they pass. This is both annoying and problematic, because it makes
> it harder to tell when a PR is introducing new flakiness.
>
> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
> fails on a PR (for a reason unrelated to the PR). Just provide a quick
> description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
> "Tests failed because 250m timeout expired" -- a link to the failed build,
> and include the "Tests" component. If there's already a JIRA for the
> issue, just comment with a link to the latest failure.
> I know folks don't always have time to track down why a test failed, but
> this is at least helpful to someone else who, later on, is trying to
> diagnose when the issue started and find the problematic code / test.
>
> If this seems like too high overhead, feel free to suggest alternative
> ways to make the tests less flaky!
>
> -Kay
