Hey Ryan,

The existing JIRA also covers publishing nightly docs:
https://issues.apache.org/jira/browse/SPARK-1517

- Patrick
On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:

> Thanks Nicholas, glad to hear that some of this info will be pushed to the main site soon, but this brings up yet another point of confusion that I've struggled with, namely whether the documentation on github or that on spark.apache.org should be considered the primary reference for people seeking to learn about best practices for developing Spark.
>
> Trying to read the docs starting from https://github.com/apache/spark/blob/master/docs/index.md right now, I find that all of the links to other parts of the documentation are broken: they point to relative paths ending in ".html", which work once published on the docs site, but which would have to end in ".md" for a person to navigate them on github.
>
> So expecting people to use the up-to-date docs on github (where all internal URLs 404 and the main github README suggests that the "latest Spark documentation" can be found on the actually-months-old docs site <https://github.com/apache/spark#online-documentation>) is not a good solution. On the other hand, consulting months-old docs on the site is also problematic, as this thread and your last email have borne out. The result is that there is no good place on the internet to learn about the most up-to-date best practices for using/developing Spark.
>
> Why not build http://spark.apache.org/docs/latest/ nightly (or on every commit) off of what's in github, rather than having that URL point to the last release's docs (up to ~3 months old)? That way, casual users who want the docs for the released version they happen to be using (which is already frequently != "/latest" today, for many Spark users) can still find them at http://spark.apache.org/docs/X.Y.Z, and the github README can safely point people to a site (/latest) that actually has up-to-date docs that reflect ToT and whose links work.
>
> If there are concerns about breaking existing semantics around "/latest" URLs, some new URL could be used, like http://spark.apache.org/docs/snapshot/. But given that everything under http://spark.apache.org/docs/latest/ undergoes planned backwards-incompatible changes every ~3 months, that doesn't sound like a serious issue to me; anyone sending around permanent links to things under /latest is already going to have those links break / stop making sense in the near future.
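For concreteness, a nightly publish along these lines could be a small scheduled job. A minimal sketch, assuming the site is rendered with jekyll out of the docs/ directory (which is how the docs/ README describes building it); the clone depth, host, and target path below are illustrative, not an actual Apache setup:

    #!/bin/sh
    # Hypothetical nightly docs job: render master's docs and publish them
    # under a snapshot URL. Host and paths are placeholders.
    set -e
    git clone --depth 1 https://github.com/apache/spark.git spark-nightly
    cd spark-nightly/docs
    jekyll build    # renders the site into _site/, with working .html links
    rsync -a --delete _site/ docs-host:/var/www/spark/docs/snapshot/

Publishing the rendered output is exactly what makes the ".html" relative links resolve, which is why they 404 in the raw markdown on github.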
> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> - currently the docs only contain information about building with maven, and even then don't cover many important cases
>>
>> All other points aside, I just want to point out that the docs cover both how to use Maven and SBT, and clearly state <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt> that Maven is the "build of reference" while SBT may be preferable for day-to-day development.
>>
>> I believe the main reason most people miss this documentation is that, though it's up-to-date on GitHub, it hasn't been published yet to the docs site. It should go out with the 1.2 release.
>>
>> Improvements to the documentation on building Spark belong here: https://github.com/apache/spark/blob/master/docs/building-spark.md
>>
>> If there are clear recommendations that come out of this thread but are not in that doc, they should be added there. Other, less important details may be better suited for the Contributing to Spark <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark> guide.
>>
>> Nick
>>
>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pwend...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>>
>>> A few more things here. You should feel free to send patches to Jenkins to test them, since this is the reference environment in which we regularly run tests. This is the normal workflow for most developers, and we spend a lot of effort provisioning/maintaining a very large jenkins cluster to allow developers access to this resource. A common development approach is to locally run the tests that you've added in a patch, then send it to jenkins for the full run, and then try to debug locally if you see specific unanticipated test failures.
>>>
>>> One challenge we have is that, given the proliferation of OS versions, Java versions, Python versions, ulimits, etc., there is a combinatorial number of environments in which tests could be run. It is very hard in some cases to figure out post-hoc why a given test is not working in a specific environment. I think a good solution here would be to use a standardized docker container for running Spark tests and to ask folks to use that locally if they are trying to run all of the hundreds of Spark tests.
>>>
>>> Another solution would be to mock out every system interaction in Spark's tests, including e.g. filesystem interactions, to try to reduce variance across environments. However, that seems difficult.
>>>
>>> As the number of developers of Spark increases, it's definitely a good idea for us to invest in developer infrastructure, including things like snapshot releases, better documentation, etc. Thanks for bringing this up as a pain point.
>>>
>>> - Patrick
>>>
>>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>>
>>> > thanks for the info, Matei and Brennon. I will try to switch my workflow to using sbt. Other potential action items:
>>> >
>>> > - currently the docs only contain information about building with maven, and even then don't cover many important cases, as I described in my previous email. If SBT is as much better as you've described, then that should be made much more obvious. Wasn't it the case recently that there was only a page about building with SBT, and not one about building with maven? Clearer messaging around this needs to exist in the documentation, not just on the mailing list, imho.
>>> >
>>> > - +1 to better distinguishing between unit and integration tests: having separate scripts for each, improving documentation around common workflows, expectations of brittleness with each kind of test, the advisability of just relying on Jenkins for certain kinds of tests so as not to waste too much time, etc. Things like the compiler crash should be discussed in the documentation, not just in the mailing list archives, if new contributors are likely to run into them through no fault of their own.
>>> >
>>> > - What is the algorithm you use to decide what tests you might have broken? Can we codify it in some scripts that other people can use?
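On that last question, one codifiable heuristic is "run the suites whose names match the files I touched". A rough sketch, illustrative only (real breakage can span modules, so filename matching is only a first pass):

    #!/bin/sh
    # For every Scala file changed relative to master, run a same-named
    # *Suite via maven if one exists in the tree.
    for f in $(git diff --name-only master...HEAD -- '*.scala'); do
      name=$(basename "$f" .scala)
      # Only invoke maven if a matching test suite actually exists.
      if git ls-files "*/${name}Suite.scala" | grep -q .; then
        mvn -Dsuites="*${name}Suite*" test
      fi
    done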
>>> > On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> >
>>> >> Hi Ryan,
>>> >>
>>> >> As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster to iterate. To use it, you can do the following:
>>> >>
>>> >> - Start the SBT interactive console with sbt/sbt
>>> >> - Build your assembly by running the "assembly" target in the assembly project: assembly/assembly
>>> >> - Run all the tests in one module: core/test
>>> >> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)
>>> >>
>>> >> Running all the tests does take a while, and I usually just rely on Jenkins for that once I've run the tests for the things I believed my patch could break. But this is because some of them are integration tests (e.g. DistributedSuite, which creates multi-process mini-clusters). Many of the individual suites run fast without requiring this, however, so you can pick the ones you want. Perhaps we should find a way to tag them so people can do a "quick-test" that skips the integration ones.
>>> >>
>>> >> The assembly builds are annoying, but they only take about a minute for me on a MacBook Pro with SBT warmed up. The assembly is actually only required for some of the "integration" tests (which launch new processes), but I'd recommend doing it all the time anyway, since it would be very confusing to run those with an old assembly. The Scala compiler crash issue can also be a problem, but I don't see it very often with SBT. If it happens, I exit SBT and do sbt clean.
>>> >>
>>> >> Anyway, this is useful feedback and I think we should try to improve some of these suites, but hopefully you can also try the faster SBT process. At the end of the day, if we want integration tests, the whole test process will take an hour, but most of the developers I know leave that to Jenkins and only run individual tests locally before submitting a patch.
>>> >>
>>> >> Matei
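Put together, a warmed-up SBT session for the workflow Matei describes looks roughly like this; the last command is hypothetical in that it assumes integration suites were annotated with a ScalaTest tag (the tag name here is made up, per the "quick-test" idea above):

    $ sbt/sbt
    > assembly/assembly                                # build the assembly once, up front
    > core/test                                        # run every suite in the core module
    > core/test-only org.apache.spark.rdd.RDDSuite     # run a single suite (tab-completes)
    > core/test-only * -- -l org.apache.spark.IntegrationTest   # hypothetical "quick-test"

ScalaTest's -l flag excludes tests carrying the given tag, so a tagging convention like this would let developers skip DistributedSuite-style integration tests locally and leave them to Jenkins.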
>>> >> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>> >> >
>>> >> > In the course of trying to make contributions to Spark, I have had a lot of trouble running Spark's tests successfully. The main pain points I've experienced are:
>>> >> >
>>> >> > 1) frequent, spurious test failures
>>> >> > 2) high latency of running tests
>>> >> > 3) difficulty running specific tests in an iterative fashion
>>> >> >
>>> >> > Here is an example series of failures that I encountered this weekend (along with footnote links to the console output from each and approximately how long each took):
>>> >> >
>>> >> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen before.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite passed, but the scala compiler crashed on the "catalyst" project.
>>> >> > - `mvn clean`: some attempts to run earlier commands (that previously didn't crash the compiler) all resulted in the same compiler crash. Previous discussion on this list implies this can only be solved by a `mvn clean` [4].
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean, BroadcastSuite can't run because the assembly is not built.
>>> >> > - `./dev/run-tests` again [6]: pyspark tests fail, with some messages about version mismatches and python 2.6. The machine this ran on has python 2.7, so I don't know what that's about.
>>> >> > - `./dev/run-tests` again [7]: "too many open files" errors in several tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is not enough, but only some of the time? I increased it to 8192 and tried again.
>>> >> > - `./dev/run-tests` again [8]: same pyspark errors as before. This seems to be the issue from SPARK-3867 [9], which was supposedly fixed on October 14; not sure how I'm seeing it now. In any case, I switched to Python 2.6 and installed unittest2, and python/run-tests seems to be unblocked.
>>> >> > - `./dev/run-tests` again [10]: finally passes!
>>> >> >
>>> >> > This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial changes added on (that I wanted to test before sending out a PR), on a macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>>> >> >
>>> >> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands from the same repo state:
>>> >> >
>>> >> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>>> >> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen this one before on this machine and am guessing it actually occurs every time.
>>> >> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more time from ceb6281, and saw the same failure.
>>> >> >
>>> >> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac, from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used), and it passed [16], so the failure seems specific to my linux machine/arch.
>>> >> >
>>> >> > At this point I believe that my changes don't break any tests (the YarnClusterSuite failure on my linux presumably not being... "real"), and I am ready to send out a PR. Whew!
>>> >> >
>>> >> > However, reflecting on the 5 or 6 distinct failure modes represented above:
>>> >> >
>>> >> > - One of them (too many files open) is something I can (and did, hopefully) fix once and for all. It cost me about an hour this time (roughly the duration of one ./dev/run-tests run) and a few hours other times when I didn't fully understand/fix it. It doesn't happen deterministically (why?), but it does happen somewhat frequently to people, having been discussed on the user list multiple times [17] and on SO [18]. Maybe some note in the documentation advising people to check their ulimit makes sense? (A sketch of the relevant commands follows this list.)
>>> >> > - One of them (unittest2 must be installed for python 2.6) was supposedly fixed upstream of the commits I tested here; I don't know why I'm still running into it. This cost me a few hours of running `./dev/run-tests` multiple times to see if it was transient, plus some time researching and working around it.
>>> >> > - The original BroadcastSuite failure cost me a few hours and went away before I'd even run `mvn clean`.
>>> >> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a few hours of running `./dev/run-tests` in different ways before deciding that, as usual, there was no way around it and that I'd need to run `mvn clean` and start running tests from scratch.
>>> >> > - The YarnClusterSuite failures on my linux box have cost me hours of trying to figure out whether they're my fault. I've seen them many times over the past weeks/months, plus or minus other failures that have come and gone, and I was especially befuddled by them when I was seeing a disjoint set of reproducible failures on my mac [19] (the triaging of which involved dozens of runs of `./dev/run-tests`).
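For reference, checking and raising the open-file limit looks like this; the limits.conf lines are Linux-specific and the user name and values are just examples:

    $ ulimit -n          # current max open files for this shell
    4864
    $ ulimit -n 8192     # raise it for this shell session only
    # To persist across sessions on Linux, add lines like these to
    # /etc/security/limits.conf (user and values illustrative):
    #   ryan   soft   nofile   8192
    #   ryan   hard   nofile   8192

On OS X the persistent mechanism differs (launchctl limit maxfiles), so the per-shell ulimit call is the portable piece.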
>>> >> > While I'm interested in digging into each of these issues, I also want to discuss the frequency with which I've run into issues like these. This is unfortunately not the first time in recent months that I've spent days playing spurious-test-failure whack-a-mole with a 60-90 min dev/run-tests iteration time, which is no fun! So I am wondering/thinking:
>>> >> >
>>> >> > - Do other people experience this level of flakiness from spark tests?
>>> >> > - Do other people bother running dev/run-tests locally, or just let Jenkins do it during the CR process?
>>> >> > - Needing to run a full assembly post-clean just to continue running one specific test case feels especially wasteful, and the failure output when naively attempting to run a specific test without having built an assembly jar is not always clear about what the issue is or how to fix it; even the fact that certain tests require "building the world" is not something I would have expected, and it has cost me hours of confusion.
>>> >> >   - Should a person running spark tests assume that they must build an assembly JAR before running anything?
>>> >> >   - Are there some proper "unit" tests that are actually self-contained / able to be run without building an assembly jar?
>>> >> >   - Can we better document/demarcate which tests have which dependencies?
>>> >> >   - Is there something finer-grained than building an assembly JAR that is sufficient in some cases? (See the sketches after this list.)
>>> >> >     - If so, can we document that?
>>> >> >     - If not, can we move to a world of finer-grained dependencies for some of these?
>>> >> > - Leaving all of these spurious failures aside, the process of assembling and testing a new JAR is not a quick one (typically 40 and 60 mins for me, respectively). I would guess that there are dozens (hundreds?) of people who build a Spark assembly from various ToTs on any given day, and who all wait on the exact same compilation / assembly steps to occur. Expanding on the recent work to publish nightly snapshots [20], can we do a better job caching/sharing compilation artifacts at a more granular level (pre-built assembly JARs at each SHA? pre-built JARs per maven module, per SHA? more granular maven modules, plus the previous two?), or otherwise save some of the considerable amount of redundant compilation work that I had to do over the course of my odyssey this weekend?
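On the "finer-grained than a full assembly" question: for suites that don't launch external processes, maven can build and test a single module plus its upstream dependencies, which is much cheaper than an assembly. A sketch; whether it suffices depends on the suite, and integration suites still need the assembly:

    # Build core and the modules it depends on, skipping their tests:
    mvn -pl core -am -DskipTests package
    # Then run just core's tests (or one suite, via -Dsuites as earlier
    # in the thread):
    mvn -pl core test

And on per-SHA caching of assembly artifacts, even a purely local version would recoup some of the redundant work; a hypothetical sketch, with the cache layout and paths invented for illustration:

    #!/bin/sh
    # Reuse a previously built assembly when the tree is at a known SHA.
    # Assumes a clean working tree; uncommitted changes would make the
    # SHA-keyed cache entry a lie.
    set -e
    sha=$(git rev-parse HEAD)
    cache=$HOME/.spark-assembly-cache
    mkdir -p "$cache" assembly/target
    if [ -e "$cache/$sha" ]; then
      cp -R "$cache/$sha/." assembly/target/   # cache hit: restore outputs
    else
      sbt/sbt assembly/assembly                # cache miss: build, then save
      mkdir -p "$cache/$sha"
      cp -R assembly/target/. "$cache/$sha/"
    fi

A shared (per-Jenkins, per-team) variant of the same idea is what the per-SHA question above is pointing at.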
>>> >> > Ramping up on most projects involves some amount of supplementing the documentation with trial and error to figure out what to run, which "errors" are real errors and which can be ignored, etc., but navigating that minefield on Spark has proved especially challenging and time-consuming for me. Some of that comes directly from scala's relatively slow compilation times and immature build-tooling ecosystem, but that is the world we live in, and it would be nice if Spark, as one of the more interesting and well-known large scala projects around right now, took the alleviation of the resulting pain more seriously. The official documentation around how to build different subsets of the codebase is somewhat sparse [21], and there have been many mixed [22] accounts [23] on this mailing list about the preferred ways to build with mvn vs. sbt (none of which have made it into official documentation, as far as I've seen). Expecting new contributors to piece together all of this received folk-wisdom about how to build/test in a sane way by trawling mailing list archives seems suboptimal.
>>> >> >
>>> >> > Thanks for reading, looking forward to hearing your ideas!
>>> >> >
>>> >> > -Ryan
>>> >> >
>>> >> > P.S. Is "best practice" for emailing this list to not incorporate any HTML in the body? It seems like all of the archives I've seen strip it out, but other people have used it and gmail displays it.
>>> >> >
>>> >> > [1] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail (57 mins)
>>> >> > [2] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail (6 mins)
>>> >> > [3] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile (4 mins)
>>> >> > [4] http://apache-spark-user-list.1001560.n3.nabble.com/scalac-crash-when-compiling-DataTypeConversions-scala-td17083.html
>>> >> > [5] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>>> >> > [6] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean (50 mins)
>>> >> > [7] https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260 (1hr)
>>> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>>> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>> >> > [12] https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853 (~90 mins)
>>> >> > [13] https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852 (91 mins)
>>> >> > [14] https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>> >> > [17] http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>> >> > [18] http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>> >> > [21] https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>> >> > [23] http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccaohmdzeunhucr41b7krptewmn4cga_2tnpzrwqqb8reekok...@mail.gmail.com%3E

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org