Hey Ryan,

The existing JIRA also covers publishing nightly docs:
https://issues.apache.org/jira/browse/SPARK-1517

- Patrick
On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:

> Thanks Nicholas, glad to hear that some of this info will be pushed to the main site soon, but this brings up yet another point of confusion that I've struggled with, namely whether the documentation on github or that on spark.apache.org should be considered the primary reference for people seeking to learn about best practices for developing Spark.
>
> Trying to read the docs starting from https://github.com/apache/spark/blob/master/docs/index.md right now, I find that all of the links to other parts of the documentation are broken: they point to relative paths ending in ".html", which work once published on the docs site, but which would have to end in ".md" for a person to navigate them on github.
>
> So expecting people to use the up-to-date docs on github (where all internal URLs 404 and the main github README suggests that the "latest Spark documentation" can be found on the actually-months-old docs site <https://github.com/apache/spark#online-documentation>) is not a good solution. On the other hand, consulting months-old docs on the site is also problematic, as this thread and your last email have borne out. The result is that there is no good place on the internet to learn about the most up-to-date best practices for using/developing Spark.
>
> Why not build http://spark.apache.org/docs/latest/ nightly (or on every commit) off of what's in github, rather than having that URL point to the last release's docs (up to ~3 months old)? That way, casual users who want the docs for the released version they happen to be using (which is already frequently != "/latest" today, for many Spark users) can still find them at http://spark.apache.org/docs/X.Y.Z, and the github README can safely point people to a site (/latest) that actually has up-to-date docs that reflect ToT and whose links work.
>
> If there are concerns about breaking existing semantics around "/latest" URLs, some new URL could be used, like http://spark.apache.org/docs/snapshot/. But given that everything under http://spark.apache.org/docs/latest/ undergoes planned backwards-incompatible changes every ~3 months, that doesn't sound like a serious issue to me; anyone sending around permanent links to things under /latest is already going to have those links break / stop making sense in the near future.
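For concreteness, a nightly publish along these lines could be a small scheduled job. A minimal sketch, assuming the site is rendered with jekyll out of the docs/ directory (which is how the docs/ README describes building it); the clone depth, host, and target path below are illustrative, not an actual Apache setup:

    #!/bin/sh
    # Hypothetical nightly docs job: render master's docs and publish them
    # under a snapshot URL. Host and paths are placeholders.
    set -e
    git clone --depth 1 https://github.com/apache/spark.git spark-nightly
    cd spark-nightly/docs
    jekyll build    # renders the site into _site/, with working .html links
    rsync -a --delete _site/ docs-host:/var/www/spark/docs/snapshot/

Publishing the rendered output is exactly what makes the ".html" relative links resolve, which is why they 404 in the raw markdown on github.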
> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> - currently the docs only contain information about building with maven, and even then don't cover many important cases
>>
>> All other points aside, I just want to point out that the docs cover both how to use Maven and SBT, and clearly state <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt> that Maven is the "build of reference" while SBT may be preferable for day-to-day development.
>>
>> I believe the main reason most people miss this documentation is that, though it's up-to-date on GitHub, it hasn't been published yet to the docs site. It should go out with the 1.2 release.
>>
>> Improvements to the documentation on building Spark belong here: https://github.com/apache/spark/blob/master/docs/building-spark.md
>>
>> If there are clear recommendations that come out of this thread but are not in that doc, they should be added there. Other, less important details may be better suited for the Contributing to Spark <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark> guide.
>>
>> Nick
>>
>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pwend...@gmail.com> wrote:
>>
>>> Hey Ryan,
>>>
>>> A few more things here. You should feel free to send patches to Jenkins to test them, since this is the reference environment in which we regularly run tests. This is the normal workflow for most developers, and we spend a lot of effort provisioning/maintaining a very large jenkins cluster to allow developers access to this resource. A common development approach is to locally run the tests that you've added in a patch, then send it to jenkins for the full run, and then try to debug locally if you see specific unanticipated test failures.
>>>
>>> One challenge we have is that, given the proliferation of OS versions, Java versions, Python versions, ulimits, etc., there is a combinatorial number of environments in which tests could be run. It is very hard in some cases to figure out post-hoc why a given test is not working in a specific environment. I think a good solution here would be to use a standardized docker container for running Spark tests and to ask folks to use that locally if they are trying to run all of the hundreds of Spark tests.
>>>
>>> Another solution would be to mock out every system interaction in Spark's tests, including e.g. filesystem interactions, to try to reduce variance across environments. However, that seems difficult.
>>>
>>> As the number of developers of Spark increases, it's definitely a good idea for us to invest in developer infrastructure, including things like snapshot releases, better documentation, etc. Thanks for bringing this up as a pain point.
>>>
>>> - Patrick
>>>
>>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>>
>>> > thanks for the info, Matei and Brennon. I will try to switch my workflow to using sbt. Other potential action items:
>>> >
>>> > - currently the docs only contain information about building with maven, and even then don't cover many important cases, as I described in my previous email. If SBT is as much better as you've described, then that should be made much more obvious. Wasn't it the case recently that there was only a page about building with SBT, and not one about building with maven? Clearer messaging around this needs to exist in the documentation, not just on the mailing list, imho.
>>> >
>>> > - +1 to better distinguishing between unit and integration tests: having separate scripts for each, improving documentation around common workflows, expectations of brittleness with each kind of test, the advisability of just relying on Jenkins for certain kinds of tests so as not to waste too much time, etc. Things like the compiler crash should be discussed in the documentation, not just in the mailing list archives, if new contributors are likely to run into them through no fault of their own.
>>> >
>>> > - What is the algorithm you use to decide what tests you might have broken? Can we codify it in some scripts that other people can use?
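On that last question, one codifiable heuristic is "run the suites whose names match the files I touched". A rough sketch, illustrative only (real breakage can span modules, so filename matching is only a first pass):

    #!/bin/sh
    # For every Scala file changed relative to master, run a same-named
    # *Suite via maven if one exists in the tree.
    for f in $(git diff --name-only master...HEAD -- '*.scala'); do
      name=$(basename "$f" .scala)
      # Only invoke maven if a matching test suite actually exists.
      if git ls-files "*/${name}Suite.scala" | grep -q .; then
        mvn -Dsuites="*${name}Suite*" test
      fi
    done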
>>> > On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> >
>>> >> Hi Ryan,
>>> >>
>>> >> As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster to iterate. To use it, you can do the following:
>>> >>
>>> >> - Start the SBT interactive console with sbt/sbt
>>> >> - Build your assembly by running the "assembly" target in the assembly project: assembly/assembly
>>> >> - Run all the tests in one module: core/test
>>> >> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)
>>> >>
>>> >> Running all the tests does take a while, and I usually just rely on Jenkins for that once I've run the tests for the things I believed my patch could break. But this is because some of them are integration tests (e.g. DistributedSuite, which creates multi-process mini-clusters). Many of the individual suites run fast without requiring this, however, so you can pick the ones you want. Perhaps we should find a way to tag them so people can do a "quick-test" that skips the integration ones.
>>> >>
>>> >> The assembly builds are annoying, but they only take about a minute for me on a MacBook Pro with SBT warmed up. The assembly is actually only required for some of the "integration" tests (which launch new processes), but I'd recommend doing it all the time anyway, since it would be very confusing to run those with an old assembly. The Scala compiler crash issue can also be a problem, but I don't see it very often with SBT. If it happens, I exit SBT and do sbt clean.
>>> >>
>>> >> Anyway, this is useful feedback and I think we should try to improve some of these suites, but hopefully you can also try the faster SBT process. At the end of the day, if we want integration tests, the whole test process will take an hour, but most of the developers I know leave that to Jenkins and only run individual tests locally before submitting a patch.
>>> >>
>>> >> Matei
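Put together, a warmed-up SBT session for the workflow Matei describes looks roughly like this; the last command is hypothetical in that it assumes integration suites were annotated with a ScalaTest tag (the tag name here is made up, per the "quick-test" idea above):

    $ sbt/sbt
    > assembly/assembly                                # build the assembly once, up front
    > core/test                                        # run every suite in the core module
    > core/test-only org.apache.spark.rdd.RDDSuite     # run a single suite (tab-completes)
    > core/test-only * -- -l org.apache.spark.IntegrationTest   # hypothetical "quick-test"

ScalaTest's -l flag excludes tests carrying the given tag, so a tagging convention like this would let developers skip DistributedSuite-style integration tests locally and leave them to Jenkins.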
>>> >> > On Nov 30, 2014, at 2:39 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>> >> >
>>> >> > In the course of trying to make contributions to Spark, I have had a lot of trouble running Spark's tests successfully. The main pain points I've experienced are:
>>> >> >
>>> >> > 1) frequent, spurious test failures
>>> >> > 2) high latency of running tests
>>> >> > 3) difficulty running specific tests in an iterative fashion
>>> >> >
>>> >> > Here is an example series of failures that I encountered this weekend (along with footnote links to the console output from each and approximately how long each took):
>>> >> >
>>> >> > - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen before.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>>> >> > - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite passed, but the scala compiler crashed on the "catalyst" project.
>>> >> > - `mvn clean`: some attempts to run earlier commands (that previously didn't crash the compiler) all resulted in the same compiler crash. Previous discussion on this list implies this can only be solved by a `mvn clean` [4].
>>> >> > - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean, BroadcastSuite can't run because the assembly is not built.
>>> >> > - `./dev/run-tests` again [6]: pyspark tests fail, with some messages about version mismatches and python 2.6. The machine this ran on has python 2.7, so I don't know what that's about.
>>> >> > - `./dev/run-tests` again [7]: "too many open files" errors in several tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is not enough, but only some of the time? I increased it to 8192 and tried again.
>>> >> > - `./dev/run-tests` again [8]: same pyspark errors as before. This seems to be the issue from SPARK-3867 [9], which was supposedly fixed on October 14; not sure how I'm seeing it now. In any case, I switched to Python 2.6 and installed unittest2, and python/run-tests seems to be unblocked.
>>> >> > - `./dev/run-tests` again [10]: finally passes!
>>> >> >
>>> >> > This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial changes added on (that I wanted to test before sending out a PR), on a macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
>>> >> >
>>> >> > Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands from the same repo state:
>>> >> >
>>> >> > - `./dev/run-tests` [12]: YarnClusterSuite failure.
>>> >> > - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen this one before on this machine and am guessing it actually occurs every time.
>>> >> > - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more time from ceb6281, and saw the same failure.
>>> >> >
>>> >> > This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac, from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used), and it passed [16], so the failure seems specific to my linux machine/arch.
>>> >> >
>>> >> > At this point I believe that my changes don't break any tests (the YarnClusterSuite failure on my linux presumably not being... "real"), and I am ready to send out a PR. Whew!
>>> >> >
>>> >> > However, reflecting on the 5 or 6 distinct failure modes represented above:
>>> >> >
>>> >> > - One of them (too many files open) is something I can (and did, hopefully) fix once and for all. It cost me about an hour this time (roughly the duration of one ./dev/run-tests run) and a few hours other times when I didn't fully understand/fix it. It doesn't happen deterministically (why?), but it does happen somewhat frequently to people, having been discussed on the user list multiple times [17] and on SO [18]. Maybe some note in the documentation advising people to check their ulimit makes sense? (A sketch of the relevant commands follows this list.)
>>> >> > - One of them (unittest2 must be installed for python 2.6) was supposedly fixed upstream of the commits I tested here; I don't know why I'm still running into it. This cost me a few hours of running `./dev/run-tests` multiple times to see if it was transient, plus some time researching and working around it.
>>> >> > - The original BroadcastSuite failure cost me a few hours and went away before I'd even run `mvn clean`.
>>> >> > - A new incarnation of the sbt-compiler-crash phenomenon cost me a few hours of running `./dev/run-tests` in different ways before deciding that, as usual, there was no way around it and that I'd need to run `mvn clean` and start running tests from scratch.
>>> >> > - The YarnClusterSuite failures on my linux box have cost me hours of trying to figure out whether they're my fault. I've seen them many times over the past weeks/months, plus or minus other failures that have come and gone, and I was especially befuddled by them when I was seeing a disjoint set of reproducible failures on my mac [19] (the triaging of which involved dozens of runs of `./dev/run-tests`).
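For reference, checking and raising the open-file limit looks like this; the limits.conf lines are Linux-specific and the user name and values are just examples:

    $ ulimit -n          # current max open files for this shell
    4864
    $ ulimit -n 8192     # raise it for this shell session only
    # To persist across sessions on Linux, add lines like these to
    # /etc/security/limits.conf (user and values illustrative):
    #   ryan   soft   nofile   8192
    #   ryan   hard   nofile   8192

On OS X the persistent mechanism differs (launchctl limit maxfiles), so the per-shell ulimit call is the portable piece.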
>>> >> > While I'm interested in digging into each of these issues, I also want to discuss the frequency with which I've run into issues like these. This is unfortunately not the first time in recent months that I've spent days playing spurious-test-failure whack-a-mole with a 60-90 min dev/run-tests iteration time, which is no fun! So I am wondering/thinking:
>>> >> >
>>> >> > - Do other people experience this level of flakiness from spark tests?
>>> >> > - Do other people bother running dev/run-tests locally, or just let Jenkins do it during the CR process?
>>> >> > - Needing to run a full assembly post-clean just to continue running one specific test case feels especially wasteful, and the failure output when naively attempting to run a specific test without having built an assembly jar is not always clear about what the issue is or how to fix it; even the fact that certain tests require "building the world" is not something I would have expected, and it has cost me hours of confusion.
>>> >> >   - Should a person running spark tests assume that they must build an assembly JAR before running anything?
>>> >> >   - Are there some proper "unit" tests that are actually self-contained / able to be run without building an assembly jar?
>>> >> >   - Can we better document/demarcate which tests have which dependencies?
>>> >> >   - Is there something finer-grained than building an assembly JAR that is sufficient in some cases? (See the sketches after this list.)
>>> >> >     - If so, can we document that?
>>> >> >     - If not, can we move to a world of finer-grained dependencies for some of these?
>>> >> > - Leaving all of these spurious failures aside, the process of assembling and testing a new JAR is not a quick one (typically 40 and 60 mins for me, respectively). I would guess that there are dozens (hundreds?) of people who build a Spark assembly from various ToTs on any given day, and who all wait on the exact same compilation / assembly steps to occur. Expanding on the recent work to publish nightly snapshots [20], can we do a better job caching/sharing compilation artifacts at a more granular level (pre-built assembly JARs at each SHA? pre-built JARs per maven module, per SHA? more granular maven modules, plus the previous two?), or otherwise save some of the considerable amount of redundant compilation work that I had to do over the course of my odyssey this weekend?
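On the "finer-grained than a full assembly" question: for suites that don't launch external processes, maven can build and test a single module plus its upstream dependencies, which is much cheaper than an assembly. A sketch; whether it suffices depends on the suite, and integration suites still need the assembly:

    # Build core and the modules it depends on, skipping their tests:
    mvn -pl core -am -DskipTests package
    # Then run just core's tests (or one suite, via -Dsuites as earlier
    # in the thread):
    mvn -pl core test

And on per-SHA caching of assembly artifacts, even a purely local version would recoup some of the redundant work; a hypothetical sketch, with the cache layout and paths invented for illustration:

    #!/bin/sh
    # Reuse a previously built assembly when the tree is at a known SHA.
    # Assumes a clean working tree; uncommitted changes would make the
    # SHA-keyed cache entry a lie.
    set -e
    sha=$(git rev-parse HEAD)
    cache=$HOME/.spark-assembly-cache
    mkdir -p "$cache" assembly/target
    if [ -e "$cache/$sha" ]; then
      cp -R "$cache/$sha/." assembly/target/   # cache hit: restore outputs
    else
      sbt/sbt assembly/assembly                # cache miss: build, then save
      mkdir -p "$cache/$sha"
      cp -R assembly/target/. "$cache/$sha/"
    fi

A shared (per-Jenkins, per-team) variant of the same idea is what the per-SHA question above is pointing at.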
>>> >> > Ramping up on most projects involves some amount of supplementing the documentation with trial and error to figure out what to run, which "errors" are real errors and which can be ignored, etc., but navigating that minefield on Spark has proved especially challenging and time-consuming for me. Some of that comes directly from scala's relatively slow compilation times and immature build-tooling ecosystem, but that is the world we live in, and it would be nice if Spark, as one of the more interesting and well-known large scala projects around right now, took the alleviation of the resulting pain more seriously. The official documentation around how to build different subsets of the codebase is somewhat sparse [21], and there have been many mixed [22] accounts [23] on this mailing list about the preferred ways to build with mvn vs. sbt (none of which have made it into official documentation, as far as I've seen). Expecting new contributors to piece together all of this received folk-wisdom about how to build/test in a sane way by trawling mailing list archives seems suboptimal.
>>> >> >
>>> >> > Thanks for reading, looking forward to hearing your ideas!
>>> >> >
>>> >> > -Ryan
>>> >> >
>>> >> > P.S. Is "best practice" for emailing this list to not incorporate any HTML in the body? It seems like all of the archives I've seen strip it out, but other people have used it and gmail displays it.
>>> >> >
>>> >> > [1] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail (57 mins)
>>> >> > [2] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail (6 mins)
>>> >> > [3] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile (4 mins)
>>> >> > [4] http://apache-spark-user-list.1001560.n3.nabble.com/scalac-crash-when-compiling-DataTypeConversions-scala-td17083.html
>>> >> > [5] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>>> >> > [6] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean (50 mins)
>>> >> > [7] https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260 (1hr)
>>> >> > [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>> >> > [9] https://issues.apache.org/jira/browse/SPARK-3867
>>> >> > [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>> >> > [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>> >> > [12] https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853 (~90 mins)
>>> >> > [13] https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852 (91 mins)
>>> >> > [14] https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>>> >> > [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>> >> > [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>> >> > [17] http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>> >> > [18] http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>>> >> > [19] https://issues.apache.org/jira/browse/SPARK-4002
>>> >> > [20] https://issues.apache.org/jira/browse/SPARK-4542
>>> >> > [21] https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>>> >> > [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>> >> > [23] http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccaohmdzeunhucr41b7krptewmn4cga_2tnpzrwqqb8reekok...@mail.gmail.com%3E

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org