Thanks Patrick, great to hear that docs-snapshots-via-jenkins is already JIRA'd; you can interpret some of this thread as a gigantic +1 from me on prioritizing that, which it looks like you are doing :)
I do understand the limitations of the "github vs. official site" status quo; I was mostly responding to a perceived implication that I should have been getting building/testing-spark advice from the github .md files instead of from /latest. I agree that neither one works very well currently, and that docs-snapshots-via-jenkins is the right solution. Per my other email, leaving /latest as-is sounds reasonable, as long as jenkins is putting the latest docs *somewhere*.

On Sun Nov 30 2014 at 7:19:33 PM Patrick Wendell <pwend...@gmail.com> wrote:

> Btw - the documentation on github represents the source code of our docs, which is versioned with each release. Unfortunately github will always try to render ".md" files, so it could look to a passerby like this is supposed to represent published docs. This is a feature limitation of github; AFAIK we cannot disable it.
>
> The official published docs are associated with each release and available on the apache.org website. I think "/latest" is a common convention for referring to the latest *published release* docs, so probably we can't change that (the audience for /latest is orders of magnitude larger than for snapshot docs). However, we could just add /snapshot and publish docs there.
>
> - Patrick
>
> On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>> Hey Ryan,
>>
>> The existing JIRA also covers publishing nightly docs: https://issues.apache.org/jira/browse/SPARK-1517
>>
>> - Patrick
>>
>> On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>> Thanks Nicholas, glad to hear that some of this info will be pushed to the main site soon, but this brings up yet another point of confusion that I've struggled with: namely, whether the documentation on github or that on spark.apache.org should be considered the primary reference for people seeking to learn about best practices for developing Spark.
>>>
>>> Trying to read docs starting from https://github.com/apache/spark/blob/master/docs/index.md right now, I find that all of the links to other parts of the documentation are broken: they point to relative paths that end in ".html", which will work when published on the docs-site, but which would have to end in ".md" for a person to be able to navigate them on github.
>>>
>>> So expecting people to use the up-to-date docs on github (where all internal URLs 404, and where the main github README suggests that the "latest Spark documentation" can be found on the actually-months-old docs-site <https://github.com/apache/spark#online-documentation>) is not a good solution. On the other hand, consulting months-old docs on the site is also problematic, as this thread and your last email have borne out. The result is that there is no good place on the internet to learn about the most up-to-date best practices for using/developing Spark.
>>>
>>> Why not build http://spark.apache.org/docs/latest/ nightly (or on every commit) from what's in github, rather than having that URL point to the last release's docs (up to ~3 months old)?
>>> This way, casual users who want the docs for the released version they happen to be using (which is already frequently != "/latest" today, for many Spark users) can (still) find them at http://spark.apache.org/docs/X.Y.Z, and the github README can safely point people to a site (/latest) that actually has up-to-date docs that reflect ToT and whose links work.
>>>
>>> If there are concerns about existing semantics around "/latest" URLs being broken, some new URL could be used, like http://spark.apache.org/docs/snapshot/. But given that everything under http://spark.apache.org/docs/latest/ is in a state of planned-backwards-incompatible-changes every ~3mos, that doesn't sound like that serious an issue to me; anyone sending around permanent links to things under /latest is already going to have those links break / not make sense in the near future.
>>>
>>> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> - currently the docs only contain information about building with maven, and even then don't cover many important cases
>>>>
>>>> All other points aside, I just want to point out that the docs document both how to use Maven and SBT, and clearly state <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt> that Maven is the "build of reference" while SBT may be preferable for day-to-day development.
>>>>
>>>> I believe the main reason most people miss this documentation is that, though it's up-to-date on GitHub, it hasn't been published yet to the docs site. It should go out with the 1.2 release.
>>>>
>>>> Improvements to the documentation on building Spark belong here: https://github.com/apache/spark/blob/master/docs/building-spark.md
>>>>
>>>> If there are clear recommendations that come out of this thread but are not in that doc, they should be added in there. Other, less important details may be better suited for the Contributing to Spark <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark> guide.
>>>>
>>>> Nick
>>>>
>>>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pwend...@gmail.com> wrote:
>>>>
>>>>> Hey Ryan,
>>>>>
>>>>> A few more things here. You should feel free to send patches to Jenkins to test them, since this is the reference environment in which we regularly run tests. This is the normal workflow for most developers, and we spend a lot of effort provisioning/maintaining a very large Jenkins cluster to give developers access to this resource. A common development approach is to locally run the tests you've added in a patch, then send it to Jenkins for the full run, and then debug locally if you see specific unanticipated test failures.
>>>>>
>>>>> One challenge we have is that, given the proliferation of OS versions, Java versions, Python versions, ulimits, etc., there is a combinatorial number of environments in which tests could be run. It is very hard in some cases to figure out post-hoc why a given test is not working in a specific environment. I think a good solution here would be to use a standardized docker container for running Spark tests, and to ask folks to use it locally if they are trying to run all of the hundreds of Spark tests.
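(A minimal sketch of that container idea, for illustration only: "spark-test-env" is a hypothetical image name, standing in for an image, which nobody publishes today, that pins the OS, JDK, Python version, and ulimits to match the Jenkins reference environment.

    # Sketch only: mount the local checkout and run the full suite inside
    # the (hypothetical) pinned environment, so that results are comparable
    # across contributors' machines and Jenkins:
    docker run --rm -v "$PWD":/spark -w /spark spark-test-env ./dev/run-tests

The point is that every variable Patrick lists, OS, Java, Python, ulimits, gets fixed once in the image rather than varying per machine.)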
>>>>> Another solution would be to mock out every system interaction in Spark's tests, including e.g. filesystem interactions, to try to reduce variance across environments. However, that seems difficult.
>>>>>
>>>>> As the number of developers of Spark increases, it's definitely a good idea for us to invest in developer infrastructure, including things like snapshot releases, better documentation, etc. Thanks for bringing this up as a pain point.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>>>>> thanks for the info, Matei and Brennon. I will try to switch my workflow to using sbt. Other potential action items:
>>>>>>
>>>>>> - currently the docs only contain information about building with maven, and even then don't cover many important cases, as I described in my previous email. If SBT is as much better as you've described, then that should be made much more obvious. Wasn't it the case recently that there was only a page about building with SBT, and not one about building with maven? Clearer messaging around this needs to exist in the documentation, not just on the mailing list, imho.
>>>>>>
>>>>>> - +1 to better distinguishing between unit and integration tests, having separate scripts for each, improving documentation around common workflows, expectations of brittleness with each kind of test, advisability of just relying on Jenkins for certain kinds of tests so as not to waste too much time, etc. Things like the compiler crash should be discussed in the documentation, not just in the mailing list archives, if new contributors are likely to run into them through no fault of their own.
>>>>>>
>>>>>> - What is the algorithm you use to decide which tests you might have broken? Can we codify it in some scripts that other people can use?
>>>>>>
>>>>>> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster to iterate. To use it, you can do the following:
>>>>>>>
>>>>>>> - Start the SBT interactive console with sbt/sbt
>>>>>>> - Build your assembly by running the "assembly" target in the assembly project: assembly/assembly
>>>>>>> - Run all the tests in one module: core/test
>>>>>>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)
>>>>>>>
>>>>>>> Running all the tests does take a while, and I usually just rely on Jenkins for that once I've run the tests for the things I believed my patch could break. But this is because some of them are integration tests (e.g. DistributedSuite, which creates multi-process mini-clusters). Many of the individual suites run fast without requiring this, however, so you can pick the ones you want.
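(For reference, Matei's steps condensed into commands; the annotations and the one-shot variant at the end are illustrative, standard sbt usage rather than anything prescribed in this thread:

    sbt/sbt    # start the interactive console; scalac stays loaded and JITed
    # then, at the sbt prompt:
    #   assembly/assembly                               (build the assembly)
    #   core/test                                       (all tests in one module)
    #   core/test-only org.apache.spark.rdd.RDDSuite    (one suite; tab-completes)
    # or, one-shot from the shell, paying JVM startup on each invocation:
    sbt/sbt "core/test-only org.apache.spark.rdd.RDDSuite"

The interactive form is the one Matei recommends, precisely because the one-shot form forfeits the warm compiler.)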
>>>>>>> Perhaps we should find a way to tag them so people can do a "quick-test" that skips the integration ones.
>>>>>>>
>>>>>>> The assembly builds are annoying, but they only take about a minute for me on a MacBook Pro with SBT warmed up. The assembly is actually only required for some of the "integration" tests (which launch new processes), but I'd recommend doing it all the time anyway, since it would be very confusing to run those with an old assembly. The Scala compiler crash issue can also be a problem, but I don't see it very often with SBT. If it happens, I exit SBT and do sbt clean.
>>>>>>>
>>>>>>> Anyway, this is useful feedback and I think we should try to improve some of these suites, but hopefully you can also try the faster SBT process. At the end of the day, if we want integration tests, the whole test process will take an hour, but most of the developers I know leave that to Jenkins and only run individual tests locally before submitting a patch.
>>>>>>>
>>>>>>> Matei
>>>>>>>
>>>>>>> On Nov 30, 2014, at 2:39 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> In the course of trying to make contributions to Spark, I have had a lot of trouble running Spark's tests successfully. The main pain points I've experienced are:
>>>>>>>>
>>>>>>>> 1) frequent, spurious test failures
>>>>>>>> 2) high latency of running tests
>>>>>>>> 3) difficulty running specific tests in an iterative fashion
>>>>>>>>
>>>>>>>> Here is an example series of failures that I encountered this weekend (along with footnote links to the console output from each and approximately how long each took):
>>>>>>>>
>>>>>>>> - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen before.
>>>>>>>> - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
>>>>>>>> - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite passed, but the scala compiler crashed on the "catalyst" project.
>>>>>>>> - `mvn clean`: some attempts to run earlier commands (that previously didn't crash the compiler) all result in the same compiler crash. Previous discussion on this list implies this can only be solved by a `mvn clean` [4].
>>>>>>>> - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean, BroadcastSuite can't run because the assembly is not built.
>>>>>>>> - `./dev/run-tests` again [6]: pyspark tests fail, with some messages about version mismatches and python 2.6. The machine this ran on has python 2.7, so I don't know what that's about.
>>>>>>>> - `./dev/run-tests` again [7]: "too many open files" errors in several tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is not enough, but only some of the time? I increased it to 8192 and tried again.
>>>>>>>> - `./dev/run-tests` again [8]: same pyspark errors as before. This seems to be the issue from SPARK-3867 [9], which was supposedly fixed on October 14; not sure how I'm seeing it now.
>>>>>>>> In any case, switched to Python 2.6 and installed unittest2, and python/run-tests seems to be unblocked.
>>>>>>>> - `./dev/run-tests` again [10]: finally passes!
>>>>>>>>
>>>>>>>> This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial changes added on (that I wanted to test before sending out a PR), on a macbook running OSX Yosemite (10.10.1), java 1.8, and mvn 3.2.3 [11].
>>>>>>>>
>>>>>>>> Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands from the same repo state:
>>>>>>>>
>>>>>>>> - `./dev/run-tests` [12]: YarnClusterSuite failure.
>>>>>>>> - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen this one before on this machine, and am guessing it actually occurs every time.
>>>>>>>> - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more time from ceb6281, and saw the same failure.
>>>>>>>>
>>>>>>>> This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac, from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used), and it passed [16], so the failure seems specific to my linux machine/arch.
>>>>>>>>
>>>>>>>> At this point I believe that my changes don't break any tests (the YarnClusterSuite failure on my linux presumably not being... "real"), and I am ready to send out a PR. Whew!
>>>>>>>>
>>>>>>>> However, reflecting on the 5 or 6 distinct failure-modes represented above:
>>>>>>>>
>>>>>>>> - One of them (too many files open) is something I can (and did, hopefully) fix once and for all. It cost me an ~hour this time (the approximate time of running ./dev/run-tests) and a few hours other times when I didn't fully understand/fix it. It doesn't happen deterministically (why?), but does happen somewhat frequently to people, having been discussed on the user list multiple times [17] and on SO [18]. Maybe some note in the documentation advising people to check their ulimit makes sense?
>>>>>>>> - One of them (unittest2 must be installed for python 2.6) was supposedly fixed upstream of the commits I tested here; I don't know why I'm still running into it. This cost me a few hours of running `./dev/run-tests` multiple times to see if it was transient, plus some time researching and working around it.
>>>>>>>> - The original BroadcastSuite failure cost me a few hours, and went away before I'd even run `mvn clean`.
>>>>>>>> - A new incarnation of the sbt-compiler-crash phenomenon cost me a few hours of running `./dev/run-tests` in different ways before deciding that, as usual, there was no way around it and that I'd need to run `mvn clean` and start running tests from scratch.
>>>>>>>> - The YarnClusterSuite failures on my linux box have cost me hours of trying to figure out whether they're my fault.
>>>>>>>> I've seen them many times over the past weeks/months, plus or minus other failures that have come and gone, and was especially befuddled by them when I was seeing a disjoint set of reproducible failures on my mac [19] (the triaging of which involved dozens of runs of `./dev/run-tests`).
>>>>>>>>
>>>>>>>> While I'm interested in digging into each of these issues, I also want to discuss the frequency with which I've run into issues like these. This is unfortunately not the first time in recent months that I've spent days playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests iteration time, which is no fun! So I am wondering/thinking:
>>>>>>>>
>>>>>>>> - Do other people experience this level of flakiness from spark tests?
>>>>>>>> - Do other people bother running dev/run-tests locally, or just let Jenkins do it during the CR process?
>>>>>>>> - Needing to run a full assembly post-clean just to continue running one specific test case feels especially wasteful, and the failure output when naively attempting to run a specific test without having built an assembly jar is not always clear about what the issue is or how to fix it; even the fact that certain tests require "building the world" is not something I would have expected, and has cost me hours of confusion.
>>>>>>>>   - Should a person running spark tests assume that they must build an assembly JAR before running anything?
>>>>>>>>   - Are there some proper "unit" tests that are actually self-contained / able to be run without building an assembly jar?
>>>>>>>>   - Can we better document/demarcate which tests have which dependencies?
>>>>>>>>   - Is there something finer-grained than building an assembly JAR that is sufficient in some cases?
>>>>>>>>     - If so, can we document that?
>>>>>>>>     - If not, can we move to a world of finer-grained dependencies for some of these?
>>>>>>>> - Leaving all of these spurious failures aside, the process of assembling and testing a new JAR is not a quick one (40 and 60 mins for me typically, respectively). I would guess that there are dozens (hundreds?) of people who build a Spark assembly from various ToTs on any given day, and who all wait on the exact same compilation / assembly steps to occur. Expanding on the recent work to publish nightly snapshots [20], can we do a better job caching/sharing compilation artifacts at a more granular level (pre-built assembly JARs at each SHA? pre-built JARs per-maven-module, per-SHA? more granular maven modules, plus the previous two?), or otherwise save some of the considerable amount of redundant compilation work that I had to do over the course of my odyssey this weekend?
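(On the ulimit question in the list above, the check and the session-local fix are each one line; the 10000 below is an arbitrary illustration, chosen only to clear the 4864 cap that the runs above showed was not always enough:

    ulimit -n          # print this shell's open-files cap (soft limit)
    ulimit -n 10000    # raise it for this shell, up to the hard limit;
                       # add the line to your shell profile to make it stick

The persistent fix differs by OS, e.g. limits.conf on Linux, which may be why no single note has made it into the docs yet.)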
>>>>>>>> Ramping up on most projects involves some amount of supplementing the documentation with trial and error to figure out what to run, which "errors" are real errors and which can be ignored, etc., but navigating that minefield on Spark has proved especially challenging and time-consuming for me. Some of that comes directly from scala's relatively slow compilation times and immature build-tooling ecosystem, but that is the world we live in, and it would be nice if Spark took the alleviation of the resulting pain more seriously, as one of the more interesting and well-known large scala projects around right now. The official documentation around how to build different subsets of the codebase is somewhat sparse [21], and there have been many mixed [22] accounts [23] on this mailing list about preferred ways to build with mvn vs. sbt (none of which has made it into official documentation, as far as I've seen). Expecting new contributors to piece together all of this received folk-wisdom about how to build/test in a sane way by trawling mailing list archives seems suboptimal.
>>>>>>>>
>>>>>>>> Thanks for reading, looking forward to hearing your ideas!
>>>>>>>>
>>>>>>>> -Ryan
>>>>>>>>
>>>>>>>> P.S. Is "best practice" for emailing this list to not incorporate any HTML in the body? It seems like all of the archives I've seen strip it out, but other people have used it and gmail displays it.
>>>>>>>>
>>>>>>>> [1] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/484c2fb8bc0efa0e39d142087eefa9c3d5292ea3/dev%20run-tests:%20fail (57 mins)
>>>>>>>> [2] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/ce264e469be3641f061eabd10beb1d71ac243991/mvn%20test:%20fail (6 mins)
>>>>>>>> [3] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/6bc76c67aeef9c57ddd9fb2ba260fb4189dbb927/mvn%20test%20case:%20pass%20test,%20fail%20subsequent%20compile (4 mins)
>>>>>>>> [4] http://apache-spark-user-list.1001560.n3.nabble.com/scalac-crash-when-compiling-DataTypeConversions-scala-td17083.html
>>>>>>>> [5] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/4ab0bd6e76d9fc5745eb4b45cdf13195d10efaa2/mvn%20test,%20post%20clean,%20need%20dependencies%20built
>>>>>>>> [6] https://gist.githubusercontent.com/ryan-williams/8a162367c4dc157d2479/raw/f4c7e6fc8c301f869b00598c7b541dac243fb51e/dev%20run-tests,%20post%20clean (50 mins)
>>>>>>>> [7] https://gist.github.com/ryan-williams/57f8bfc9328447fc5b97#file-dev-run-tests-failure-too-many-files-open-then-hang-L5260 (1hr)
>>>>>>>> [8] https://gist.github.com/ryan-williams/d0164194ad5de03f6e3f (1hr)
>>>>>>>> [9] https://issues.apache.org/jira/browse/SPARK-3867
>>>>>>>> [10] https://gist.github.com/ryan-williams/735adf543124c99647cc
>>>>>>>> [11] https://gist.github.com/ryan-williams/8d149bbcd0c6689ad564
>>>>>>>> [12] https://gist.github.com/ryan-williams/07df5c583c9481fe1c14#file-gistfile1-txt-L853 (~90 mins)
>>>>>>>> [13] https://gist.github.com/ryan-williams/718f6324af358819b496#file-gistfile1-txt-L852 (91 mins)
>>>>>>>> [14] https://gist.github.com/ryan-williams/c06c1f4aa0b16f160965#file-gistfile1-txt-L854
>>>>>>>> [15] https://gist.github.com/ryan-williams/f8d410b5b9f082039c73
>>>>>>>> [16] https://gist.github.com/ryan-williams/2e94f55c9287938cf745
>>>>>>>> [17] http://apache-spark-user-list.1001560.n3.nabble.com/quot-Too-many-open-files-quot-exception-on-reduceByKey-td2462.html
>>>>>>>> [18] http://stackoverflow.com/questions/25707629/why-does-spark-job-fail-with-too-many-open-files
>>>>>>>> [19] https://issues.apache.org/jira/browse/SPARK-4002
>>>>>>>> [20] https://issues.apache.org/jira/browse/SPARK-4542
>>>>>>>> [21] https://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>>>>>>>> [22] https://www.mail-archive.com/dev@spark.apache.org/msg06443.html
>>>>>>>> [23] http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3CCAOhmDzeUNhuCr41B7KRPTEwMn4cga_2TNpZrWqQB8REekokxzg@mail.gmail.com%3E
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: dev-h...@spark.apache.org