This is great. I think the consensus from last time was that we would put performance stuff into spark-perf, so it is easy to test different Spark versions.
On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:
> Hi all,
> I saw that Reynold Xin had a Terasort example PR on Github [1]. It didn't
> appear to be similar to the Hadoop Terasort example, so I've tried to brush
> it into shape so it can generate Terasort files (teragen), sort the files
> (terasort), and validate the files (teravalidate). My branch is available
> here:
>
> https://github.com/ehiggs/spark/tree/terasort
>
> With this code, you can run the following:
>
> # Generate 1M 100-byte records:
> ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
>
> # Sort the file:
> MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in ~/data/terasort_out
>
> # Validate the file:
> MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_out ~/data/terasort_validate
>
> # Validate that an unsorted file is indeed not correctly sorted:
> MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_in ~/data/terasort_validate_bad
>
> This matches the interface of the Hadoop version of Terasort, except that I
> added the ability to use K, M, G, and T suffixes for record sizes in
> TeraGen. This code therefore makes a good example of how to use Spark, how
> to read and write Hadoop files, and also a way to test some of the
> performance claims of Spark.
>
> > That's great, but why is this on the mailing list and not submitted as a PR?
>
> I suspect there are some rough edges, and I'd really appreciate reviews. I
> would also like to know whether others can try it out on clusters and tell
> me if it's performing as it should.
>
> For example, I find it runs fine on my local machine, but when I try to
> sort 100G of data on a cluster of 16 nodes, I get more than 2,900 file
> splits. This really eats into the sort time.
>
> Another issue is that in TeraValidate, to work around SPARK-1018 I had to
> clone each element. Does this /really/ need to be done? It's pretty lame.
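The cloning Ewan mentions stems from Hadoop RecordReaders reusing a single mutable object for every record, so a collection built straight on the reader ends up with N references to the same buffer. Here is a minimal pure-Scala sketch of the pitfall and the workaround; `ReusingReader` is a hypothetical stand-in for a Hadoop-style reader, not code from his branch:

```scala
// Hypothetical stand-in for a Hadoop-style RecordReader: it returns
// the SAME mutable array for every record, overwriting it in place.
class ReusingReader(records: Seq[Array[Byte]]) {
  private val buf = new Array[Byte](4)
  def iterator: Iterator[Array[Byte]] =
    records.iterator.map { r =>
      Array.copy(r, 0, buf, 0, r.length)
      buf // every call hands back the same object
    }
}

object ReuseDemo extends App {
  val records = Seq(Array[Byte](1, 2, 3, 4), Array[Byte](5, 6, 7, 8))

  // Collecting the raw iterator: both elements alias the reused buffer,
  // so both show the contents of the LAST record.
  val raw = new ReusingReader(records).iterator.toArray
  println(raw.map(_.toSeq))

  // Cloning each element before collecting keeps the records distinct,
  // which is the essence of the SPARK-1018 workaround.
  val cloned = new ReusingReader(records).iterator.map(_.clone).toArray
  println(cloned.map(_.toSeq))
}
```

So the clone really is needed whenever records outlive the current iteration step; the cost is one array copy per record.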
> In any event, I know the Spark 1.2 merge window closed on Friday, but as
> this is only for the examples directory, maybe we can slip it in if we can
> bash it into shape quickly enough?
>
> Anyway, thanks to everyone on #apache-spark and #scala who helped me get
> through learning some rudimentary Scala to get this far.
>
> Yours,
> Ewan Higgs
>
> [1] https://github.com/apache/spark/pull/1242
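The K/M/G/T suffix handling Ewan added for TeraGen sizes could look something like the following. This is a hypothetical sketch, not the code from his branch; the name `sizeToBytes` and the binary (1024-based) multipliers are assumptions:

```scala
object SizeSuffix {
  // Parse sizes like "100M" or "2G" into a byte count, assuming
  // binary (1024-based) multipliers. Plain digits mean raw bytes.
  def sizeToBytes(s: String): Long = {
    val multipliers = Map(
      'K' -> (1L << 10), 'M' -> (1L << 20),
      'G' -> (1L << 30), 'T' -> (1L << 40))
    multipliers.get(s.last.toUpper) match {
      case Some(m) => s.init.toLong * m
      case None    => s.toLong
    }
  }
}
```

Under that reading, `TeraGen 100M` asks for 100 * 2^20 bytes of output, which at 100 bytes per record is roughly the 1M records the comment in the example describes.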