This is great. I think the consensus from last time was that we would put
performance stuff into spark-perf, so it is easy to test different Spark
versions.


On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

> Hi all,
> I saw that Reynold Xin had a Terasort example PR on Github[1]. It didn't
> appear to be similar to the Hadoop Terasort example, so I've tried to brush
> it into shape so it can generate Terasort files (teragen), sort the files
> (terasort) and validate the files (teravalidate). My branch is available
> here:
>
> https://github.com/ehiggs/spark/tree/terasort
>
> With this code, you can run the following:
>
> # Generate 1M 100-byte records:
> ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
>
> # Sort the file:
> MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in ~/data/terasort_out
>
> # Validate the file:
> MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_out ~/data/terasort_validate
>
> # Validate that an unsorted file is indeed not correctly sorted:
> MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_in ~/data/terasort_validate_bad
>
> This matches the interface for the Hadoop version of Terasort, except I
> added the ability to use K,M,G,T for record sizes in TeraGen. This code
> therefore makes a good example of how to use Spark, how to read and write
> Hadoop files, and also a way to test some of the performance claims of
> Spark.
>
> > That's great, but why is this on the mailing list and not submitted as a
> PR?
>
> I suspect there are some rough edges and I'd really appreciate reviews. I
> would also like to know if others can try it out on clusters and tell me if
> it's performing as it should.
>
> For example, I find it runs fine on my local machine, but when I try to
> sort 100G of data on a cluster of 16 nodes, I get >2900 file splits. This
> really eats into the sort time.
>
> Another issue is that in TeraValidate, to work around SPARK-1018 I had to
> clone each element. Does this /really/ need to be done? It's pretty lame.
>
> In any event, I know the Spark 1.2 merge window closed on Friday but as
> this is only for the examples directory maybe we can slip it in if we can
> bash it into shape quickly enough?
>
> Anyway, thanks to everyone on #apache-spark and #scala who helped me get
> through learning some rudimentary Scala to get this far.
>
> Yours,
> Ewan Higgs
>
> [1] https://github.com/apache/spark/pull/1242
>
>
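
On the K,M,G,T suffixes for TeraGen mentioned in the quoted message: the sketch below shows one way such an argument could be turned into a record count. It is plain Python for illustration, not code from the branch. The helper names are hypothetical, and it assumes the suffix counts bytes with decimal multipliers, which is only inferred from the "1M 100-byte records" comment next to the 100M argument.

```python
# Hypothetical helpers, not taken from the terasort branch.
RECORD_BYTES = 100  # TeraSort records are 100 bytes each

# Assumption: decimal multipliers (K = 10^3, ...), so that 100M gives 1M records.
MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_size(arg: str) -> int:
    """Return the number of bytes denoted by a string like '100M' or '4096'."""
    text = arg.strip().upper()
    if not text:
        raise ValueError("empty size argument")
    if text[-1] in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[text[-1]]
    return int(text)


def record_count(arg: str) -> int:
    """Number of whole 100-byte records that fit in the requested size."""
    return parse_size(arg) // RECORD_BYTES


print(record_count("100M"))  # 1000000 under the assumptions above
```

If the branch actually treats the suffix as a row count or uses binary multipliers, only the MULTIPLIERS table and the final division would need to change.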

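On the clone in TeraValidate: as far as I understand it, SPARK-1018 comes from Hadoop record readers handing back the same mutable object for every record, so anything that holds on to records without copying ends up with many references to the last one. The toy below reproduces that effect in plain Python. ReusingReader is a hypothetical stand-in, not a Spark or Hadoop class.

```python
class ReusingReader:
    """Toy record reader that recycles one buffer (hypothetical, for illustration)."""

    def __init__(self, rows):
        self._rows = rows
        self._buffer = bytearray()

    def __iter__(self):
        for row in self._rows:
            self._buffer[:] = row  # overwrite the shared buffer in place
            yield self._buffer     # the same object is yielded every time


rows = [b"bbb", b"aaa", b"ccc"]

# Holding on to the yielded objects keeps three references to one buffer,
# which ends up containing only the last record.
aliased = list(ReusingReader(rows))

# Copying each record as it is read keeps the records distinct.
copied = [bytes(record) for record in ReusingReader(rows)]

print(aliased)
print(copied)
```

A per-record copy like the one in the second list is roughly what the clone buys, at the cost of one extra allocation per record.
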