Shall I move the code to spark-perf, then, and submit a PR there? Or shall I submit a PR to spark, where it can remain an idiomatic example, and clone it into spark-perf, where it can potentially evolve non-idiomatic optimizations?

Yours,
Ewan

On 11/11/2014 07:58 PM, Reynold Xin wrote:
This is great. I think the consensus from last time was that we would put performance stuff into spark-perf, so it is easy to test different Spark versions.


On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

    Hi all,
    I saw that Reynold Xin had a Terasort example PR on GitHub[1].
    It didn't appear to be similar to the Hadoop Terasort example,
    so I've tried to brush it into shape so it can generate Terasort
    files (teragen), sort the files (terasort), and validate the
    files (teravalidate). My branch is available here:

    https://github.com/ehiggs/spark/tree/terasort

    With this code, you can run the following:

    # Generate 1M 100 byte records:
    ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in

    # Sort the file:
    MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in ~/data/terasort_out

    # Validate the file:
    MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_out ~/data/terasort_validate

    # Validate that an unsorted file is indeed not correctly sorted:
    MASTER=local[4] ./bin/run-example terasort.TeraValidate ~/data/terasort_in ~/data/terasort_validate_bad

    This matches the interface of the Hadoop version of Terasort,
    except that I added the ability to use K, M, G, and T suffixes
    for the data size argument in TeraGen (sketched below). This
    code therefore makes a good example of how to use Spark, how to
    read and write Hadoop files, and also a way to test some of the
    performance claims of Spark.
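
    For reference, the suffix handling is roughly this shape (a
    minimal sketch; sizeStrToBytes is an illustrative name, not
    necessarily what the branch calls it, and I'm assuming decimal
    multipliers here):

        def sizeStrToBytes(str: String): Long = {
          val lower = str.toLowerCase
          if (lower.endsWith("k")) lower.dropRight(1).toLong * 1000L
          else if (lower.endsWith("m")) lower.dropRight(1).toLong * 1000L * 1000L
          else if (lower.endsWith("g")) lower.dropRight(1).toLong * 1000L * 1000L * 1000L
          else if (lower.endsWith("t")) lower.dropRight(1).toLong * 1000L * 1000L * 1000L * 1000L
          else lower.toLong
        }

        // e.g. "100M" -> 100,000,000 bytes -> 1,000,000 records of 100 bytes
        val numRecords = sizeStrToBytes("100M") / 100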

    > That's great, but why is this on the mailing list and not
    > submitted as a PR?

    I suspect there are some rough edges and I'd really appreciate
    reviews. I would also like to know if others can try it out on
    clusters and tell me if it's performing as it should.

    For example, I find it runs fine on my local machine, but when I
    try to sort 100G of data on a cluster of 16 nodes, I get >2900
    file splits. This really eats into the sort time.
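
    In case anyone wants to experiment with this, the two knobs I
    know of are the Hadoop minimum split size and coalescing after
    the read. An untested sketch, not what the branch currently
    does:

        import org.apache.spark.{SparkConf, SparkContext}

        val sc = new SparkContext(new SparkConf().setAppName("TeraSort"))

        // Raise the minimum split size so large inputs produce fewer splits
        // (mapreduce.input.fileinputformat.split.minsize is the new-API
        // property; the old mapred-era name is mapred.min.split.size).
        sc.hadoopConfiguration.setLong(
          "mapreduce.input.fileinputformat.split.minsize", 512L * 1024 * 1024)

        // Or collapse partitions after reading, without a shuffle:
        // val fewer = records.coalesce(64, shuffle = false)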

    Another issue is that in TeraValidate, to work around SPARK-1018 I
    had to clone each element. Does this /really/ need to be done?
    It's pretty lame.
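
    For context, if I understand SPARK-1018 correctly, the clone is
    needed because Hadoop RecordReaders reuse the same Writable
    instance for every record, so anything that holds references
    across records (collect, caching, sorting the raw objects) sees
    aliased data. What I'm doing looks roughly like this (a sketch
    assuming (BytesWritable, BytesWritable) pairs; the branch's
    actual key/value types may differ):

        import org.apache.hadoop.io.BytesWritable
        import org.apache.spark.rdd.RDD

        // Copy each record out of the reused Writable buffers before
        // anything downstream holds on to it.
        def cloneRecords(raw: RDD[(BytesWritable, BytesWritable)])
            : RDD[(Array[Byte], Array[Byte])] =
          raw.map { case (k, v) => (k.copyBytes(), v.copyBytes()) }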

    In any event, I know the Spark 1.2 merge window closed on Friday
    but as this is only for the examples directory maybe we can slip
    it in if we can bash it into shape quickly enough?

    Anyway, thanks to everyone on #apache-spark and #scala who
    helped me pick up enough rudimentary Scala to get this far.

    Yours,
    Ewan Higgs

    [1] https://github.com/apache/spark/pull/1242
