Re: Spark TeraSort source request

Tom Hubregtsen Mon, 13 Apr 2015 07:44:30 -0700

Thank you for your response, Ewan. I took a quick look yesterday and it was
there, but when I tried to open it again at work today to start working on
it, it appeared to have been removed. Is this correct?


Thanks,

Tom

On 12 April 2015 at 06:58, Ewan Higgs <ewan.hi...@ugent.be> wrote:

>  Hi all.
> The code is linked from my repo:
>
> https://github.com/ehiggs/spark-terasort
> "
> This is an example Spark program for running TeraSort benchmarks. It is
> based on work from Reynold Xin's branch
> <https://github.com/rxin/spark/tree/terasort>, but it is not the same
> TeraSort program that currently holds the record
> <http://sortbenchmark.org/>. That program is here
> <https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort>
> .
> "
>
> "That program is here" links to:
>
> https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort
>
> I've been working on other projects at the moment so I haven't returned to
> the spark-terasort stuff. If you have any pull requests, I would be very
> grateful.
>
> Yours,
> Ewan
>
>
> On 08/04/15 03:26, Pramod Biligiri wrote:
>
> +1. I would love to have the code for this as well.
>
>  Pramod
>
> On Fri, Apr 3, 2015 at 12:47 PM, Tom <thubregt...@gmail.com> wrote:
>
>> Hi all,
>>
>> As we all know, Spark has set the record for sorting data, as published
>> on:
>> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
>>
>> Here at our group, we would love to verify these results and compare
>> machines using this benchmark. We've spent quite some time trying to find
>> the TeraSort source code that was used, but cannot find it anywhere.
>>
>> We did find two candidates:
>>
>> A version posted by Reynold [1], the poster of the message above. This
>> version is stuck at "    // TODO: Add partition-local (external) sorting
>> using TeraSortRecordOrdering", only generating data.
>>
>> Here, Ewan noticed that "it didn't appear to be similar to Hadoop
>> TeraSort." [2] After this he created a version of his own [3]. With this
>> version, we noticed problems with TeraValidate on datasets above ~10G (as
>> mentioned by others at [4]). When examining the raw input and output
>> files, it actually appears that the input data is sorted and the output
>> data unsorted in both cases.
>>
>> Because of this, we believe we have not yet found the source code that was
>> actually used. I've searched the Spark User forum archives and saw requests
>> from other people, indicating a demand, but did not succeed in finding the
>> actual source code.
>>
>> My question:
>> Could you please make the source code of the TeraSort program that was
>> used, preferably with settings, available? If not, what are the reasons
>> that this seems to be withheld?
>>
>> Thanks for any help,
>>
>> Tom Hubregtsen
>>
>> [1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
>> [2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
>> [3] https://github.com/ehiggs/spark-terasort
>> [4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E
>>
>>
>>
>>
>
>
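
For anyone who wants to repeat the sortedness check described in the original
message on a raw input or output file, here is a minimal sketch. It assumes
the standard TeraGen layout of 100-byte records whose first 10 bytes are the
key, compared as unsigned bytes. The function name is made up for
illustration and is not part of Spark, Hadoop, or spark-terasort.

```python
from typing import BinaryIO, Optional

RECORD_LEN = 100  # assumed TeraGen record size in bytes
KEY_LEN = 10      # assumed key length: the leading bytes of each record


def first_unsorted_record(stream: BinaryIO) -> Optional[int]:
    """Return the index of the first record whose key is smaller than the
    previous record's key, or None if the keys never decrease."""
    prev_key = None
    index = 0
    while True:
        record = stream.read(RECORD_LEN)
        if not record:
            # Reached end of file without finding an out-of-order key.
            return None
        if len(record) != RECORD_LEN:
            raise ValueError(f"truncated record at index {index}")
        key = record[:KEY_LEN]
        # Python compares bytes objects lexicographically by unsigned value.
        if prev_key is not None and key < prev_key:
            return index
        prev_key = key
        index += 1
```

Open a single part file in binary mode and pass the file object to the
function. This checks one file at a time only; it does not verify ordering
across the separate part files of a job's output, which is something
TeraValidate also covers.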
