Thank you for your response, Ewan. I had a quick look yesterday and it was there, but when I tried to open it again at work today to start working on it, it appeared to have been removed. Is this correct?
Thanks,
Tom

On 12 April 2015 at 06:58, Ewan Higgs <ewan.hi...@ugent.be> wrote:

> Hi all.
> The code is linked from my repo:
>
> https://github.com/ehiggs/spark-terasort
> "
> This is an example Spark program for running TeraSort benchmarks. It is
> based on work from Reynold Xin's branch
> <https://github.com/rxin/spark/tree/terasort>, but it is not the same
> TeraSort program that currently holds the record
> <http://sortbenchmark.org/>. That program is here
> <https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort>
> .
> "
>
> "That program is here" links to:
>
> https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort
>
> I've been working on other projects at the moment so I haven't returned to
> the spark-terasort stuff. If you have any pull requests, I would be very
> grateful.
>
> Yours,
> Ewan
>
>
> On 08/04/15 03:26, Pramod Biligiri wrote:
>
> +1. I would love to have the code for this as well.
>
> Pramod
>
> On Fri, Apr 3, 2015 at 12:47 PM, Tom <thubregt...@gmail.com> wrote:
>
>> Hi all,
>>
>> As we all know, Spark has set the record for sorting data, as published
>> on:
>> https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.
>>
>> Here at our group, we would love to verify these results and compare
>> machines using this benchmark. We've spent quite some time trying to find
>> the TeraSort source code that was used, but cannot find it anywhere.
>>
>> We did find two candidates:
>>
>> A version posted by Reynold [1], the poster of the message above. This
>> version is stuck at " // TODO: Add partition-local (external) sorting
>> using TeraSortRecordOrdering", only generating data.
>>
>> Here, Ewan noticed that "it didn't appear to be similar to Hadoop
>> TeraSort." [2] After this he created a version of his own [3]. With this
>> version, we noticed problems with TeraValidate with datasets above ~10G
>> (as mentioned by others at [4]). When examining the raw input and output
>> files, it actually appears that the input data is sorted and the output
>> data unsorted in both cases.
>>
>> Because of this, we believe we have not yet found the source code that
>> was actually used. I've tried searching the Spark User forum archives,
>> where I saw requests from other people, indicating a demand, but did not
>> succeed in finding the actual source code.
>>
>> My question:
>> Could you guys please make the source code of the TeraSort program that
>> was used, preferably with settings, available? If not, what are the
>> reasons that this seems to be withheld?
>>
>> Thanks for any help,
>>
>> Tom Hubregtsen
>>
>> [1]
>> https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
>> [2]
>> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
>> [3] https://github.com/ehiggs/spark-terasort
>> [4]
>> http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E