Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps 
Ethernet and 12 disks. For 100 TB of data, the intermediate data could fit in 
memory on this cluster, which can make shuffle much faster than with 
intermediate data on SSDs. You can find the specs in 
http://sortbenchmark.org/Yahoo2013Sort.pdf. It just takes effort to utilize 
modern machines fully -- for instance the Yahoo! cluster had 1 TB/s network 
bandwidth, but only sorted data at 0.02 TB/s. Systems optimized for sorting, 
like TritonSort (which also won this year's benchmark), get much closer to full 
utilization.
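
As a rough sanity check on those numbers, here is a small Python sketch. It assumes the 2100-node count quoted further down in this thread, uses decimal units (1 TB = 1000 GB), and ignores memory taken by the OS and the framework, so treat it as back-of-envelope arithmetic only. The function and variable names are made up for illustration.

```python
# Back-of-envelope check of the rounded figures quoted in this thread.
# Assumption: decimal units (1 TB = 1000 GB); no allowance for OS/JVM overhead.

NODES = 2100              # cluster size, from the quoted reply in this thread
RAM_PER_NODE_GB = 64      # per-node RAM from the specs above
DATASET_TB = 100          # size of the sort input
NETWORK_TB_PER_S = 1.0    # quoted network bandwidth of the cluster
SORT_TB_PER_S = 0.02      # quoted rate at which data was actually sorted


def aggregate_ram_tb(nodes, ram_per_node_gb):
    """Total RAM across the cluster, in TB."""
    return nodes * ram_per_node_gb / 1000


def utilization(achieved, available):
    """Fraction of the available rate that was actually achieved."""
    return achieved / available


total_ram_tb = aggregate_ram_tb(NODES, RAM_PER_NODE_GB)
print(f"Aggregate RAM: {total_ram_tb:.1f} TB for {DATASET_TB} TB of data")
print(f"Fits in memory: {total_ram_tb >= DATASET_TB}")
print(f"Network utilization: {utilization(SORT_TB_PER_S, NETWORK_TB_PER_S):.0%}")
```

With these inputs the aggregate RAM comes out above the 100 TB input, and the sort rate is about 2% of the quoted network bandwidth.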

Matei

> On Nov 5, 2014, at 4:10 PM, Reynold Xin <r...@databricks.com> wrote:
> 
> Steve,
> 
> I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I
> updated the blog post to actually include CPU / disk / network measures.
> You should see that in any measure that matters to this benchmark, the old
> 2100 node cluster is vastly superior. The data even fit in memory!
> 
> 
> 
> On Wed, Nov 5, 2014 at 4:07 PM, Steve Nunez <snu...@hortonworks.com> wrote:
> 
>> Nicholas,
>> 
>> I never doubted the authenticity of the benchmark, nor the results. What I
>> think could be better is an objective analysis of the results. That post
>> neglected to point out the significant differences in hardware those two
>> benchmarks were run on. It is a bit like bragging you broke the world record
>> at the Nürburgring in a 2014 1000hp LaFerrari and somehow forgetting to
>> mention that the last record was held by a 2001 Toyota Celica.
>> 
>> - Steve
>> 
>> 
>> From:  Nicholas Chammas <nicholas.cham...@gmail.com>
>> Date:  Wednesday, November 5, 2014 at 15:56
>> To:  Steve Nunez <snu...@hortonworks.com>
>> Cc:  Patrick Wendell <pwend...@gmail.com>, dev <dev@spark.apache.org>
>> Subject:  Re: Surprising Spark SQL benchmark
>> 
>>> Steve Nunez, I believe the information behind the links below should
>>> address your earlier concerns about Databricks's submission to the
>>> Daytona Gray benchmark.
>>> 
>>> On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>> On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>> 
>>>>> I believe that benchmark has a pending certification on it. See
>>>>> http://sortbenchmark.org under "Process".
>>>> Regarding this comment, Reynold has just announced that this benchmark
>>>> is now certified.
>>>> * Announcement:
>>>> http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
>>>> * Updated benchmark results page: http://sortbenchmark.org/
>>>> * Paper detailing Spark cluster configuration for the benchmark:
>>>> http://sortbenchmark.org/ApacheSpark2014.pdf
>>>> Nick
>>>> 
>>> 
>> 
>> 
>> 

