There's been an effort in the AMPLab at Berkeley to set up a shared codebase that makes it easy to run TPC-DS on Spark SQL, since we do that frequently in the lab to evaluate new research. Based on this thread, it sounds like making it more widely available would help folks reproduce the results published by Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the list as soon as we're done.

-Kay

On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <[email protected]> wrote:

> I believe that benchmark has a pending certification on it. See
> http://sortbenchmark.org under "Process".
>
> It's true they did not share enough details on the blog for readers to
> reproduce the benchmark, but they will have to share enough with the
> committee behind the benchmark in order to be certified. Given that this is
> a benchmark not many people will be able to reproduce due to size and
> complexity, I don't see it as a big negative that the details are not laid
> out as long as there is independent certification from a third party.
>
> From what I've seen so far, the best big data benchmark anywhere is this:
> https://amplab.cs.berkeley.edu/benchmark/
>
> It has all the details you'd expect, including hosted datasets, to allow
> anyone to reproduce the full benchmark, covering a number of systems. I
> look forward to the next update to that benchmark (a lot has changed since
> Feb). And from what I can tell, it's produced by the same people behind
> Spark (Patrick being among them).
>
> So I disagree that the Spark community "hasn't been any better" in this
> regard.
>
> Nick
>
> On Friday, October 31, 2014, Steve Nunez <[email protected]> wrote:
>
> > To be fair, we (Spark community) haven’t been any better, for example
> > this benchmark:
> >
> > https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
> >
> > For which no details or code have been released to allow others to
> > reproduce it. I would encourage anyone doing a Spark benchmark in future
> > to avoid the stigma of vendor-reported benchmarks and publish enough
> > information and code to let others repeat the exercise easily.
> >
> > - Steve
> >
> > On 10/31/14, 11:30, "Nicholas Chammas" <[email protected]> wrote:
> >
> > > Thanks for the response, Patrick.
> > >
> > > I guess the key takeaways are 1) the tuning/config details are
> > > everything (they're not laid out here), 2) the benchmark should be
> > > reproducible (it's not), and 3) reach out to the relevant devs before
> > > publishing (didn't happen).
> > >
> > > Probably key takeaways for any kind of benchmark, really...
> > >
> > > Nick
> > >
> > > On Friday, October 31, 2014, Patrick Wendell <[email protected]> wrote:
> > >
> > > > Hey Nick,
> > > >
> > > > Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
> > > > developers when running this. It is really easy to make one system
> > > > look better than others when you are running a benchmark yourself
> > > > because tuning and sizing can lead to a 10X performance improvement.
> > > > This benchmark doesn't share the mechanism in a reproducible way.
> > > >
> > > > There are a bunch of things that aren't clear here:
> > > >
> > > > 1. Spark SQL has optimized Parquet features; were these turned on?
> > > > 2. It doesn't mention computing statistics in Spark SQL, but it does
> > > > this for Impala and Parquet. Statistics allow Spark SQL to broadcast
> > > > small tables, which can make a 10X difference in TPC-H.
> > > > 3. For data larger than memory, Spark SQL often performs better if you
> > > > don't call "cache"; did they try this?
> > > >
> > > > Basically, a self-reported marketing benchmark like this that
> > > > *shocker* concludes this vendor's solution is the best, is not
> > > > particularly useful.
> > > >
> > > > If Citus Data wants to run a credible benchmark, I'd invite them to
> > > > directly involve Spark SQL developers in the future. Until then, I
> > > > wouldn't give much credence to this or any other similar vendor
> > > > benchmark.
> > > >
> > > > - Patrick
> > > >
> > > > On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
> > > > <[email protected]> wrote:
> > > > > I know we don't want to be jumping at every benchmark someone posts
> > > > > out there, but this one surprised me:
> > > > >
> > > > > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
> > > > >
> > > > > This benchmark has Spark SQL failing to complete several queries in
> > > > > the TPC-H benchmark. I don't understand much about the details of
> > > > > performing benchmarks, but this was surprising.
> > > > >
> > > > > Are these results expected?
> > > > >
> > > > > Related HN discussion here:
> > > > > https://news.ycombinator.com/item?id=8539678
> > > > >
> > > > > Nick
