Hi Nick,

No -- we're doing something much more constrained: just getting things set up so that it's easy to run TPC-DS on SparkSQL (which involves generating the data, storing it in HDFS, getting all the queries in the right format, etc.). Cloudera does have a repo, https://github.com/cloudera/impala-tpcds-kit, that we've found helpful in running TPC-DS on Hive (you should also be able to use that repo to run TPC-DS on Impala, although we haven't actually done this).
-Kay

On Sat, Nov 1, 2014 at 10:50 AM, Nicholas Chammas <[email protected]> wrote:

> Kay,
>
> Is this effort related to the existing AMPLab Big Data benchmark that
> covers Spark, Redshift, Tez, and Impala?
>
> Nick
>
>
> On Friday, October 31, 2014, Kay Ousterhout <[email protected]> wrote:
>
>> There's been an effort in the AMPLab at Berkeley to set up a shared
>> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
>> we do frequently in the lab to evaluate new research. Based on this
>> thread, it sounds like making this more widely available is something that
>> would be useful to folks for reproducing the results published by
>> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
>> list as soon as we're done.
>>
>> -Kay
>>
>> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <[email protected]> wrote:
>>
>>> I believe that benchmark has a pending certification on it. See
>>> http://sortbenchmark.org under "Process".
>>>
>>> It's true they did not share enough details on the blog for readers to
>>> reproduce the benchmark, but they will have to share enough with the
>>> committee behind the benchmark in order to be certified. Given that this
>>> is a benchmark not many people will be able to reproduce due to size and
>>> complexity, I don't see it as a big negative that the details are not
>>> laid out as long as there is independent certification from a third party.
>>>
>>> From what I've seen so far, the best big data benchmark anywhere is this:
>>> https://amplab.cs.berkeley.edu/benchmark/
>>>
>>> It has all the details you'd expect, including hosted datasets, to allow
>>> anyone to reproduce the full benchmark, covering a number of systems. I
>>> look forward to the next update to that benchmark (a lot has changed
>>> since Feb). And from what I can tell, it's produced by the same people
>>> behind Spark (Patrick being among them).
>>>
>>> So I disagree that the Spark community "hasn't been any better" in this
>>> regard.
>>>
>>> Nick
>>>
>>>
>>> On Friday, October 31, 2014, Steve Nunez <[email protected]> wrote:
>>>
>>> > To be fair, we (Spark community) haven’t been any better, for example
>>> > this benchmark:
>>> >
>>> > https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>> >
>>> > For which no details or code have been released to allow others to
>>> > reproduce it. I would encourage anyone doing a Spark benchmark in
>>> > future to avoid the stigma of vendor-reported benchmarks and publish
>>> > enough information and code to let others repeat the exercise easily.
>>> >
>>> > - Steve
>>> >
>>> >
>>> > On 10/31/14, 11:30, "Nicholas Chammas" <[email protected]> wrote:
>>> >
>>> > >Thanks for the response, Patrick.
>>> > >
>>> > >I guess the key takeaways are 1) the tuning/config details are
>>> > >everything (they're not laid out here), 2) the benchmark should be
>>> > >reproducible (it's not), and 3) reach out to the relevant devs before
>>> > >publishing (didn't happen).
>>> > >
>>> > >Probably key takeaways for any kind of benchmark, really...
>>> > >
>>> > >Nick
>>> > >
>>> > >
>>> > >On Friday, October 31, 2014, Patrick Wendell <[email protected]> wrote:
>>> > >
>>> > >> Hey Nick,
>>> > >>
>>> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>>> > >> developers when running this. It is really easy to make one system
>>> > >> look better than others when you are running a benchmark yourself
>>> > >> because tuning and sizing can lead to a 10X performance improvement.
>>> > >> This benchmark doesn't share the mechanism in a reproducible way.
>>> > >>
>>> > >> There are a bunch of things that aren't clear here:
>>> > >>
>>> > >> 1. Spark SQL has optimized Parquet features; were these turned on?
>>> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
>>> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>>> > >> small tables which can make a 10X difference in TPC-H.
>>> > >> 3. For data larger than memory, Spark SQL often performs better if you
>>> > >> don't call "cache"; did they try this?
>>> > >>
>>> > >> Basically, a self-reported marketing benchmark like this that
>>> > >> *shocker* concludes this vendor's solution is the best, is not
>>> > >> particularly useful.
>>> > >>
>>> > >> If Citus Data wants to run a credible benchmark, I'd invite them to
>>> > >> directly involve Spark SQL developers in the future. Until then, I
>>> > >> wouldn't give much credence to this or any other similar vendor
>>> > >> benchmark.
>>> > >>
>>> > >> - Patrick
>>> > >>
>>> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>>> > >> <[email protected]> wrote:
>>> > >> > I know we don't want to be jumping at every benchmark someone posts
>>> > >> > out there, but this one surprised me:
>>> > >> >
>>> > >> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>> > >> >
>>> > >> > This benchmark has Spark SQL failing to complete several queries in
>>> > >> > the TPC-H benchmark. I don't understand much about the details of
>>> > >> > performing benchmarks, but this was surprising.
>>> > >> >
>>> > >> > Are these results expected?
>>> > >> >
>>> > >> > Related HN discussion here:
>>> > >> > https://news.ycombinator.com/item?id=8539678
>>> > >> >
>>> > >> > Nick
