Re: Surprising Spark SQL benchmark

Kay Ousterhout Sat, 01 Nov 2014 12:36:46 -0700

Hi Nick,

No -- we're doing a much more constrained thing of just trying to get
things set up to easily run TPC-DS on SparkSQL (which involves generating
the data, storing it in HDFS, getting all the queries in the right format,
etc.).
Cloudera does have a repo here: https://github.com/cloudera/impala-tpcds-kit
that we've found helpful in running TPC-DS on Hive (you should also be able
to use that repo to run TPC-DS on Impala, although we haven't actually done
this).


-Kay

On Sat, Nov 1, 2014 at 10:50 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Kay,
>
> Is this effort related to the existing AMPLab Big Data benchmark that
> covers Spark, Redshift, Tez, and Impala?
>
> Nick
>
>
> On Friday, October 31, 2014, Kay Ousterhout <k...@eecs.berkeley.edu> wrote:
>
>> There's been an effort in the AMPLab at Berkeley to set up a shared
>> codebase that makes it easy to run TPC-DS on SparkSQL, since it's something
>> we do frequently in the lab to evaluate new research.  Based on this
>> thread, it sounds like making this more widely-available is something that
>> would be useful to folks for reproducing the results published by
>> Databricks / Hortonworks / Cloudera / etc.; we'll share the code on the
>> list as soon as we're done.
>>
>> -Kay
>>
>> On Fri, Oct 31, 2014 at 12:45 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I believe that benchmark has a pending certification on it. See
>>> http://sortbenchmark.org under "Process".
>>>
>>> It's true they did not share enough details on the blog for readers to
>>> reproduce the benchmark, but they will have to share enough with the
>>> committee behind the benchmark in order to be certified. Given that this is
>>> a benchmark not many people will be able to reproduce due to size and
>>> complexity, I don't see it as a big negative that the details are not laid
>>> out as long as there is independent certification from a third party.
>>>
>>> From what I've seen so far, the best big data benchmark anywhere is this:
>>> https://amplab.cs.berkeley.edu/benchmark/
>>>
>>> It has all the details you'd expect, including hosted datasets, to allow
>>> anyone to reproduce the full benchmark, covering a number of systems. I
>>> look forward to the next update to that benchmark (a lot has changed since
>>> Feb). And from what I can tell, it's produced by the same people behind
>>> Spark (Patrick being among them).
>>>
>>> So I disagree that the Spark community "hasn't been any better" in this
>>> regard.
>>>
>>> Nick
>>>
>>>
>>> On Friday, October 31, 2014, Steve Nunez <snu...@hortonworks.com> wrote:
>>>
>>> > To be fair, we (Spark community) haven’t been any better, for example this
>>> > benchmark:
>>> >
>>> >
>>> > https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
>>> >
>>> >
>>> > For which no details or code have been released to allow others to
>>> > reproduce it. I would encourage anyone doing a Spark benchmark in future
>>> > to avoid the stigma of vendor reported benchmarks and publish enough
>>> > information and code to let others repeat the exercise easily.
>>> >
>>> >         - Steve
>>> >
>>> >
>>> >
>>> > On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:
>>> >
>>> > >Thanks for the response, Patrick.
>>> > >
>>> > >I guess the key takeaways are 1) the tuning/config details are everything
>>> > >(they're not laid out here), 2) the benchmark should be reproducible (it's
>>> > >not), and 3) reach out to the relevant devs before publishing (didn't
>>> > >happen).
>>> > >
>>> > >Probably key takeaways for any kind of benchmark, really...
>>> > >
>>> > >Nick
>>> > >
>>> > >
>>> > >On Friday, October 31, 2014, Patrick Wendell <pwend...@gmail.com> wrote:
>>> > >
>>> > >> Hey Nick,
>>> > >>
>>> > >> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>>> > >> developers when running this. It is really easy to make one system
>>> > >> look better than others when you are running a benchmark yourself
>>> > >> because tuning and sizing can lead to a 10X performance improvement.
>>> > >> This benchmark doesn't share the mechanism in a reproducible way.
>>> > >>
>>> > >> There are a bunch of things that aren't clear here:
>>> > >>
>>> > >> 1. Spark SQL has optimized Parquet features; were these turned on?
>>> > >> 2. It doesn't mention computing statistics in Spark SQL, but it does
>>> > >> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>>> > >> small tables which can make a 10X difference in TPC-H.
>>> > >> 3. For data larger than memory, Spark SQL often performs better if you
>>> > >> don't call "cache"; did they try this?
>>> > >>
>>> > >> Basically, a self-reported marketing benchmark like this that
>>> > >> *shocker* concludes this vendor's solution is the best, is not
>>> > >> particularly useful.
>>> > >>
>>> > >> If Citus Data wants to run a credible benchmark, I'd invite them to
>>> > >> directly involve Spark SQL developers in the future. Until then, I
>>> > >> wouldn't give much credence to this or any other similar vendor
>>> > >> benchmark.
>>> > >>
>>> > >> - Patrick
>>> > >>
>>> > >> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>>> > >> <nicholas.cham...@gmail.com> wrote:
>>> > >> > I know we don't want to be jumping at every benchmark someone posts out
>>> > >> > there, but this one surprised me:
>>> > >> >
>>> > >> >
>>> > >> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>>> > >> >
>>> > >> > This benchmark has Spark SQL failing to complete several queries in the
>>> > >> > TPC-H benchmark. I don't understand much about the details of performing
>>> > >> > benchmarks, but this was surprising.
>>> > >> >
>>> > >> > Are these results expected?
>>> > >> >
>>> > >> > Related HN discussion here:
>>> > >> > https://news.ycombinator.com/item?id=8539678
>>> > >> >
>>> > >> > Nick
>>> > >>
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
