To be fair, we (the Spark community) haven't been any better. Take this
benchmark, for example:

        https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html


No details or code have been released to allow others to reproduce it. I
would encourage anyone doing a Spark benchmark in future to avoid the
stigma of vendor-reported benchmarks and publish enough information and
code to let others repeat the exercise easily.

        - Steve



On 10/31/14, 11:30, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:

>Thanks for the response, Patrick.
>
>I guess the key takeaways are 1) the tuning/config details are everything
>(they're not laid out here), 2) the benchmark should be reproducible (it's
>not), and 3) reach out to the relevant devs before publishing (didn't
>happen).
>
>Probably key takeaways for any kind of benchmark, really...
>
>Nick
>
>
>On Friday, October 31, 2014, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hey Nick,
>>
>> Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
>> developers when running this. It is really easy to make one system
>> look better than others when you are running a benchmark yourself
>> because tuning and sizing can lead to a 10X performance improvement.
>> This benchmark doesn't share the mechanism in a reproducible way.
>>
>> There are a bunch of things that aren't clear here:
>>
>> 1. Spark SQL has optimized Parquet features. Were these turned on?
>> 2. It doesn't mention computing statistics in Spark SQL, but it does
>> this for Impala and Parquet. Statistics allow Spark SQL to broadcast
>> small tables, which can make a 10X difference in TPC-H.
>> 3. For data larger than memory, Spark SQL often performs better if you
>> don't call "cache". Did they try this?
>>
>> Basically, a self-reported marketing benchmark like this that
>> *shocker* concludes this vendor's solution is the best, is not
>> particularly useful.
>>
>> If Citus Data wants to run a credible benchmark, I'd invite them to
>> directly involve Spark SQL developers in the future. Until then, I
>> wouldn't give much credence to this or any other similar vendor
>> benchmark.
>>
>> - Patrick
>>
>> On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > I know we don't want to be jumping at every benchmark someone posts out
>> > there, but this one surprised me:
>> >
>> > http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>> >
>> > This benchmark has Spark SQL failing to complete several queries in the
>> > TPC-H benchmark. I don't understand much about the details of performing
>> > benchmarks, but this was surprising.
>> >
>> > Are these results expected?
>> >
>> > Related HN discussion here: https://news.ycombinator.com/item?id=8539678
>> >
>> > Nick
>>



