Hey Nick,

Unfortunately, Citus Data didn't contact any of the Spark or Spark SQL
developers when running this benchmark. It is really easy to make one
system look better than the others when you run a benchmark yourself,
because tuning and sizing alone can swing performance by 10X. The
writeup also doesn't share its configuration or methodology in a way
that others can reproduce.

There are a bunch of things that aren't clear here:

1. Spark SQL has optimized Parquet support. Were those features turned
on?
2. It doesn't mention computing statistics in Spark SQL, though it does
so for Impala and Parquet. Statistics let Spark SQL broadcast small
tables, which can make a 10X difference on TPC-H (see the sketch after
this list).
3. For data larger than memory, Spark SQL often performs better if you
don't call "cache". Did they try this?
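
To make that concrete, here's a rough sketch of the kind of setup I'd
want to see documented (Spark 1.1-era APIs; the table names are just
the standard TPC-H ones and the threshold value is illustrative, not a
recommendation):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(
    new SparkConf().setAppName("tpch-tuning-sketch"))
  val hc = new HiveContext(sc)

  // Compute table-size statistics so the planner knows the dimension
  // tables are small enough to broadcast instead of shuffling.
  hc.sql("ANALYZE TABLE nation COMPUTE STATISTICS noscan")
  hc.sql("ANALYZE TABLE region COMPUTE STATISTICS noscan")

  // Tables below this size (in bytes) get broadcast in joins; the
  // value here is only an example.
  hc.setConf("spark.sql.autoBroadcastJoinThreshold",
    (50 * 1024 * 1024).toString)

  // Query the Parquet-backed tables directly. Note that we do NOT
  // call cacheTable() here, since the data is larger than memory.
  hc.sql("SELECT count(*) FROM lineitem").collect().foreach(println)

Whether settings like these were used is exactly the sort of detail
the writeup leaves out.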

Basically, a self-reported marketing benchmark like this, which
*shocker* concludes that the vendor's own solution is the best, is not
particularly useful.

If Citus Data wants to run a credible benchmark, I'd invite them to
involve the Spark SQL developers directly in the future. Until then, I
wouldn't give much credence to this or any similar vendor-run
benchmark.

- Patrick

On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:
> I know we don't want to be jumping at every benchmark someone posts out
> there, but this one surprised me:
>
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
> This benchmark has Spark SQL failing to complete several queries in the
> TPC-H benchmark. I don't understand much about the details of performing
> benchmarks, but this was surprising.
>
> Are these results expected?
>
> Related HN discussion here: https://news.ycombinator.com/item?id=8539678
>
> Nick
