Re: Please Help: Amplab Benchmark Performance

Christopher Nguyen Wed, 29 Jan 2014 21:00:47 -0800

Chen, interesting comparisons you're trying to make. It would be great to
share this somewhere when you're done.

Some suggestions of non-obvious things to consider:

In general there are any number of differences between Shark and some
"equivalent" Spark implementation of the same query.

Shark isn't simply a line-for-line translation of a query into the Spark
code you might write by hand. Its current implementation is based on Hive,
which has its own query planning, optimization, and execution, and Shark's
code adds some tricks of its own. You can use "EXPLAIN" to see Shark's
execution plan and compare it to your Spark approach.

Further, Shark has its own in-memory storage format, a typed,
column-oriented RDD[TablePartition], that can make it more memory-efficient
and help it execute many column aggregation queries a lot faster than the
row-oriented RDD[Array[String]] you may be using.
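
To make the row-versus-column point concrete, here is a rough sketch in
plain Scala (no Spark or Shark dependency). It is not Shark's actual
TablePartition code. The names ColumnarSketch and Partition, and the field
positions (sourceIP at index 0, adRevenue at index 1), are hypothetical,
and I am assuming the aggregation sums a revenue column grouped by the
first X characters of sourceIP:

```scala
object ColumnarSketch {
  // Row-oriented layout: every field is a String, so the numeric column
  // has to be parsed again each time a query touches it.
  def sumByPrefixRows(rows: Seq[Array[String]], x: Int): Map[String, Double] =
    rows
      .map(r => (r(0).take(x), r(1).toDouble)) // parse on every scan
      .groupBy(_._1)
      .map { case (prefix, pairs) => prefix -> pairs.map(_._2).sum }

  // Column-oriented layout (hypothetical stand-in for a typed partition):
  // one typed array per column, so adRevenue is stored as primitive
  // doubles and unused columns are never read.
  final case class Partition(sourceIP: Array[String], adRevenue: Array[Double])

  def sumByPrefixColumns(p: Partition, x: Int): Map[String, Double] = {
    val acc = scala.collection.mutable.HashMap.empty[String, Double]
    var i = 0
    while (i < p.sourceIP.length) {
      val prefix = p.sourceIP(i).take(x)
      acc(prefix) = acc.getOrElse(prefix, 0.0) + p.adRevenue(i)
      i += 1
    }
    acc.toMap
  }
}
```

Both functions return the same totals. The difference is that the columnar
version reads only the two columns it needs and never re-parses strings
into numbers, which is the kind of saving a typed column store can give on
aggregation queries.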

In short, Shark does a number of things that are smarter and more optimized
for SQL queries than a straightforward Spark RDD implementation of the same.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <karen...@gmail.com> wrote:

> Hi All,
>
> https://amplab.cs.berkeley.edu/benchmark/ has given a nice benchmark
> report. I am trying to reproduce the same set of queries in the
> spark-shell so that we can understand more about shark and spark and
> their performance on EC2.
>
> As for the Aggregation Query when X=8, Shark-disk takes 210 seconds
> and Shark-mem takes 111 seconds. However, when I materialize the
> results to disk, spark-shell takes more than 5 minutes
> (reduceByKey is used in the shell for aggregation). Further, if I
> cache the uservisits RDD, the performance deteriorates quite a lot
> because the dataset is way too big.
>
> Can anybody shed some light on why there is a more than 2x difference
> between shark-disk and spark-shell-disk, and how to cache data in spark
> correctly so that we can achieve performance comparable to
> shark-mem?
>
> Thank you very much,
>
> -chen
>
