Chen, interesting comparisons you're trying to make. It would be great to share this somewhere when you're done.

Some suggestions of non-obvious things to consider:

In general, there can be any number of differences between Shark and an "equivalent" Spark implementation of the same query. Shark does not simply translate a query into the lines of Spark code we might write by hand to accomplish the same thing. Its current implementation is based on Hive, which has its own query planning, optimization, and execution, and Shark's code adds some tricks of its own. You can use "EXPLAIN" to see Shark's execution plan and compare it to your Spark approach.

Further, Shark has its own in-memory storage format, a typed, column-oriented RDD[TablePartition], which can make it more memory-efficient and help it execute many column-aggregation queries a lot faster than the row-oriented RDD[Array[String]] you may be using.

In short, Shark does a number of things that are smarter and more optimized for SQL queries than a straightforward Spark RDD implementation of the same query.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <karen...@gmail.com> wrote:
> Hi All,
>
> https://amplab.cs.berkeley.edu/benchmark/ has given a nice benchmark
> report. I am trying to reproduce the same set of queries in the
> spark-shell so that we can understand more about shark and spark and
> their performance on EC2.
>
> As for the Aggregation Query when X=8, Shark-disk takes 210 seconds
> and Shark-mem takes 111 seconds. However, when I materialize the
> results to the disk, spark-shell takes more than 5 minutes
> (reduceByKey is used in the shell for aggregation). Further, if I
> cache uservisits RDD, since the dataset is way too big, the
> performance deteriorates quite a lot.
>
> Can anybody shed some light on why there is a more than 2x difference
> between shark-disk and spark-shell-disk and how to cache data in spark
> correctly such that we can achieve comparable performance as
> shark-mem?
>
> Thank you very much,
>
> -chen
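
To make the row-oriented versus column-oriented point in the reply concrete, here is a minimal sketch in plain Python. It is not Shark's or Spark's code. It assumes the aggregation groups by the first X characters of a source-IP field and sums an ad-revenue field; the sample records and the names RAW_ROWS, aggregate_rows, to_columns, and aggregate_columns are made up for illustration.

```python
from array import array
from collections import defaultdict

# Hypothetical uservisits-like records: [sourceIP, destURL, adRevenue].
# Every field is a string, as in a row-oriented RDD[Array[String]].
RAW_ROWS = [
    ["10.20.30.40", "a.com", "0.50"],
    ["10.20.30.41", "b.com", "1.25"],
    ["10.20.99.1", "c.com", "2.00"],
    ["172.16.0.1", "a.com", "0.25"],
]


def aggregate_rows(rows, x):
    """Row-oriented: walk whole rows and parse the revenue string on every pass."""
    totals = defaultdict(float)
    for fields in rows:
        totals[fields[0][:x]] += float(fields[2])
    return dict(totals)


def to_columns(rows):
    """Column-oriented: parse once and keep only the columns the query needs.

    The revenue column is stored as a packed array of doubles instead of
    one string object per value.
    """
    source_ip = [fields[0] for fields in rows]
    ad_revenue = array("d", (float(fields[2]) for fields in rows))
    return source_ip, ad_revenue


def aggregate_columns(source_ip, ad_revenue, x):
    """Aggregate over two columns; no per-row parsing, unused columns never touched."""
    totals = defaultdict(float)
    for ip, revenue in zip(source_ip, ad_revenue):
        totals[ip[:x]] += revenue
    return dict(totals)


if __name__ == "__main__":
    columns = to_columns(RAW_ROWS)
    print(aggregate_rows(RAW_ROWS, 8))
    print(aggregate_columns(*columns, 8))
```

Both functions return the same totals. The difference is that the columnar version pays the string-to-number conversion once, stores numeric values compactly, and skips the destURL column entirely, which is the kind of saving a typed column store can offer on aggregation queries.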