Hi all,

The benchmark report at https://amplab.cs.berkeley.edu/benchmark/ is very useful. I am trying to reproduce the same set of queries in the spark-shell so that we can better understand Shark and Spark and their performance on EC2.
For the Aggregation Query with X=8, the report gives 210 seconds for Shark-disk and 111 seconds for Shark-mem. However, when I run the same aggregation in the spark-shell (using reduceByKey) and materialize the results to disk, it takes more than 5 minutes. Further, if I cache the uservisits RDD, performance deteriorates considerably, since the dataset is far too big to fit in memory.

Can anybody shed some light on two things?

1. Why is there a more than 2x difference between Shark-disk and spark-shell with results written to disk?
2. How should the data be cached in Spark so that we get performance comparable to Shark-mem?

Thank you very much,
-chen
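
For concreteness, the kind of job described above might look roughly like the sketch below. It is an illustration only, not the exact code behind the timings quoted here. It is written in PySpark-style Python so the parsing logic can be checked without a cluster; the RDD calls it uses (textFile, map, persist, reduceByKey, saveAsTextFile) have the same names in the Scala spark-shell. The comma-separated layout with sourceIP in column 0 and adRevenue in column 3, the helper names, and the input and output paths are all assumptions made for the example.

```python
from collections import defaultdict

X = 8  # prefix length used in the Aggregation Query


def parse_visit(line, x=X):
    """Turn one uservisits row into (sourceIP prefix, adRevenue).

    Assumes comma-separated rows with sourceIP in column 0 and
    adRevenue in column 3 (an assumption about the file layout).
    """
    fields = line.split(",")
    return fields[0][:x], float(fields[3])


def aggregate_locally(lines, x=X):
    """Plain-Python stand-in for map(parse_visit) followed by reduceByKey(add)."""
    totals = defaultdict(float)
    for line in lines:
        key, revenue = parse_visit(line, x)
        totals[key] += revenue
    return dict(totals)


def run_on_spark(sc, in_path, out_path, x=X):
    """The same aggregation as a Spark job; `sc` is the shell's SparkContext."""
    from operator import add
    # Imported here so the helpers above can run without Spark installed.
    from pyspark import StorageLevel

    pairs = sc.textFile(in_path).map(lambda line: parse_visit(line, x))
    # Persist only the narrow (prefix, revenue) pairs instead of the full rows,
    # and allow partitions that do not fit in memory to spill to disk
    # rather than being dropped and recomputed.
    pairs.persist(StorageLevel.MEMORY_AND_DISK)
    pairs.reduceByKey(add).saveAsTextFile(out_path)
```

The local helper can be tried on a few hand-written rows to check the parsing before running anything on the cluster. Persisting the projected pairs rather than the whole uservisits RDD is one possible way to keep the cached data small; whether it closes the gap with Shark-mem is exactly the open question.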