Hi All,

The AMPLab benchmark at https://amplab.cs.berkeley.edu/benchmark/ gives a
nice performance report. I am trying to reproduce the same set of queries
in the spark-shell so that we can better understand Shark and Spark and
their performance on EC2.

For the Aggregation Query with X=8, Shark-disk takes 210 seconds and
Shark-mem takes 111 seconds. However, when I materialize the results to
disk, the spark-shell takes more than 5 minutes (reduceByKey is used in
the shell for the aggregation). Further, if I cache the uservisits RDD,
performance deteriorates considerably, since the dataset is far too big.
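
For concreteness, here is a minimal sketch of what such a spark-shell job
might look like. It is not the exact code from the run above. The paths are
hypothetical placeholders, the helper name toPair is made up for
illustration, and the sketch assumes uservisits is comma-delimited text with
sourceIP as the first field and adRevenue as the fourth, so that the job
sums adRevenue per sourceIP prefix of length X.

```scala
// Paste into spark-shell, where `sc` (the SparkContext) is already defined.
import org.apache.spark.storage.StorageLevel

// Hypothetical locations; substitute the real input and output paths.
val inputPath  = "hdfs:///path/to/uservisits"
val outputPath = "hdfs:///path/to/aggregation-output"
val x = 8  // prefix length, the X of the Aggregation Query

// Assumed record layout: sourceIP,destURL,visitDate,adRevenue,...
// Returns (first prefixLen characters of sourceIP, adRevenue).
def toPair(line: String, prefixLen: Int): (String, Double) = {
  val fields = line.split(",", 5)  // only the first four fields are needed
  (fields(0).take(prefixLen), fields(3).toDouble)
}

// Project down to the two columns the query uses before any shuffle.
val pairs = sc.textFile(inputPath).map(line => toPair(line, x))

// Optional: persist the small projected pairs rather than the full
// uservisits records. The serialized level that spills to disk is an
// assumption about what might suit a dataset larger than memory.
val cachePairs = false
if (cachePairs) pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)

// reduceByKey combines values map-side before shuffling.
val revenue = pairs.reduceByKey(_ + _)

// Materialize the result to disk as text.
revenue.map { case (prefix, total) => prefix + "," + total }
  .saveAsTextFile(outputPath)
```

Persisting only the projected (prefix, adRevenue) pairs keeps the cached
data much smaller than the full uservisits RDD. Whether that closes the gap
with Shark-mem is untested here.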

Can anybody shed some light on why there is a more than 2x difference
between Shark-disk and spark-shell-disk, and on how to cache data in Spark
correctly so that we can achieve performance comparable to Shark-mem?

Thank you very much,

-chen
