The biggest difference I see is that Shark stores data in a column-oriented form, à la C-Store and Vertica, whereas Spark keeps things in row-oriented form. Chris pointed this out in the RDD[TablePartition] vs RDD[Array[String]] comparison.
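To make that concrete, here is a toy illustration in plain Scala (no Spark, made-up data) of why a column aggregate over the row-oriented RDD[Array[String]] layout costs more than the same aggregate over a columnar, typed layout like Shark's TablePartition:

```scala
// Row-oriented: each record is an Array[String], as in RDD[Array[String]].
// Summing one column still touches every row object and parses strings.
val rows: Seq[Array[String]] = Seq(
  Array("alice", "10", "US"),
  Array("bob",   "20", "UK"),
  Array("carol", "30", "US")
)
val rowSum: Long = rows.map(r => r(1).toLong).sum

// Column-oriented: one primitive array per column, roughly what
// TablePartition keeps in memory. The same aggregate scans a single
// contiguous, already-typed array -- no per-row objects, no parsing.
val values: Array[Long] = Array(10L, 20L, 30L)
val colSum: Long = values.sum

assert(rowSum == colSum) // same answer, very different memory traffic
println(colSum)
```

Both paths return 60 here; the point is only the difference in what gets scanned and deserialized, which is where I'd guess the count(*) gap comes from.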
I'd be interested in hearing how TablePartition compares to the Parquet format, which has been getting a lot of attention recently: https://github.com/Parquet/parquet-format

As far as performance goes, I remember once being surprised that a Shark row-counting query completed much faster than the equivalent Spark one, even after I had both datasets sitting in memory. This was a "select count(*) from TABLE" on a cached table in Shark vs. "val rdd = sc.textFile(...).cache; rdd.count" in Spark. I attributed it to the column-oriented format at the time but didn't dig any deeper.

On Wed, Jan 29, 2014 at 11:22 PM, Christopher Nguyen <c...@adatao.com> wrote:
> Hi Chen, it's certainly correct to say it is hard to make an
> apples-to-apples comparison, in terms of being able to assume that there
> is an implementation equivalent for any given Shark query in "Spark only".
>
> That said, I think the results of your comparisons could still be a
> valuable reference. There are scenarios where perhaps someone wants to
> consider the trade-offs between implementing some ETL operation with
> Shark or with only Spark. Some sense of the performance/cost difference
> would be helpful in making that decision.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
>
> On Wed, Jan 29, 2014 at 11:10 PM, Chen Jin <karen...@gmail.com> wrote:
>
>> Hi Christopher,
>>
>> Thanks a lot for taking the time to explain some details under Shark's
>> hood. It is probably very hard to make an apples-to-apples comparison
>> between Shark and Spark since they might be suitable for different
>> types of tasks. From what you have explained, is it OK to think Shark
>> is better off for SQL-like tasks, while Spark is more for iterative
>> machine learning algorithms?
>>
>> Cheers,
>>
>> -chen
>>
>> On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen <c...@adatao.com>
>> wrote:
>> > Chen, interesting comparisons you're trying to make.
>> > It would be great to share this somewhere when you're done.
>> >
>> > Some suggestions of non-obvious things to consider:
>> >
>> > In general there are any number of differences between Shark and some
>> > "equivalent" Spark implementation of the same query.
>> >
>> > Shark isn't necessarily what we may think of as "let's see which lines
>> > of code accomplish the same thing in Spark". Its current implementation
>> > is based on Hive, which has its own query planning, optimization, and
>> > execution. Shark's code has some of its own tricks. You can use
>> > "EXPLAIN" to see Shark's execution plan and compare it to your Spark
>> > approach.
>> >
>> > Further, Shark has its own in-memory storage format, e.g. the
>> > typed-column-oriented RDD[TablePartition], which can make it more
>> > memory-efficient and help execute many column-aggregation queries a
>> > lot faster than the row-oriented RDD[Array[String]] you may be using.
>> >
>> > In short, Shark does a number of things that are smarter and more
>> > optimized for SQL queries than a straightforward Spark RDD
>> > implementation of the same.
>> > --
>> > Christopher T. Nguyen
>> > Co-founder & CEO, Adatao
>> > linkedin.com/in/ctnguyen
>> >
>> >
>> >
>> > On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <karen...@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> https://amplab.cs.berkeley.edu/benchmark/ has given a nice benchmark
>> >> report. I am trying to reproduce the same set of queries in the
>> >> spark-shell so that we can understand more about Shark and Spark and
>> >> their performance on EC2.
>> >>
>> >> As for the Aggregation Query with X=8, Shark-disk takes 210 seconds
>> >> and Shark-mem takes 111 seconds. However, when I materialize the
>> >> results to disk, spark-shell takes more than 5 minutes (reduceByKey
>> >> is used in the shell for aggregation). Further, if I cache the
>> >> uservisits RDD, since the dataset is way too big, the performance
>> >> deteriorates quite a lot.
>> >>
>> >> Can anybody shed some light on why there is a more-than-2x difference
>> >> between Shark-disk and spark-shell-disk, and on how to cache data in
>> >> Spark correctly such that we can achieve performance comparable to
>> >> Shark-mem?
>> >>
>> >> Thank you very much,
>> >>
>> >> -chen
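For readers following along: the Spark-only aggregation Chen describes boils down to a reduceByKey over (key, value) pairs. A plain-Scala sketch of the same semantics (made-up data, no Spark dependency; the key/value names are just placeholders for the benchmark's uservisits columns):

```scala
// Hypothetical (sourceIPPrefix, adRevenue) pairs standing in for rows
// of the uservisits table in the AMPLab benchmark.
val visits: Seq[(String, Double)] = Seq(
  ("a.com", 1.5),
  ("b.com", 2.0),
  ("a.com", 0.5)
)

// rdd.reduceByKey(_ + _) in Spark has the same semantics as this
// local groupBy followed by a per-group sum.
val aggregated: Map[String, Double] =
  visits.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

println(aggregated) // a.com -> 2.0, b.com -> 2.0
```

On the caching question, one thing worth trying when the dataset barely fits in memory is persist(StorageLevel.MEMORY_ONLY_SER) instead of plain cache(): serialized storage trades some CPU for a much smaller footprint, which can avoid the thrashing Chen saw, though it still won't match Shark's typed columnar format.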