This doesn't answer your question, but could you give me more insight into where and how you use Spark? I know that Spark has in-memory capabilities.
Also, I have a similar question about ways to optimize Hive queries and file storage: which is better, ORC or Parquet, and when should compression be used?

> On Jan 22, 2015, at 3:03 AM, "Saumitra Shahapure (Vizury)"
> <saumitra.shahap...@vizury.com> wrote:
>
> Hello,
>
> We were comparing performance of some of our production hive queries between
> Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both Spark 0.9
> and 1.1. We could see that the performance gains have been good in Spark.
>
> We tried a very simple query,
> select count(*) from T where col3=123
> in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
> performance had been 2x better than Hive (120sec vs 60sec). Table T is stored
> in S3 and contains 600MB single GZIP file.
>
> My question is, why Spark is faster than Hive here? In both of the cases, the
> file will be downloaded, uncompressed and lines will be counted by a single
> process. For Hive case, reducer will be identity function since hive.map.aggr
> is true.
>
> Note that disk spills and network I/O are very less for Hive's case as well,