I'm not answering your question, but could you give me more insight into where 
and how you use Spark? I know that Spark has in-memory capabilities. 

Also, I have a similar question on ways to optimize Hive queries and file 
storage. Which is better, ORC or Parquet, and when should compression be used?

> On Jan 22, 2015, at 3:03 AM, "Saumitra Shahapure (Vizury)" 
> <saumitra.shahap...@vizury.com> wrote:
> 
> Hello,
> 
> We were comparing the performance of some of our production Hive queries between 
> Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 
> and 1.1, and saw good performance gains in Spark.
>  
> We tried a very simple query, 
> select count(*) from T where col3=123 
> in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark's 
> performance was 2x better than Hive's (60 sec vs 120 sec). Table T is stored 
> in S3 as a single 600 MB GZIP file.
> 
> My question is: why is Spark faster than Hive here? In both cases, the 
> file will be downloaded and uncompressed, and the lines counted, by a single 
> process. In the Hive case, the reducer will be an identity function, since 
> hive.map.aggr is true.
> 
> Note that disk spills and network I/O are quite low in Hive's case as well,
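One detail worth noting from the quoted setup: GZIP is not a splittable codec, so a single 600 MB GZIP file forces both engines to scan it with a single task, which matches the "single process" observation above. Here is a minimal toy sketch (plain Python, not Spark or Hive code; the `num_tasks` helper and the 128 MB split size are assumptions for illustration, not values from this thread):

```python
import math

# Toy illustration: how many parallel map tasks can read one input file.
# A GZIP file is not splittable, so the entire file is handled by one task.
# A splittable layout (e.g. bzip2, or ORC/Parquet stripes/row groups) can be
# divided into roughly file_size / split_size tasks.
def num_tasks(file_size_mb, splittable, split_size_mb=128):
    # 128 MB is an assumed HDFS-style split size, not a value from the thread.
    if not splittable:
        return 1
    return math.ceil(file_size_mb / split_size_mb)

print(num_tasks(600, splittable=False))  # single 600 MB GZIP file -> 1 task
print(num_tasks(600, splittable=True))   # splittable 600 MB input -> 5 tasks
```

So with this file layout, neither engine can parallelize the scan itself; any speed difference would have to come from elsewhere (e.g. scheduling or startup overhead).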