Hey Matei,

Thanks for your reply. We will keep in mind to use the JDBC server for shorter queries.
Regarding the MapReduce job start-up, are you referring to JVM initialization latency in MR? Other than JVM initialization, does Spark do any optimization (that is not done by MapReduce) to speed up the start-up?

--
Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 2:08 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> It's hard to tell without more details, but the start-up latency in Hive
> can sometimes be high, especially if you are running Hive on MapReduce. MR
> just takes 20-30 seconds per job to spin up even if the job is doing
> nothing.
>
> For real use of Spark SQL for short queries by the way, I'd recommend
> using the JDBC server so that you can have a long-running Spark process. It
> gets quite a bit faster after the first few queries.
>
> Matei
>
> > On Jan 22, 2015, at 10:22 PM, Saumitra Shahapure (Vizury) <saumitra.shahap...@vizury.com> wrote:
> >
> > Hello,
> >
> > We were comparing performance of some of our production hive queries
> > between Hive and Spark. We compared Hive(0.13)+hadoop (1.2.1) against both
> > Spark 0.9 and 1.1. We could see that the performance gains have been good
> > in Spark.
> >
> > We tried a very simple query,
> > select count(*) from T where col3=123
> > in both sparkSQL and Hive (with hive.map.aggr=true) and found that Spark
> > performance had been 2x better than Hive (120sec vs 60sec). Table T is
> > stored in S3 and contains 600MB single GZIP file.
> >
> > My question is, why Spark is faster than Hive here? In both of the cases,
> > the file will be downloaded, uncompressed and lines will be counted by a
> > single process. For Hive case, reducer will be identity function
> > since hive.map.aggr is true.
> >
> > Note that disk spills and network I/O are very less for Hive's case as well,
> > --
> > Regards,
> > Saumitra Shahapure