Re: Spark performance gains for small queries
Hey Matei,

Thanks for your reply. We will keep in mind to use the JDBC server for smaller queries.

Regarding the MapReduce job start-up, are you referring to JVM initialization latencies in MR? Other than JVM initialization, does Spark do any optimization (that MapReduce does not) to speed up start-up?

--
Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 2:08 PM, Matei Zaharia wrote:

> It's hard to tell without more details, but the start-up latency in Hive
> can sometimes be high, especially if you are running Hive on MapReduce. MR
> just takes 20-30 seconds per job to spin up even if the job is doing
> nothing.
>
> For real use of Spark SQL for short queries, by the way, I'd recommend
> using the JDBC server so that you can have a long-running Spark process. It
> gets quite a bit faster after the first few queries.
>
> Matei
Re: Spark performance gains for small queries
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing.

For real use of Spark SQL for short queries, by the way, I'd recommend using the JDBC server so that you can have a long-running Spark process. It gets quite a bit faster after the first few queries.

Matei

> On Jan 22, 2015, at 10:22 PM, Saumitra Shahapure (Vizury) wrote:
>
> Hello,
>
> We were comparing the performance of some of our production Hive queries
> between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against
> both Spark 0.9 and 1.1, and saw good performance gains with Spark.
>
> We tried a very simple query,
>
> select count(*) from T where col3=123
>
> in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
> performed 2x better than Hive (120 sec for Hive vs 60 sec for Spark).
> Table T is stored in S3 and consists of a single 600MB GZIP file.
>
> My question is: why is Spark faster than Hive here? In both cases, the
> file will be downloaded and uncompressed, and the lines will be counted
> by a single process. In the Hive case, the reducer will be an identity
> function since hive.map.aggr is true.
>
> Note that disk spills and network I/O are very low in Hive's case as well.
>
> --
> Regards,
> Saumitra Shahapure
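
As a rough back-of-the-envelope check on the start-up figure in the message above, the sketch below estimates how much of the reported timings a fixed per-job spin-up cost could account for. The 20-30 second range and the 120 sec / 60 sec totals come from the thread; the function name `startup_share` is hypothetical, and the calculation assumes the Hive query runs as a single MapReduce job, which the thread does not confirm.

```python
def startup_share(total_sec, startup_sec):
    """Return the fraction of total wall-clock time spent on fixed job start-up."""
    if total_sec <= 0:
        raise ValueError("total_sec must be positive")
    return startup_sec / total_sec


# Timings reported in the original question (seconds).
hive_total = 120.0
spark_total = 60.0
gap = hive_total - spark_total

# Start-up range quoted for MapReduce; one job per query is an assumption.
for startup in (20.0, 30.0):
    of_run = startup_share(hive_total, startup)
    of_gap = startup_share(gap, startup)
    print(f"start-up {startup:.0f}s: {of_run:.0%} of the Hive run, {of_gap:.0%} of the gap")
```

Under these assumptions, start-up alone would explain roughly a third to a half of the 60-second difference, so it is a plausible contributor but probably not the whole story.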
Spark performance gains for small queries
Hello,

We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1, and saw good performance gains with Spark.

We tried a very simple query,

select count(*) from T where col3=123

in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark performed 2x better than Hive (120 sec for Hive vs 60 sec for Spark). Table T is stored in S3 and consists of a single 600MB GZIP file.

My question is: why is Spark faster than Hive here? In both cases, the file will be downloaded and uncompressed, and the lines will be counted by a single process. In the Hive case, the reducer will be an identity function since hive.map.aggr is true.

Note that disk spills and network I/O are very low in Hive's case as well.

--
Regards,
Saumitra Shahapure
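
To make the single-process point concrete, here is a minimal Python sketch of the work the query implies: stream one gzip file, split each row, and count the rows whose third column equals 123. It is only an illustration and not how Hive or Spark actually execute the query. The function name `count_matching_rows`, the tab delimiter, and the assumption that `col3` is the third field are all hypothetical, since the thread does not describe the table's layout. A gzip stream cannot be split at arbitrary offsets, which is why one reader ends up processing the whole file.

```python
import gzip
import io


def count_matching_rows(stream, col_index=2, value="123", delimiter="\t"):
    """Count rows in a gzip-compressed text stream whose column equals value."""
    count = 0
    # gzip.open accepts an existing binary file object as well as a path.
    with gzip.open(stream, mode="rt", encoding="utf-8") as rows:
        for line in rows:
            fields = line.rstrip("\n").split(delimiter)
            # Skip short rows instead of failing on them.
            if len(fields) > col_index and fields[col_index] == value:
                count += 1
    return count


# Tiny in-memory stand-in for the 600MB file stored in S3.
sample = "a\tb\t123\nc\td\t456\ne\tf\t123\n"
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
print(count_matching_rows(compressed))  # prints 2
```

Because the rows are streamed and only a counter is kept, memory use stays flat regardless of file size, and the run time is dominated by download and decompression rather than by the aggregation itself.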