Hey Matei,

Thanks for your reply. We will keep in mind to use the JDBC server for
smaller queries.

Regarding MapReduce job start-up, are you referring to JVM initialization
latency in MR? Other than JVM initialization, does Spark do any
optimization (that MapReduce does not) to speed up start-up?
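
To make the start-up overhead concrete, here is a rough back-of-the-envelope
sketch of why a long-running process helps with short queries. The function
name and all numbers are illustrative assumptions, not measurements: 25 s is
just the middle of the 20-30 s range quoted below for MR, and the 5 s of
work per query is hypothetical.

```python
def total_seconds(num_queries, startup_s, work_s, long_running):
    """Total wall-clock time for a batch of short queries.

    long_running=False: every query pays the start-up cost (one launch per query).
    long_running=True:  the start-up cost is paid once and shared by all queries.
    """
    startups = 1 if long_running and num_queries > 0 else num_queries
    return startups * startup_s + num_queries * work_s


# Illustrative numbers only (assumed, not measured).
per_job = total_seconds(10, startup_s=25, work_s=5, long_running=False)
shared = total_seconds(10, startup_s=25, work_s=5, long_running=True)
print(per_job, shared)
```

Under these assumed numbers the fixed start-up cost dominates when it is
paid per query, which is why I would like to understand what exactly makes
up that cost.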

--
Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 2:08 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> It's hard to tell without more details, but the start-up latency in Hive
> can sometimes be high, especially if you are running Hive on MapReduce. MR
> just takes 20-30 seconds per job to spin up even if the job is doing
> nothing.
>
> For real use of Spark SQL for short queries by the way, I'd recommend
> using the JDBC server so that you can have a long-running Spark process. It
> gets quite a bit faster after the first few queries.
>
> Matei
>
> > On Jan 22, 2015, at 10:22 PM, Saumitra Shahapure (Vizury) <
> > saumitra.shahap...@vizury.com> wrote:
> >
> > Hello,
> >
> > We were comparing the performance of some of our production Hive queries
> > between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against
> > both Spark 0.9 and 1.1. We could see that the performance gains have
> > been good in Spark.
> >
> > We tried a very simple query,
> > select count(*) from T where col3=123
> > in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
> > was 2x faster than Hive (60 sec for Spark vs. 120 sec for Hive). Table T
> > is stored in S3 and consists of a single 600 MB GZIP file.
> >
> > My question is, why is Spark faster than Hive here? In both cases, the
> > file will be downloaded, uncompressed, and its lines counted by a
> > single process. In the Hive case, the reducer will be an identity
> > function since hive.map.aggr is true.
> >
> > Note that disk spills and network I/O are very low in Hive's case as
> > well.
> > --
> > Regards,
> > Saumitra Shahapure
>
>
