Re: Spark performance gains for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hey Matei,

Thanks for your reply. We will keep in mind to use the JDBC server for smaller
queries.

Regarding MapReduce job start-up, are you referring to JVM initialization
latency in MR? Other than JVM initialization, does Spark do any
optimization (that MapReduce does not) to speed up start-up?

--
Regards,
Saumitra Shahapure

On Fri, Jan 23, 2015 at 2:08 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 It's hard to tell without more details, but the start-up latency in Hive
 can sometimes be high, especially if you are running Hive on MapReduce. MR
 just takes 20-30 seconds per job to spin up even if the job is doing
 nothing.

 For real use of Spark SQL for short queries by the way, I'd recommend
 using the JDBC server so that you can have a long-running Spark process. It
 gets quite a bit faster after the first few queries.

 Matei
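The reasoning behind a long-running process can be sketched as simple arithmetic. This is only an illustration, not a measurement: the 25 s start-up figure is the midpoint of the 20-30 seconds mentioned above for MapReduce, while the 5 s of work per query, the query count, and the function name total_seconds are hypothetical.

```python
def total_seconds(num_queries, startup_s, work_s, long_running):
    """Total wall time for a batch of queries run one after another.

    If long_running is True, the start-up cost is paid once (e.g. a server
    process that stays up); otherwise it is paid again for every query.
    """
    startups = 1 if long_running else num_queries
    return startups * startup_s + num_queries * work_s


# Hypothetical workload: 10 short queries, 5 s of real work each.
per_job = total_seconds(10, startup_s=25, work_s=5, long_running=False)
shared = total_seconds(10, startup_s=25, work_s=5, long_running=True)
print(per_job, shared)  # prints "300 75"
```

Under these assumed numbers the fixed start-up cost dominates when it is paid per query, which is why short queries benefit most from a process that stays up.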

  On Jan 22, 2015, at 10:22 PM, Saumitra Shahapure (Vizury) 
 saumitra.shahap...@vizury.com wrote:
 
  Hello,
 
  We were comparing performance of some of our production Hive queries
  between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against
  both Spark 0.9 and 1.1. We could see that the performance gains have been
  good in Spark.
 
  We tried a very simple query,
  select count(*) from T where col3=123
  in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
  was 2x faster than Hive (60 sec vs. 120 sec). Table T is stored in S3 and
  consists of a single 600 MB GZIP file.
 
  My question is, why is Spark faster than Hive here? In both cases, the
  file will be downloaded and uncompressed, and the lines will be counted by
  a single process. In the Hive case, the reducer will be an identity
  function since hive.map.aggr is true.
 
  Note that disk spills and network I/O are very low in the Hive case as
  well.
  --
  Regards,
  Saumitra Shahapure




Spark performance gains for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello,

We were comparing performance of some of our production Hive queries
between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both
Spark 0.9 and 1.1. We could see that the performance gains have been good
in Spark.

We tried a very simple query,
select count(*) from T where col3=123
in both Spark SQL and Hive (with hive.map.aggr=true) and found that Spark
was 2x faster than Hive (60 sec vs. 120 sec). Table T is stored in S3 and
consists of a single 600 MB GZIP file.

My question is, why is Spark faster than Hive here? In both cases, the
file will be downloaded and uncompressed, and the lines will be counted by
a single process. In the Hive case, the reducer will be an identity
function since hive.map.aggr is true.

Note that disk spills and network I/O are very low in the Hive case as well.
--
Regards,
Saumitra Shahapure
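
The single-process work described in this message (decompress one GZIP stream and count matching rows) can be sketched in plain Python. A gzip stream generally has to be read from the start, which is why one process ends up scanning the whole file. The tab delimiter, the position of col3 as the third field, the sample rows, and the name count_matching are all assumptions for illustration; the real layout of table T is not given in this thread.

```python
import gzip
import io


def count_matching(stream, col_index, value, delimiter="\t"):
    """Count rows whose field at col_index equals value.

    The stream is decompressed and scanned sequentially, one line at a
    time, so memory use stays small even for a large file.
    """
    count = 0
    with gzip.open(stream, mode="rt", encoding="utf-8") as rows:
        for line in rows:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) > col_index and fields[col_index] == value:
                count += 1
    return count


# Tiny in-memory stand-in for the compressed table (hypothetical rows).
sample = "a\tb\t123\nc\td\t456\ne\tf\t123\n"
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
print(count_matching(compressed, col_index=2, value="123"))  # prints 2
```

Since this per-row work is the same in either engine, a difference in total time would have to come from elsewhere, such as start-up cost or I/O handling.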


Sharing RDDs

2014-04-23 Thread Saumitra Shahapure (Vizury)
Hello,

Is it possible in Spark to reuse cached RDDs generated in an earlier run?

Specifically, I am trying to have a setup where a first Scala script
generates cached RDDs. If another Scala script tries to perform the same
operations on the same dataset, it should be able to get the results from
the cache generated in the earlier run.

Is there any direct/indirect way to do this?

--
Regards,
Saumitra Shahapure
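
One indirect approach is to write the RDD to storage that both runs can reach and read it back later, rather than relying on the in-memory cache, which (as far as I know) belongs to the SparkContext that created it. Below is a minimal sketch in Python, assuming PySpark's RDD.saveAsPickleFile and SparkContext.pickleFile; the Scala counterparts would be saveAsObjectFile and objectFile. The helper name load_or_compute, the path argument, and the compute callback are hypothetical, and the os.path.exists check only works for a local filesystem path.

```python
import os


def load_or_compute(sc, path, compute):
    """Return the RDD stored at path, computing and saving it on first use.

    sc      -- a live SparkContext, created by the caller
    path    -- hypothetical output directory visible to both runs
    compute -- hypothetical function that takes sc and returns the RDD
    """
    if os.path.exists(path):
        # An earlier run already wrote the result: read it back.
        return sc.pickleFile(path)
    rdd = compute(sc)
    # Write the result out so a later, separate run can reuse it.
    rdd.saveAsPickleFile(path)
    return rdd
```

The second script would call the same helper with the same path. It still pays the cost of reading the saved data, but skips recomputing it.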