Hi guys
I am using CDH 5.3.3, which comes with Hive 0.13.1 and Spark 1.2.
So, to answer your question: it's not Tez (which I believe comes with Hortonworks).
The Hive query was run with Hive defaults.
I used additional Hive params right now to improve the timings:
SET mapreduce.job.reduces=16;
SET mapreduce.tasktracker.map.tasks.maximum=24;
SET mapreduce.tasktracker.reduce.tasks.maximum=16;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress=true;
Now: Time taken: 140.139 seconds, Fetched: 29597 row(s)
(Surprisingly close to spark-sql now LOL. Time to tweak spark-sql now.)
EARLIER RESULTS
Hive – 326.021 seconds, Fetched: 29597 row(s)
Impala – Fetched 27625 row(s) in 17.02s
spark-sql – Time taken: 120.236 seconds
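
For a rough side-by-side, the timings quoted in this thread can be turned into speedup ratios against the default Hive run. This is only a small sketch of the arithmetic: the labels and the helper name speedup_vs are made up for illustration, and the ratios say nothing about why the engines differ. Note also that Impala fetched 27625 rows while Hive fetched 29597, so the runs may not be strictly comparable.

```python
# Timings in seconds, copied from the figures quoted in this thread.
# The labels are illustrative; "Hive (tuned)" is the run with the extra SET params.
timings = {
    "Hive (defaults)": 326.021,
    "Hive (tuned)": 140.139,
    "spark-sql": 120.236,
    "Impala": 17.02,
}


def speedup_vs(baseline, timings):
    """Return how many times faster each run is than the baseline run."""
    base = timings[baseline]
    return {name: base / secs for name, secs in timings.items()}


ratios = speedup_vs("Hive (defaults)", timings)

# Print slowest to fastest, with the raw time next to the ratio.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {timings[name]:8.3f}s  {ratio:5.2f}x")
```

Changing the baseline argument (for example to "spark-sql") gives the same comparison relative to a different run.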
I don't have the bandwidth to manage individual components on the cluster :-)
since I am doing all this solo and delivering ML solutions to production LOL. So
I depend on a distribution such as CDH. The downside is that one is always a couple
of versions behind.
Thanks for your questions.
regards
sanjay
From: Michael Armbrust mich...@databricks.com
To: user user@spark.apache.org
Sent: Thursday, June 18, 2015 3:25 PM
Subject: Re: Spark-sql versus Impala versus Hive
I would also love to see a more recent version of Spark SQL. There have been a
lot of performance improvements between 1.2 and 1.4 :)
On Thu, Jun 18, 2015 at 3:18 PM, Steve Nunez snu...@hortonworks.com wrote:
Interesting. What were the Hive settings? Specifically, it would be useful to
know if this was Hive on Tez.
- Steve
From: Sanjay Subramanian
Reply-To: Sanjay Subramanian
Date: Thursday, June 18, 2015 at 11:08
To: user@spark.apache.org
Subject: Spark-sql versus Impala versus Hive
I just published the results of my findings here:
https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/