Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Mich Talebzadeh
Hi, regarding your statement "I have a system with 64 GB RAM and SSD and its performance on local cluster SPARK is way better": is this a host with 64 GB of RAM, with your data stored on local solid-state disks? Can you kindly provide the parameters you pass to spark-submit:

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Gourav Sengupta
Hi, We have a dimension table with a few hundred columns, of which we need only a few to join with the main fact table, which has a few million rows. I do not know how much of a one-off this case sounds, but since I have been working in data warehousing it sounds like a fairly general

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Mich Talebzadeh
Sounds like this is a one-off case. Do you have any other use case where Hive on MR outperforms Spark? I did some tests on a 1-billion-row table, getting the selectivity of a column using Hive on MR, Hive on the Spark engine, and Spark running in local mode (to keep it simple). Hive 2, Spark

Re: HIVE Query 25x faster than SPARK Query

2016-06-16 Thread Jörn Franke
I agree here. However, it always depends on your use case! Best regards
> On 16 Jun 2016, at 04:58, Gourav Sengupta wrote:
>
> Hi Mahender,
>
> please ensure that for dimension tables you are enabling the broadcast
> method. You must be able to see surprising

Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Gourav Sengupta
Hi Mahender, please ensure that for dimension tables you are enabling the broadcast method. You must be able to see surprising gains, around 12x. Overall I think that SPARK cannot figure out whether to scan all the columns in a table or just the ones being used, which causes this issue. When you
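
The idea behind the broadcast method can be sketched in plain Python. This is only an illustration of the technique, not Spark's implementation: the small dimension table is pruned to the columns actually needed, loaded into an in-memory hash map, and the large fact table is streamed past it once. The function name, column names, and toy rows below are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def broadcast_hash_join(
    fact_rows: Iterable[Row],
    dim_rows: Iterable[Row],
    fact_key: str,
    dim_key: str,
    dim_columns: List[str],
) -> Iterator[Row]:
    """Inner-join a large fact table with a small dimension table.

    The dimension side is pruned to `dim_columns` and held in a hash map
    (the "broadcast" copy); the fact side is scanned once, with no sort
    or shuffle.
    """
    # Build side: keep only the columns the query needs, indexed by key.
    lookup: Dict[object, List[Row]] = {}
    for row in dim_rows:
        pruned = {col: row[col] for col in dim_columns}
        lookup.setdefault(row[dim_key], []).append(pruned)

    # Probe side: a single pass over the fact rows.
    for fact in fact_rows:
        for match in lookup.get(fact[fact_key], []):
            yield {**fact, **match}


# Hypothetical toy data: a wide dimension table and a small fact table.
dim = [
    {"pk": 1, "name": "a", "unused_1": "x", "unused_2": "y"},
    {"pk": 2, "name": "b", "unused_1": "x", "unused_2": "y"},
]
fact = [{"fk": 1, "amount": 10}, {"fk": 2, "amount": 20}, {"fk": 3, "amount": 30}]

joined = list(broadcast_hash_join(fact, dim, "fk", "pk", ["name"]))
```

The fact row with `fk` 3 has no match and is dropped, and the unused dimension columns never reach the join output, which is the effect the column-pruning remark above is after.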

Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Mahender Sarangam
+1, we also see performance degradation when comparing Spark SQL with Hive. We have a table of 260 columns. We have executed it in Hive and Spark. In Hive it takes 66 sec for 1 GB of data, whereas in Spark it takes 4 min.
On 6/9/2016 3:19 PM, Gavin Yue wrote: Could you print out the

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
Hi Gavin, for the first time someone is responding to this thread with a meaningful conversation - thanks for that. Okay, I did not tweak the spark.sql.autoBroadcastJoinThreshold parameter, and since the cached field was around 75 MB I do not think that a broadcast join was used. But I
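
A rough sketch of the size check described here, under the assumption that the planner simply compares an estimated table size with the threshold. The default of spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760 bytes) and -1 disables automatic broadcasting; the helper name is hypothetical, and Spark works from its own size estimate, which need not equal the cached size reported above.

```python
# Default of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024


def fits_broadcast_threshold(
    estimated_bytes: int,
    threshold_bytes: int = DEFAULT_AUTO_BROADCAST_THRESHOLD,
) -> bool:
    """Return True if a table of this estimated size could be auto-broadcast.

    A negative threshold (Spark uses -1) disables automatic broadcasting.
    """
    if threshold_bytes < 0:
        return False
    return estimated_bytes <= threshold_bytes


MB = 1024 * 1024

# The ~75 MB cached table from this thread against the default threshold.
with_default = fits_broadcast_threshold(75 * MB)

# The same table if the threshold were raised to a hypothetical 100 MB.
with_raised = fits_broadcast_threshold(75 * MB, threshold_bytes=100 * MB)
```

With the default threshold the 75 MB table does not qualify, which is consistent with the guess that no broadcast join was used; raising the threshold above the table size is one way to change that.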

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gavin Yue
Yes, because in the second query you did a (select PK from A) A. I guess it could be that the subquery makes the result much smaller and triggers a broadcastJoin, so it is much faster. You could use sql.describe() to check the execution plan.
On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta
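
A minimal sketch of checking the plan from PySpark. In the DataFrame API the plan is printed by `explain()`, while `describe()` computes summary statistics. The helper names are hypothetical, `spark` is assumed to be a SparkSession (or an SQLContext on Spark 1.x), and the string check is only a heuristic over the printed physical plan.

```python
def print_query_plan(spark, query: str, extended: bool = True) -> None:
    """Print the plan Spark would use for `query`.

    `spark` is assumed to expose `.sql(query)` returning a DataFrame, as
    SparkSession and SQLContext do. With `extended=True`, the logical
    plans are printed along with the physical plan.
    """
    spark.sql(query).explain(extended)


def uses_broadcast_join(plan_text: str) -> bool:
    """Heuristic: does the printed physical plan mention BroadcastHashJoin?"""
    return "BroadcastHashJoin" in plan_text
```

Running `print_query_plan` on both queries and comparing the join operators (for example a sort-merge join in one and a broadcast hash join in the other) would confirm or rule out the guess above.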

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
Hi, I think if we try to see why Query 2 is faster than Query 1, then all the answers will follow without beating around the bush. That is the right way to find out what is happening and why. Regards, Gourav
On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue wrote:
> Could

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gavin Yue
Could you print out the SQL execution plan? My guess is that this is about broadcast join.
> On Jun 9, 2016, at 07:14, Gourav Sengupta wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and
> is there a way we can optimize the queries

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Mich Talebzadeh
I still fail to see how Hive can be orders of magnitude faster than Spark. Assuming that Hive is using map-reduce, I cannot see a real case for Hive to be faster than Spark, at least under normal operations. Don't get me wrong. I am a fan of Hive. The performance of Hive comes from deploying the

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
Hi Stephen, How can a single gzipped CSV file be partitioned, and who partitions tables based on Primary Key in Hive? If you read the ENVIRONMENT section you will see that all the required details are mentioned. As far as I understand, Hive does work 25x faster (in these

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Stephen Boesch
Out of curiosity, are the tables partitioned on a.pk and b.fk? Hive might be using co-partitioning in that case: it is one of Hive's strengths.
2016-06-09 7:28 GMT-07:00 Gourav Sengupta:
> Hi Mich,
>
> does not Hive use map-reduce? I thought it to be so. And since I am
> running
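
To illustrate what co-partitioning buys, here is a plain-Python sketch, not Hive's implementation. It assumes both tables are hash-partitioned on the join key into the same number of buckets, so each bucket of one table only needs to be joined with the matching bucket of the other. The keys `pk` and `fk` follow the question above; the function names and toy rows are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Split rows into `num_partitions` buckets by the hash of `key`."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def copartitioned_join(a_parts, b_parts, a_key, b_key):
    """Inner-join two tables bucket by bucket.

    Both inputs must have been partitioned with the same function and
    the same number of buckets, so matching keys share a bucket index.
    """
    if len(a_parts) != len(b_parts):
        raise ValueError("tables are not co-partitioned")
    out = []
    for a_bucket, b_bucket in zip(a_parts, b_parts):
        # Index only this bucket of A; other buckets cannot hold matches.
        index = defaultdict(list)
        for a in a_bucket:
            index[a[a_key]].append(a)
        for b in b_bucket:
            for a in index.get(b[b_key], []):
                out.append({**a, **b})
    return out


# Hypothetical toy tables keyed on a.pk and b.fk.
table_a = [{"pk": i, "val": f"v{i}"} for i in range(6)]
table_b = [{"fk": i % 3, "n": i} for i in range(4)]

rows = copartitioned_join(
    hash_partition(table_a, "pk", 3),
    hash_partition(table_b, "fk", 3),
    "pk",
    "fk",
)
```

Because each bucket pair is independent, no rows have to be moved between buckets at join time, which is where the saving would come from if the tables were laid out this way.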

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
Hi Mich, doesn't Hive use map-reduce? I thought so. And since I am running the queries in EMR 4.6, HIVE is not using TEZ. Regards, Gourav
On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh wrote:
> are you using map-reduce with Hive?
>
> Dr Mich

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Mich Talebzadeh
Are you using map-reduce with Hive?
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 9 June 2016 at

HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
Hi, Query1 is almost 25x faster in HIVE than in SPARK. What is happening here, and is there a way we can optimize the queries in SPARK without the obvious hack in Query2?
--- ENVIRONMENT: ---
> Table A has 533 columns x 24 million rows and Table B has 2