Re: Hive on Spark performance

sjayatheertha Mon, 14 Mar 2016 09:07:50 -0700

Thanks for your response. We were evaluating Spark and were curious to know how 
it is used today and the lowest latency it can provide.


> On Mar 14, 2016, at 8:37 AM, Mich Talebzadeh <[email protected]> 
> wrote:
> 
> Hi Wlodeck,
> 
> Let us look at this.
> 
> In Oracle I have two tables channels and sales. This code works in Oracle
> 
>   1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s 
> WHERE c.channel_id = s.channel_id)) As R
>   2  from channels c
>   3* group by c.channel_id
> [email protected]> /
> CHANNEL_ID          R
> ---------- ----------
>          2     516050
>          3    1620984
>          4     473664
>          5          0
>          9      18666
> 
> I have the same tables In Hive but the same query crashes!
> 
> hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s 
> WHERE c.channel_id = s.channel_id)) As R
>     > from channels c
>     > group by c.channel_id
>     > ;
> NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN 
> precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple | 
> precedenceEqualExpressionSingle );])
> 
> The solution is to use a temporary table to keep the sum/group by from sales 
> table as an intermediate stage  (temporary tables are session specific and 
> they are created and dropped after you finish the session)
> 
> hive> create temporary table tmp as select channel_id, count(channel_id) as 
> total from sales group by channel_id;
> 
> 
> Ok the rest is pretty easy
> 
> hive> select c.channel_id, c.channel_id * t.total as results
>     > from channels c, tmp t
>     > where c.channel_id = t.channel_id;
> 
> 2.0     2800432.0
> 3.0     8802300.0
> 4.0     2583552.0
> 9.0     104013.0
> 
> HTH
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 14 March 2016 at 14:22, ws <[email protected]> wrote:
>> Hive 1.2.1.2.3.4.0-3485
>> Spark 1.5.2
>> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
>> 
>> ### 
>> SELECT 
>>      f.description,
>>      f.item_number,
>>      sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = 
>> r.h_id)) as df_a
>> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
>> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND 
>> f.item_number NOT LIKE 'DEV%'
>> group by 
>>      f.description,
>>      f.item_number
>> ###
>> 
>> This query works fine in oracle but not Hive or Spark.
>> So the problem is: "sum(f.df_a * (select count(1) from e.mv_A_h_a where 
>> hb_h_name = r.h_id)) as df_a" field.
>> 
>> 
>> Thanks,
>> Wlodek
>> --
>> 
>> 
>> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh 
>> <[email protected]> wrote:
>> 
>> 
>> Depending on the version of Hive on Spark engine.
>> 
>> As far as I am aware the latest version of Hive that I am using (Hive 2) has 
>> improvements compared to the previous versions of Hive (0.14,1.2.1) on Spark 
>> engine.
>> 
>> As of today I have managed to use Hive 2.0 on Spark version 1.3.1. So it is 
>> not the latest Spark but it is pretty good.
>> 
>> What specific concerns do you have in mind?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 13 March 2016 at 23:27, sjayatheertha <[email protected]> wrote:
>> Just curious if you could share your experience on the performance of spark 
>> in your company? How much data do you process? And what's the latency you 
>> are getting with spark engine?
>> 
>> Vidya
>

Re: Hive on Spark performance

Reply via email to