Re: Optimize Hive Query

Jörn Franke Thu, 23 Jun 2016 09:51:06 -0700

The query looks a little bit too complex from what it is supposed to do. Can 
you reformulate and restrict the data in a where clause (highest restriction 
first). Another hint would be to use the Orc format (with indexes and 
optionally bloom filters) with snappy compression as well as sorting the data 
on the column you choose to restrict in the where part.


> On 23 Jun 2016, at 02:42, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
> 
> Hi All,
> 
> I am running performance issue with below query. Its took 2-3 hours to 
> complete in hive.
> 
> Try tried to partition and bucketing changes on this tables, but without luck.
> 
> Please help me in optimizing this query. 
> 
> what schema level changes can be done ? 
> other parameters recommendations ?
> 
> 
> Below are complete details :
> 
> Hive Table DDL :
> 
>> CREATE TABLE `tuning_dd_key`(
>>   m_d_key smallint,
>>   sb_gu_key bigint, 
>>   t_ev_st_dt date, 
>>   a_z_key int, 
>>   c_dt date, 
>>   e_p_dt date, 
>>   sq_nbr int);
>   
> Total data size : 
>> 250 GB  
>   
> Long running query :
>   
>> SELECT 
>> sb_gu_key, m_d_key, t_ev_st_dt, 
>> LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  
>> t_ev_st_dt ) AS LAG_START_DT, 
>> a_z_key, 
>> c_dt, 
>> e_p_dt, 
>> sq_nbr, 
>> CASE WHEN LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER 
>> BY  t_ev_st_dt ) IS NULL  OR a_z_key <> LAG( a_z_key , 1 , -999 )  OVER ( 
>> PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt )  THEN 'S'  ELSE 
>> NULL  END AS ST_FLAG 
>> FROM  `PRDDB`.tuning_dd_key ;
>  
> 
> More info :
> 
>> number of distinct value in column m_d_key : 29
>> number of distinct value in column sb_gu_key : 15434343
> 
> 
> 
> 
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339

Re: Optimize Hive Query

Reply via email to