Re: Using Cluster by to improve Group by Performance
>From perspective of RS, two query is just different in hash code of RS key. The cost of calculating hash of col3 and col4 might be negligible, I think. 2013/11/1 KayVajj > Any response or pointers to understand how Cluster By in sub queries can > affect the performance/speed of outer queries is helpful. > > Thanks > Kay > > > On Mon, Oct 28, 2013 at 1:17 PM, KayVajj wrote: > >> Hi, >> >> I have a question if I could use the cluster by clause in a sub query to >> improve the performance of a group by query in hive >> >> Lets I have a Table A with columns (all strings) col1..col5 and the table >> is not "Clustered" >> >> now I 'm trying to run the below query >> >> select >>> col1, >>> col2, >>> col3, >>> col4, >>> concat_ws(',', collect_set(col5)) >>> from A >>> group by >>> col1, >>> col2, >>> col3, >>> col4 >> >> >> >> Would the below query optimize the above query and if not what is the >> best practice to optimize this query. Assuming only col1 & col2 are the >> uniquely identifying columns >> >> >> >> >> select >>> ct.col1, >>> ct.col2, >>> ct.col3, >>> ct.col4, >>> concat_ws(',', collect_set(ct.col5)) >>> from >>> ( >>> select >>> col1, >>> col2, >>> col3, >>> col4, >>> col5 >>> from A >>> cluster by col1, col2 >>> ) ct >>> group by >>> ct.col1, >>> ct.col2, >>> ct.col3, >>> ct.col4. >> >> >> Thanks for your responses. >> >> >
Re: Using Cluster by to improve Group by Performance
Any response or pointers to understand how Cluster By in sub queries can affect the performance/speed of outer queries is helpful. Thanks Kay On Mon, Oct 28, 2013 at 1:17 PM, KayVajj wrote: > Hi, > > I have a question if I could use the cluster by clause in a sub query to > improve the performance of a group by query in hive > > Lets I have a Table A with columns (all strings) col1..col5 and the table > is not "Clustered" > > now I 'm trying to run the below query > > select >> col1, >> col2, >> col3, >> col4, >> concat_ws(',', collect_set(col5)) >> from A >> group by >> col1, >> col2, >> col3, >> col4 > > > > Would the below query optimize the above query and if not what is the best > practice to optimize this query. Assuming only col1 & col2 are the uniquely > identifying columns > > > > > select >> ct.col1, >> ct.col2, >> ct.col3, >> ct.col4, >> concat_ws(',', collect_set(ct.col5)) >> from >> ( >> select >> col1, >> col2, >> col3, >> col4, >> col5 >> from A >> cluster by col1, col2 >> ) ct >> group by >> ct.col1, >> ct.col2, >> ct.col3, >> ct.col4. > > > Thanks for your responses. > >
Using Cluster by to improve Group by Performance
Hi, I have a question if I could use the cluster by clause in a sub query to improve the performance of a group by query in hive Lets I have a Table A with columns (all strings) col1..col5 and the table is not "Clustered" now I 'm trying to run the below query select > col1, > col2, > col3, > col4, > concat_ws(',', collect_set(col5)) > from A > group by > col1, > col2, > col3, > col4 Would the below query optimize the above query and if not what is the best practice to optimize this query. Assuming only col1 & col2 are the uniquely identifying columns select > ct.col1, > ct.col2, > ct.col3, > ct.col4, > concat_ws(',', collect_set(ct.col5)) > from > ( > select > col1, > col2, > col3, > col4, > col5 > from A > cluster by col1, col2 > ) ct > group by > ct.col1, > ct.col2, > ct.col3, > ct.col4. Thanks for your responses.