Re: Using Cluster by to improve Group by Performance

2013-10-31 Thread Navis류승우
>From perspective of RS, two query is just different in hash code of RS key.
The cost of calculating hash of col3 and col4 might be negligible, I think.

2013/11/1 KayVajj 

> Any response or pointers to understand how Cluster By in sub queries can
> affect the performance/speed of outer queries is helpful.
>
> Thanks
> Kay
>
>
> On Mon, Oct 28, 2013 at 1:17 PM, KayVajj  wrote:
>
>> Hi,
>>
>> I have a question if I could use the cluster by clause in a sub query to
>> improve the performance of a group by query in hive
>>
>> Lets I have a Table A with columns (all strings) col1..col5 and the table
>> is not "Clustered"
>>
>> now I 'm trying to run the below query
>>
>> select
>>> col1,
>>> col2,
>>> col3,
>>> col4,
>>> concat_ws(',', collect_set(col5))
>>> from A
>>> group by
>>> col1,
>>> col2,
>>> col3,
>>> col4
>>
>>
>>
>> Would the below query optimize the above query and if not what is the
>> best practice to optimize this query. Assuming only col1 & col2 are the
>> uniquely identifying columns
>>
>>
>>
>>
>> select
>>> ct.col1,
>>> ct.col2,
>>> ct.col3,
>>> ct.col4,
>>> concat_ws(',', collect_set(ct.col5))
>>> from
>>> (
>>> select
>>> col1,
>>> col2,
>>> col3,
>>> col4,
>>> col5
>>> from A
>>> cluster by col1, col2
>>> ) ct
>>> group by
>>> ct.col1,
>>> ct.col2,
>>> ct.col3,
>>> ct.col4.
>>
>>
>> Thanks for your responses.
>>
>>
>


Re: Using Cluster by to improve Group by Performance

2013-10-31 Thread KayVajj
Any response or pointers to understand how Cluster By in sub queries can
affect the performance/speed of outer queries is helpful.

Thanks
Kay


On Mon, Oct 28, 2013 at 1:17 PM, KayVajj  wrote:

> Hi,
>
> I have a question if I could use the cluster by clause in a sub query to
> improve the performance of a group by query in hive
>
> Lets I have a Table A with columns (all strings) col1..col5 and the table
> is not "Clustered"
>
> now I 'm trying to run the below query
>
> select
>> col1,
>> col2,
>> col3,
>> col4,
>> concat_ws(',', collect_set(col5))
>> from A
>> group by
>> col1,
>> col2,
>> col3,
>> col4
>
>
>
> Would the below query optimize the above query and if not what is the best
> practice to optimize this query. Assuming only col1 & col2 are the uniquely
> identifying columns
>
>
>
>
> select
>> ct.col1,
>> ct.col2,
>> ct.col3,
>> ct.col4,
>> concat_ws(',', collect_set(ct.col5))
>> from
>> (
>> select
>> col1,
>> col2,
>> col3,
>> col4,
>> col5
>> from A
>> cluster by col1, col2
>> ) ct
>> group by
>> ct.col1,
>> ct.col2,
>> ct.col3,
>> ct.col4.
>
>
> Thanks for your responses.
>
>


Using Cluster by to improve Group by Performance

2013-10-28 Thread KayVajj
Hi,

I have a question if I could use the cluster by clause in a sub query to
improve the performance of a group by query in hive

Lets I have a Table A with columns (all strings) col1..col5 and the table
is not "Clustered"

now I 'm trying to run the below query

select
> col1,
> col2,
> col3,
> col4,
> concat_ws(',', collect_set(col5))
> from A
> group by
> col1,
> col2,
> col3,
> col4



Would the below query optimize the above query and if not what is the best
practice to optimize this query. Assuming only col1 & col2 are the uniquely
identifying columns




select
> ct.col1,
> ct.col2,
> ct.col3,
> ct.col4,
> concat_ws(',', collect_set(ct.col5))
> from
> (
> select
> col1,
> col2,
> col3,
> col4,
> col5
> from A
> cluster by col1, col2
> ) ct
> group by
> ct.col1,
> ct.col2,
> ct.col3,
> ct.col4.


Thanks for your responses.