[ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669165#action_12669165
 ] 

Joydeep Sen Sarma commented on HIVE-223:
----------------------------------------

on very high cardinality dimensions - map-side aggregations do not work or 
require too much memory to make work. as we are also painfully learning tonight 
- lots of mappers run concurrently and using a lot of memory is a big problem. 
so we definitely need sort based techniques for such scenarios and that option 
needs to be available. however - in the same scenario - the risk of skew is 
very low - so a single map-reduce task is sufficient. we don't support that 
currently - but we should. So for simple group by's - please support:

1a. map-side aggregations with 1 map-reduce
1b. sort based aggregations with 1 map-reduce

i think for the distinct case - we can simplify/refine further. if the 
grouping+distinct keyset is low cardinality - option 1a above suffices. If the 
grouping+distinct keyset is high cardinality - then there are 3 sub-cases:
a) skew on none of neither distinct nor grouping keys : in this case map-side 
aggregates don't work and we need 1b again (distribute by grouping, sort by 
grouping,distincts)
b) skew on either one of distinct or grouping keys. in this case - data can be 
sprayed on whatever doesn't have skew. this will require 2 map-reduce if we 
have to spray on distinct column or 1 map-reduce if we can spray on grouping 
columns
c) skew on both the distinct and grouping keys. in this case map-side 
aggregation (1a) should work again (since substantial reduction can happen when 
the popular distinct keys overlap with popular grouping keys).
so in addition to 1a and 1b above it seems we should, for distincts, offer:

2a - distribute by grouping columns, sort by grouping, distincts with 1 
map-reduce
2b - distribute and sort by distinct columns, hash aggregation in reduce side, 
- 2 map-reduce

between these 4 (1a/b, 2c/d) options - i think all cases are covered. sort 
based group-by's for non-distinct case would probably benefit from the use of a 
combiner (longer term).


> when using map-side aggregates - perform single map-reduce group-by
> -------------------------------------------------------------------
>
>                 Key: HIVE-223
>                 URL: https://issues.apache.org/jira/browse/HIVE-223
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Namit Jain
>
> today even when we do map side aggregates - we do multiple map-reduce jobs. 
> however - the reason for doing multiple map-reduce group-bys (for single 
> group-bys) was the fear of skews. When we are doing map side aggregates - 
> skews should not exist for the most part. There can be two reason for skews:
> - large number of entries for a single grouping set - map side aggregates 
> should take care of this
> - badness in hash function that sends too much stuff to one reducer - we 
> should be able to take care of this by having good hash functions (and prime 
> number reducer counts)
> So i think we should be able to do a single stage map-reduce when doing 
> map-side aggregates.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to