[
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669165#action_12669165
]
Joydeep Sen Sarma commented on HIVE-223:
----------------------------------------
on very high cardinality dimensions - map-side aggregations do not work or
require too much memory to make work. as we are also painfully learning tonight
- lots of mappers run concurrently and using a lot of memory is a big problem.
so we definitely need sort based techniques for such scenarios and that option
needs to be available. however - in the same scenario - the risk of skew is
very low - so a single map-reduce task is sufficient. we don't support that
currently - but we should. So for simple group by's - please support:
1a. map-side aggregations with 1 map-reduce
1b. sort based aggregations with 1 map-reduce
i think for the distinct case - we can simplify/refine further. if the
grouping+distinct keyset is low cardinality - option 1a above suffices. If the
grouping+distinct keyset is high cardinality - then there are 3 sub-cases:
a) skew on none of neither distinct nor grouping keys : in this case map-side
aggregates don't work and we need 1b again (distribute by grouping, sort by
grouping,distincts)
b) skew on either one of distinct or grouping keys. in this case - data can be
sprayed on whatever doesn't have skew. this will require 2 map-reduce if we
have to spray on distinct column or 1 map-reduce if we can spray on grouping
columns
c) skew on both the distinct and grouping keys. in this case map-side
aggregation (1a) should work again (since substantial reduction can happen when
the popular distinct keys overlap with popular grouping keys).
so in addition to 1a and 1b above it seems we should, for distincts, offer:
2a - distribute by grouping columns, sort by grouping, distincts with 1
map-reduce
2b - distribute and sort by distinct columns, hash aggregation in reduce side,
- 2 map-reduce
between these 4 (1a/b, 2c/d) options - i think all cases are covered. sort
based group-by's for non-distinct case would probably benefit from the use of a
combiner (longer term).
> when using map-side aggregates - perform single map-reduce group-by
> -------------------------------------------------------------------
>
> Key: HIVE-223
> URL: https://issues.apache.org/jira/browse/HIVE-223
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Joydeep Sen Sarma
> Assignee: Namit Jain
>
> today even when we do map side aggregates - we do multiple map-reduce jobs.
> however - the reason for doing multiple map-reduce group-bys (for single
> group-bys) was the fear of skews. When we are doing map side aggregates -
> skews should not exist for the most part. There can be two reason for skews:
> - large number of entries for a single grouping set - map side aggregates
> should take care of this
> - badness in hash function that sends too much stuff to one reducer - we
> should be able to take care of this by having good hash functions (and prime
> number reducer counts)
> So i think we should be able to do a single stage map-reduce when doing
> map-side aggregates.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.