[ 
https://issues.apache.org/jira/browse/HIVE-223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668917#action_12668917
 ] 

Namit Jain commented on HIVE-223:
---------------------------------

Currently, there are too many options for the user. I think we should do the 
following:

1. Provide a hint that lets the user specify whether there is skew in the 
data. A hint is better than a configurable variable, since skew is 
specific to a query block and not to the whole query.
2. Provide a hint that lets the user specify whether to do map-side 
aggregation. For the same reason as above, a hint might be better than the 
current approach of a configurable variable. However, this can be postponed.
3. By default, assume that there is skew in the data.

The behavior will be:

1. If the query does not have a distinct, use map-side aggregation with 1 
map-reduce job. We could leave the existing option of no map-side 
aggregation for this scenario, but I can't think of any reason why it 
would be useful.
2. If the query has a distinct:
    a. default behavior: skew with map-side aggr: 2 map-reduce jobs
    b. no skew with map-side aggr: 1 map-reduce job
    c. skew with no map-side aggr: 2 map-reduce jobs
    d. no skew with no map-side aggr: 1 map-reduce job
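
The rules above can be sketched as a small decision function. This is only an 
illustration of the proposal, not Hive code: the function name and parameters 
are hypothetical, and it assumes that queries without a distinct always use 
map-side aggregation (that is, the option to turn it off is ignored in that 
case).

```python
def plan_group_by(has_distinct, skew=True, map_side_aggr=True):
    """Return (use_map_side_aggr, num_map_reduce_jobs) for a group-by.

    Hypothetical helper mirroring the proposed behavior:
    - skew defaults to True, because the default assumes skewed data;
    - map_side_aggr defaults to True.
    """
    if not has_distinct:
        # Rule 1: no distinct -> map-side aggregation, single job.
        # Assumption: the "no map-side aggregation" option is ignored here.
        return (True, 1)

    # Rule 2: with a distinct, the job count depends only on the skew hint;
    # the map-side aggregation choice is passed through unchanged.
    num_jobs = 2 if skew else 1
    return (map_side_aggr, num_jobs)


if __name__ == "__main__":
    # Print the four distinct cases (a-d) from the list above.
    for skew in (True, False):
        for aggr in (True, False):
            print(skew, aggr, plan_group_by(True, skew=skew, map_side_aggr=aggr))
```

Note that in this sketch the number of jobs for a query with a distinct is 
decided by the skew hint alone; the map-side aggregation hint only changes 
how each job does its work.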

> when using map-side aggregates - perform single map-reduce group-by
> -------------------------------------------------------------------
>
>                 Key: HIVE-223
>                 URL: https://issues.apache.org/jira/browse/HIVE-223
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Namit Jain
>
> Today, even when we do map-side aggregates, we do multiple map-reduce jobs. 
> However, the reason for doing multiple map-reduce group-bys (for single 
> group-bys) was the fear of skews. When we are doing map-side aggregates, 
> skews should not exist for the most part. There can be two reasons for skews:
> - large number of entries for a single grouping set - map side aggregates 
> should take care of this
> - badness in hash function that sends too much stuff to one reducer - we 
> should be able to take care of this by having good hash functions (and prime 
> number reducer counts)
> So I think we should be able to do a single-stage map-reduce when doing 
> map-side aggregates.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
