[ https://issues.apache.org/jira/browse/HIVE-6057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xuefu Zhang updated HIVE-6057:
------------------------------
    Description:
Currently, you cannot use bucketed SMJ when joining subquery results. It would make sense to be able to explicitly specify bucketing of the intermediate output from a subquery to enable bucketed SMJ.

For example, the following query will NOT use bucketed SMJ:
(gameends and dummymapping are clustered and sorted by hashid into 128 buckets)

{code}
select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid) e join dummymapping m on e.hashid=m.hashid

Suggestion: Implement an INTO n BUCKETS syntax for subqueries to enable bucketed SMJ:

select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid INTO 128 BUCKETS) e join dummymapping m on e.hashid=m.hashid
{code}

  was:
Currently, you cannot use bucketed SMJ when joining subquery results. It would make sense to be able to explicitly specify bucketing of the intermediate output from a subquery to enable bucketed SMJ.

For example, the following query will NOT use bucketed SMJ:
(gameends and dummymapping are clustered and sorted by hashid into 128 buckets)

select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid) e join dummymapping m on e.hashid=m.hashid

Suggestion: Implement an INTO n BUCKETS syntax for subqueries to enable bucketed SMJ:

select * from (select hashid,count(*) as c from gameends group by hashid distribute by hashid sort by hashid INTO 128 BUCKETS) e join dummymapping m on e.hashid=m.hashid

> Enable bucketed sorted merge joins of arbitrary subqueries
> ----------------------------------------------------------
>
>                 Key: HIVE-6057
>                 URL: https://issues.apache.org/jira/browse/HIVE-6057
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: Jan-Erik Hedbom
>            Priority: Minor
>
> Currently, you cannot use bucketed SMJ when joining subquery results. It
> would make sense to be able to explicitly specify bucketing of the
> intermediate output from a subquery to enable bucketed SMJ.
> For example, the following query will NOT use bucketed SMJ:
> (gameends and dummymapping are clustered and sorted by hashid into 128
> buckets)
> {code}
> select * from (select hashid,count(*) as c from gameends group by hashid
> distribute by hashid sort by hashid) e join dummymapping m on
> e.hashid=m.hashid
> Suggestion: Implement an INTO n BUCKETS syntax for subqueries to enable
> bucketed SMJ:
> select * from (select hashid,count(*) as c from gameends group by hashid
> distribute by hashid sort by hashid INTO 128 BUCKETS) e join dummymapping m
> on e.hashid=m.hashid
> {code}

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
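
Until syntax like the proposed INTO n BUCKETS exists, one possible workaround is to materialize the subquery into a bucketed, sorted table and join against that. This is a sketch rather than a confirmed recipe. The table name gameends_counts and the bigint column types are assumptions made for illustration. The settings are the ones commonly used to enable bucketed SMJ in Hive 0.12. Whether the planner actually chooses SMJ should be checked with EXPLAIN.

{code}
-- Hypothetical intermediate table; name and column types are assumed.
create table gameends_counts (hashid bigint, c bigint)
clustered by (hashid) sorted by (hashid) into 128 buckets;

-- Make the insert honor the table's declared bucketing and sort order.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

insert overwrite table gameends_counts
select hashid, count(*) as c from gameends group by hashid;

-- Settings that allow the join to be converted to a bucketed SMJ.
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;

select * from gameends_counts e join dummymapping m on e.hashid=m.hashid;
{code}

The cost of this approach is an extra write and read of the intermediate result, which is the overhead the suggested subquery syntax would avoid.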