Maciek Kocon created HIVE-9523:
----------------------------------
Summary: when columns on which tables are partitioned are used in
the join condition same join optimizations as for bucketed tables should be
applied
Key: HIVE-9523
URL: https://issues.apache.org/jira/browse/HIVE-9523
Project: Hive
Issue Type: Improvement
Components: Logical Optimizer, Physical Optimizer, SQL
Affects Versions: 0.13.1, 0.14.0, 0.13.0
Reporter: Maciek Kocon
For JOIN conditions where partitioning criteria are used respectively:
⋮
FROM TabA JOIN TabB
ON TabA.partCol1 = TabB.partCol2
AND TabA.partCol2 = TabB.partCol2
the optimizer could/should choose to treat it the same way as with bucketed
tables: ⋮
FROM TabC
JOIN TabD
ON TabC.clusteredByCol1 = TabD.clusteredByCol2
AND TabC.clusteredByCol2 = TabD.clusteredByCol2
and use either Bucket Map Join or better, the Sort Merge Bucket Map Join.
This is based on fact that same way as buckets translate to separate files, the
partitions essentially provide the same mapping.
When data locality is known the optimizer could focus only on joining
corresponding partitions rather than whole data sets.
#side notes:
⦿ Currently Table DDL Syntax where Partitioning and Bucketing defined at the
same time is allowed:
CREATE TABLE
⋮
PARTITIONED BY(…) CLUSTERED BY(…) INTO … BUCKETS;
But in this case optimizer never chooses to use Bucket Map Join or Sort Merge
Bucket Map Join which defeats the purpose of creating BUCKETed tables in such
scenarios. Should that be raised as a separate BUG?
⦿ Currently partitioning and bucketing are two separate things but serve same
purpose - shouldn't the concept be merged (explicit/implicit partitions?)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)