Deepak Jaiswal created HIVE-21196: ------------------------------------- Summary: Support semijoin reduction on multiple column join Key: HIVE-21196 URL: https://issues.apache.org/jira/browse/HIVE-21196 Project: Hive Issue Type: Bug Reporter: Deepak Jaiswal Assignee: Deepak Jaiswal
Currently for a query involving join on multiple columns creates separate semi join edges for each key which in turn create a bloom filter for each of them, like below, EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on (srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = srcpart_small_n3.value1) {code:java} Map 1 <- Reducer 5 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE) #### A masked pattern was here #### Vertices: Map 1 Map Operator Tree: TableScan alias: srcpart_date_n7 filterExpr: (key is not null and value is not null and (key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter)))) (type: boolean) Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ((key BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null and value is not null) (type: boolean) Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: key (type: string), value (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string) Statistics: Num rows: 2000 Data size: 356000 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: vectorized, llap LLAP IO: all inputs Map 4 Map Operator Tree: TableScan alias: srcpart_small_n3 filterExpr: (key1 is not null and value1 is not null) (type: boolean) Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL Filter Operator predicate: (key1 is not null and value1 is not null) (type: boolean) Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL Select Operator expressions: key1 (type: string), value1 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string) Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL Select Operator expressions: _col0 (type: string) outputColumnNames: _col0 Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL Column stats: PARTIAL Group By Operator aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=20) mode: hash outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary) Execution mode: vectorized, llap LLAP IO: all inputs Reducer 2 Execution mode: llap Reduce Operator Tree: Merge Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: string), _col1 (type: string) 1 _col0 (type: string), _col1 (type: string) Statistics: Num rows: 2200 Data size: 391600 Basic stats: PARTIAL Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE value expressions: _col0 (type: bigint) Reducer 3 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Reducer 5 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=20) mode: final outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL Column stats: PARTIAL value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: binary) {code} Instead it should create one branch for a join with one bloom filter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)