[jira] [Created] (HIVE-21196) Support semijoin reduction on multiple column join

Deepak Jaiswal (JIRA) Thu, 31 Jan 2019 15:44:03 -0800

Deepak Jaiswal created HIVE-21196:
-------------------------------------

             Summary: Support semijoin reduction on multiple column join
                 Key: HIVE-21196
                 URL: https://issues.apache.org/jira/browse/HIVE-21196
             Project: Hive
          Issue Type: Bug
            Reporter: Deepak Jaiswal
            Assignee: Deepak Jaiswal



Currently for a query involving join on multiple columns creates  separate semi 
join edges for each key which in turn create a bloom filter for each of them, 
like below,

EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on 
(srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = 
srcpart_small_n3.value1)
{code:java}

Map 1 <- Reducer 5 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
        Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_n7
                  filterExpr: (key is not null and value is not null and (key 
BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND 
DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, 
DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter)))) (type: boolean)
                  Statistics: Num rows: 2000 Data size: 356000 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((key BETWEEN 
DynamicValue(RS_7_srcpart_small_n3_key1_min) AND 
DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, 
DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null and 
value is not null) (type: boolean)
                    Statistics: Num rows: 2000 Data size: 356000 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: key (type: string), value (type: string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 2000 Data size: 356000 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col1 (type: 
string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), 
_col1 (type: string)
                        Statistics: Num rows: 2000 Data size: 356000 Basic 
stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: srcpart_small_n3
                  filterExpr: (key1 is not null and value1 is not null) (type: 
boolean)
                  Statistics: Num rows: 20 Data size: 3560 Basic stats: PARTIAL 
Column stats: PARTIAL
                  Filter Operator
                    predicate: (key1 is not null and value1 is not null) (type: 
boolean)
                    Statistics: Num rows: 20 Data size: 3560 Basic stats: 
PARTIAL Column stats: PARTIAL
                    Select Operator
                      expressions: key1 (type: string), value1 (type: string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 20 Data size: 3560 Basic stats: 
PARTIAL Column stats: PARTIAL
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col1 (type: 
string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), 
_col1 (type: string)
                        Statistics: Num rows: 20 Data size: 3560 Basic stats: 
PARTIAL Column stats: PARTIAL
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 20 Data size: 3560 Basic stats: 
PARTIAL Column stats: PARTIAL
                        Group By Operator
                          aggregations: min(_col0), max(_col0), 
bloom_filter(_col0, expectedEntries=20)
                          mode: hash
                          outputColumnNames: _col0, _col1, _col2
                          Statistics: Num rows: 1 Data size: 730 Basic stats: 
PARTIAL Column stats: PARTIAL
                          Reduce Output Operator
                            sort order: 
                            Statistics: Num rows: 1 Data size: 730 Basic stats: 
PARTIAL Column stats: PARTIAL
                            value expressions: _col0 (type: string), _col1 
(type: string), _col2 (type: binary)
            Execution mode: vectorized, llap
            LLAP IO: all inputs
        Reducer 2 
            Execution mode: llap
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col1 (type: string)
                Statistics: Num rows: 2200 Data size: 391600 Basic stats: 
PARTIAL Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL 
Column stats: NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL 
Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL 
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 5 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), 
bloom_filter(VALUE._col2, expectedEntries=20)
                mode: final
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL 
Column stats: PARTIAL
                Reduce Output Operator
                  sort order: 
                  Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL 
Column stats: PARTIAL
                  value expressions: _col0 (type: string), _col1 (type: 
string), _col2 (type: binary)
{code}
Instead it should create one branch for a join with one bloom filter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (HIVE-21196) Support semijoin reduction on multiple column join

Reply via email to