[ https://issues.apache.org/jira/browse/HIVE-23893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162744#comment-17162744 ]
Zhihua Deng edited comment on HIVE-23893 at 7/22/20, 12:08 PM: --------------------------------------------------------------- [~pgaref] the problem can be reproduced on master, like query: set hive.cbo.enable=false; create table a(k string) partitioned by(hs int); create table b(k string) partitioned by(hs int); explain extended select a.**, b.** from a join b on a.k = b.k where rand(100) < 0.1 and a.hs = 11 and b.hs = 10; Query Plan: {noformat} STAGE PLANS: Stage: Stage-1 Tez #### A masked pattern was here #### Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) #### A masked pattern was here #### Vertices: Map 1 Map Operator Tree: TableScan alias: a filterExpr: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL GatherStats: false Filter Operator isSamplingPred: false predicate: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator bucketingVersion: 2 key expressions: k (type: string) null sort order: z numBuckets: -1 sort order: + Map-reduce partition columns: k (type: string) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL tag: 0 value expressions: hs (type: int) auto parallelism: true Execution mode: vectorized, llap LLAP IO: unknown Map 3 Map Operator Tree: TableScan alias: b filterExpr: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL GatherStats: false Filter Operator isSamplingPred: false predicate: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator bucketingVersion: 2 key expressions: k (type: string) null sort order: z numBuckets: -1 sort order: + Map-reduce partition columns: k (type: string) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL tag: 1 value expressions: hs (type: int) auto parallelism: true Execution mode: vectorized, llap LLAP IO: unknown Reducer 2 Execution mode: llap Needs Tagging: false Reduce Operator Tree: Merge Join Operator condition map: Inner Join 0 to 1 keys: 0 k (type: string) 1 k (type: string) outputColumnNames: _col0, _col1, _col5, _col6 Position of Big Table: 0 Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator isSamplingPred: false predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 = 10)) (type: boolean) Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL Select Operator expressions: _col0 (type: string), 11 (type: int), _col5 (type: string), 10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL File Output Operator bucketingVersion: 2 compressed: false GlobalTableId: 0 #### A masked pattern was here #### NumFilesPerFileSink: 1 Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL #### A masked pattern was here #### table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat properties: bucketing_version -1 columns _col0,_col1,_col2,_col3 columns.types string:int:string:int escape.delim \ hive.serialization.extend.additional.nesting.levels true serialization.escape.crlf true serialization.format 1 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe TotalFiles: 1 GatherStats: false MultiFileSpray: false{noformat} was (Author: dengzh): [~pgaref] the problem can be reproduced on master, like query: set hive.cbo.enable=false; create table a(k string) partitioned by(hs int); create table b(k string) partitioned by(hs int); explain extended select a.*, b.* from a join b on a.k = b.k where rand(100) < 0.1 and a.hs = 11 and b.hs = 10; Query Plan: {noformat} STAGE PLANS: Stage: Stage-1 Tez #### A masked pattern was here #### Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) #### A masked pattern was here #### Vertices: Map 1 Map Operator Tree: TableScan alias: a filterExpr: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL GatherStats: false Filter Operator isSamplingPred: false predicate: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator bucketingVersion: 2 key expressions: k (type: string) null sort order: z numBuckets: -1 sort order: + Map-reduce partition columns: k (type: string) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL tag: 0 value expressions: hs (type: int) auto parallelism: true Execution mode: vectorized, llap LLAP IO: unknown Map 3 Map Operator Tree: TableScan alias: b filterExpr: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL GatherStats: false Filter Operator isSamplingPred: false predicate: k is not null (type: boolean) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL Reduce Output Operator bucketingVersion: 2 key expressions: k (type: string) null sort order: z numBuckets: -1 sort order: + Map-reduce partition columns: k (type: string) Statistics: Num rows: 1 Data size: 188 Basic stats: COMPLETE Column stats: PARTIAL tag: 1 value expressions: hs (type: int) auto parallelism: true Execution mode: vectorized, llap LLAP IO: unknown Reducer 2 Execution mode: llap Needs Tagging: false Reduce Operator Tree: Merge Join Operator condition map: Inner Join 0 to 1 keys: 0 k (type: string) 1 k (type: string) outputColumnNames: _col0, _col1, _col5, _col6 Position of Big Table: 0 Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL Filter Operator isSamplingPred: false predicate: ((rand(100) < 0.1) and (_col1 = 11) and (_col6 = 10)) (type: boolean) Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL Select Operator expressions: _col0 (type: string), 11 (type: int), _col5 (type: string), 10 (type: int) outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL File Output Operator bucketingVersion: 2 compressed: false GlobalTableId: 0 #### A masked pattern was here #### NumFilesPerFileSink: 1 Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: PARTIAL #### A masked pattern was here #### table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat properties: bucketing_version -1 columns _col0,_col1,_col2,_col3 columns.types string:int:string:int escape.delim \ hive.serialization.extend.additional.nesting.levels true serialization.escape.crlf true serialization.format 1 serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe TotalFiles: 1 GatherStats: false MultiFileSpray: false{noformat} > Extract deterministic conditions for pdd when the predicate contains > non-deterministic function > ----------------------------------------------------------------------------------------------- > > Key: HIVE-23893 > URL: https://issues.apache.org/jira/browse/HIVE-23893 > Project: Hive > Issue Type: Improvement > Components: Logical Optimizer > Reporter: Zhihua Deng > Priority: Major > > Taken the following query for example, assume unix_timestamp is > non-deterministic before version 1.3.0: > > {{SELECT}} > {{ from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') AS ft,}} > {{ b.game_id AS game_id,}} > {{ b.game_name AS game_name,}} > {{ count(DISTINCT a.sha1_imei) uv}} > {{FROM}} > {{ gamesdk_userprofile a}} > {{ JOIN game_info_all b ON a.appid = b.dev_app_id}} > {{WHERE}} > {{ a.date = 20200704}} > {{ AND from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd') = > 20200704}} > {{ AND b.date = 20200704}} > {{GROUP BY}} > {{ from_unixtime(unix_timestamp(a.first_dt), 'yyyyMMdd'),}} > {{ b.game_id,}} > {{ b.game_name}} > {{ORDER BY}} > {{ uv DESC}} > {{LIMIT 200;}} > > The predicates(a.date = 20200704, b.date = 20200704) are unable to push down > to join op, make the optimizer unable to prune partitions, which may result > to a full scan on tables gamesdk_userprofile and game_info_all. -- This message was sent by Atlassian Jira (v8.3.4#803005)