Krisztian Kasa created HIVE-27527:
-------------------------------------

             Summary: Order of records are not ensured in delete delta files when reduce deduplication is off
                 Key: HIVE-27527
                 URL: https://issues.apache.org/jira/browse/HIVE-27527
             Project: Hive
          Issue Type: Bug
            Reporter: Krisztian Kasa
            Assignee: Krisztian Kasa

When
{code}
set hive.optimize.reducededuplication=false;
{code}
is set, the Reduce Sink operators in delete statements are not merged. Delete delta files must be sorted by RowID, and this is ensured by the parent Reduce Sink operator. In this case the child Reduce Sink operator has only a partition key column, {{UDFToInteger(_col0)}}, and no sort keys, so the sort order may be broken and invalid delete delta files may be written.

In the plan below, the {{Reduce Output Operator}} in {{Map 1}} has sort keys defined (RowId), but the one in {{Reducer 2}} has only Map-reduce partition columns.
{code}
POSTHOOK: query: explain delete from t1 where a = 3
POSTHOOK: type: QUERY
POSTHOOK: Input: default@t1
POSTHOOK: Output: default@t1
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  filterExpr: (a = 3) (type: boolean)
                  Statistics: Num rows: 30 Data size: 120 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (a = 3) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                        null sort order: z
                        sort order: +
                        Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
        Reducer 3
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: VALUE._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.t1
                  Write Type: DELETE

  Stage: Stage-2
    Dependency Collection

  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.t1
          Write Type: DELETE

  Stage: Stage-3
    Stats Work
      Basic Stats Work:
{code}
Normally the reduce sink deduplication optimization merges these Reduce Sink operators. This jira aims to cover the case when that optimization is turned off.
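
To illustrate why a second shuffle without sort keys can break the ordering, here is a small toy model in Python. It is not Hive code: the {{shuffle}} helper, the {{RowId}} tuple, the two-task layout and the choice to partition the second stage by {{bucketid}} are all assumptions made for illustration. The model only shows that rows sorted per upstream task are no longer globally sorted once they are concatenated in a downstream task that does not sort again.

```python
from collections import namedtuple

# Hypothetical stand-in for Hive's ROW__ID struct<writeid, bucketid, rowid>.
RowId = namedtuple("RowId", ["writeid", "bucketid", "rowid"])


def shuffle(task_outputs, num_partitions, partition_fn, sort_key=None):
    """Toy shuffle: route every record of every upstream task to a partition.

    With sort_key, each partition is sorted (a reduce sink with sort keys).
    Without it, records stay in arrival order, modelled here as upstream
    outputs concatenated task by task (a reduce sink with partition
    columns only).
    """
    partitions = [[] for _ in range(num_partitions)]
    for output in task_outputs:
        for rec in output:
            partitions[partition_fn(rec) % num_partitions].append(rec)
    if sort_key is not None:
        partitions = [sorted(p, key=sort_key) for p in partitions]
    return partitions


def is_sorted(records):
    return all(a <= b for a, b in zip(records, records[1:]))


# "Map 1": two map tasks emit the row ids of the rows to delete.
map_outputs = [
    [RowId(1, 0, r) for r in (5, 0, 3)],
    [RowId(1, 0, r) for r in (4, 1, 2)],
]

# First reduce sink: has sort keys, so each "Reducer 2" task sees sorted input.
reducer2_inputs = shuffle(map_outputs, 2, lambda r: r.rowid, sort_key=lambda r: r)

# Second reduce sink, partition columns only (assumed: partition by bucketid).
without_keys = shuffle(reducer2_inputs, 1, lambda r: r.bucketid)

# Same shuffle, but with RowId as sort key, as a merged reduce sink would have.
with_keys = shuffle(reducer2_inputs, 1, lambda r: r.bucketid, sort_key=lambda r: r)

print(is_sorted(without_keys[0]), is_sorted(with_keys[0]))
```

In this model the writer fed by the key-less shuffle receives two sorted runs back to back, which is the kind of input that would produce an out-of-order delete delta file.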