Krisztian Kasa created HIVE-27527:
-------------------------------------
Summary: Order of records is not ensured in delete delta files
when reduce deduplication is off
Key: HIVE-27527
URL: https://issues.apache.org/jira/browse/HIVE-27527
Project: Hive
Issue Type: Bug
Reporter: Krisztian Kasa
Assignee: Krisztian Kasa
When
{code}
set hive.optimize.reducededuplication=false;
{code}
the Reduce Sink operators in delete statements are not merged. Delete delta
files must be sorted by RowId, and this is ensured by the parent Reduce Sink
operator. In this case the child Reduce Sink operator has only a partition key
column, {{UDFToInteger(_col0)}}, and no sort keys, so the sort order may be
broken and invalid delete delta files may be written.
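The effect can be illustrated with a small toy model. This is only a sketch and is not Hive code: {{RowId}}, {{reduce_sink}} and {{is_sorted_by_row_id}} are hypothetical names, the first stage's partitioning by {{row_id}} is an assumption made purely to get two upstream tasks, and {{bucket_id}} stands in for {{UDFToInteger(_col0)}}. It shows that a second shuffle with a partition key but no sort key can hand the writer rows that are no longer ordered by RowId, even though every upstream task emitted sorted rows.

```python
from typing import Callable, List, NamedTuple, Optional


class RowId(NamedTuple):
    # Mirrors struct<writeid:bigint,bucketid:int,rowid:bigint>; tuples compare field by field.
    write_id: int
    bucket_id: int
    row_id: int


def is_sorted_by_row_id(rows: List[RowId]) -> bool:
    """True if rows are in non-decreasing (write_id, bucket_id, row_id) order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


def reduce_sink(
    task_outputs: List[List[RowId]],
    num_reducers: int,
    partition_key: Callable[[RowId], int],
    sort_key: Optional[Callable[[RowId], RowId]] = None,
) -> List[List[RowId]]:
    """Toy shuffle: route rows to reducers; sort each reducer's input only if sort_key is given."""
    reducers: List[List[RowId]] = [[] for _ in range(num_reducers)]
    for task in task_outputs:  # rows arrive task by task, in task order
        for row in task:
            reducers[partition_key(row) % num_reducers].append(row)
    if sort_key is not None:
        for rows in reducers:
            rows.sort(key=sort_key)
    return reducers


# Six deleted rows from one bucket, in arbitrary scan order.
rows = [RowId(1, 0, r) for r in (5, 1, 4, 2, 3, 0)]

# First shuffle (like Map 1 -> Reducer 2): sort keys present, two tasks (assumed partitioning).
stage1 = reduce_sink([rows], 2, partition_key=lambda r: r.row_id, sort_key=lambda r: r)

# Second shuffle (like Reducer 2 -> Reducer 3): partition key only, no sort key.
unsorted_stage2 = reduce_sink(stage1, 1, partition_key=lambda r: r.bucket_id)

# Same shuffle with the sort key kept, which is what a merged Reduce Sink would provide.
sorted_stage2 = reduce_sink(
    stage1, 1, partition_key=lambda r: r.bucket_id, sort_key=lambda r: r
)

print(all(is_sorted_by_row_id(p) for p in stage1))
print(is_sorted_by_row_id(unsorted_stage2[0]))
print(is_sorted_by_row_id(sorted_stage2[0]))
```

In this model each first-stage task is sorted on its own, but the second shuffle simply concatenates their outputs, so the final reducer sees an interleaving that is not globally ordered unless a sort key is present.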
The {{Reduce Output Operator}} in {{Map 1}} has sort keys defined (RowId), but
the one in {{Reducer 2}} has only Map-reduce partition columns.
{code}
POSTHOOK: query: explain
delete from t1 where a = 3
POSTHOOK: type: QUERY
POSTHOOK: Input: default@t1
POSTHOOK: Output: default@t1
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  filterExpr: (a = 3) (type: boolean)
                  Statistics: Num rows: 30 Data size: 120 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (a = 3) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                        null sort order: z
                        sort order: +
                        Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
        Reducer 3
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: VALUE._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.t1
                  Write Type: DELETE
  Stage: Stage-2
    Dependency Collection
  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.t1
          Write Type: DELETE
  Stage: Stage-3
    Stats Work
      Basic Stats Work:
{code}
Normally the reduce sink deduplication optimization merges these Reduce Sink
operators. This Jira covers the case when that optimization is turned off.