[jira] [Created] (HIVE-27527) Order of records are not ensured in delete delta files when reduce deduplication is off

Krisztian Kasa (Jira) Mon, 24 Jul 2023 06:44:04 -0700

Krisztian Kasa created HIVE-27527:
-------------------------------------

             Summary: Order of records are not ensured in delete delta files 
when reduce deduplication is off
                 Key: HIVE-27527
                 URL: https://issues.apache.org/jira/browse/HIVE-27527
             Project: Hive
          Issue Type: Bug
            Reporter: Krisztian Kasa
            Assignee: Krisztian Kasa



When reduce sink deduplication is turned off with
{code}
set hive.optimize.reducededuplication=false;
{code}
the Reduce sink operators in delete statements are not merged. Delete delta 
files must be sorted by RowID, and this is ensured by the parent Reduce sink 
operators. In this case the child Reduce sink operator has only a partition 
key column, {{UDFToInteger(_col0)}}, and no sort key, so the sort order may be 
broken and invalid delete delta files may be written.
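
The effect can be illustrated with a toy model. This is only a sketch and not Hive's shuffle implementation: the function names, the task count, and the use of {{bucketid}} as a stand-in for {{UDFToInteger(_col0)}} are assumptions made for illustration. It shows that runs which are each sorted by RowId lose their overall order once they are repartitioned without a sort key.

```python
from typing import Dict, List, Tuple

# A RowId is modelled as (writeid, bucketid, rowid), like the struct in the plan.
RowId = Tuple[int, int, int]


def first_shuffle(rows: List[RowId], num_tasks: int) -> List[List[RowId]]:
    """Toy model of the first edge: rows are spread over tasks (here simply
    by rowid modulo num_tasks, an assumption) and each task sees its rows
    sorted by RowId because the reduce sink has a sort key."""
    tasks: List[List[RowId]] = [[] for _ in range(num_tasks)]
    for row in rows:
        tasks[row[2] % num_tasks].append(row)
    return [sorted(task) for task in tasks]


def second_shuffle(tasks: List[List[RowId]], sort_by_row_id: bool) -> Dict[int, List[RowId]]:
    """Toy model of the second edge: rows are partitioned by bucketid only.
    Without a sort key, rows stay in arrival order (task after task)."""
    partitions: Dict[int, List[RowId]] = {}
    for task_output in tasks:
        for row in task_output:
            partitions.setdefault(row[1], []).append(row)
    if sort_by_row_id:
        for bucket_rows in partitions.values():
            bucket_rows.sort()
    return partitions


def is_sorted(rows: List[RowId]) -> bool:
    """True if rows are in non-decreasing RowId order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


if __name__ == "__main__":
    rows = [(1, 0, r) for r in range(6)]
    tasks = first_shuffle(rows, num_tasks=2)
    print(is_sorted(second_shuffle(tasks, sort_by_row_id=False)[0]))  # False
    print(is_sorted(second_shuffle(tasks, sort_by_row_id=True)[0]))   # True
```

In this model each upstream task emits a sorted run, but the downstream partition only concatenates the runs, so the result is sorted only when the second shuffle also carries the RowId sort key.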

The {{Reduce Output Operator}} in {{Map 1}} has sort keys defined (RowId), but 
the one in {{Reducer 2}} has only Map-reduce partition columns.

{code}
POSTHOOK: query: explain
delete from t1 where a = 3
POSTHOOK: type: QUERY
POSTHOOK: Input: default@t1
POSTHOOK: Output: default@t1
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: t1
                  filterExpr: (a = 3) (type: boolean)
                  Statistics: Num rows: 30 Data size: 120 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (a = 3) (type: boolean)
                    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ROW__ID (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                        null sort order: z
                        sort order: +
                        Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Reducer 2 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  null sort order: 
                  sort order: 
                  Map-reduce partition columns: UDFToInteger(_col0) (type: int)
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
        Reducer 3 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: VALUE._col0 (type: struct<writeid:bigint,bucketid:int,rowid:bigint>)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 76 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                      output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                      serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                      name: default.t1
                  Write Type: DELETE

  Stage: Stage-2
    Dependency Collection

  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.t1
          Write Type: DELETE

  Stage: Stage-3
    Stats Work
      Basic Stats Work:
{code}

Normally the reduce sink deduplication optimization merges these Reduce Sink 
operators. This Jira covers the case where this optimization is turned off.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
