[
https://issues.apache.org/jira/browse/HIVE-29598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Soumyakanti Das resolved HIVE-29598.
------------------------------------
Fix Version/s: 4.3.0
Resolution: Fixed
Thank you [~ryu_kobayashi] for the PR, and [~konstantinb] for reporting,
reviewing, and co-authoring it!
Merged to master at
https://github.com/apache/hive/commit/9e316c3203acc5631082b0bb3c976385164d431b
> Vectorization: Scratch column reuse sometimes produces wrong results
> --------------------------------------------------------------------
>
> Key: HIVE-29598
> URL: https://issues.apache.org/jira/browse/HIVE-29598
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Ryu Kobayashi
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.3.0
>
>
> h2. Description
> The vectorizer can assign the same physical {{LongColumnVector}} slot to two
> different logical columns: an intermediate scratch result (e.g. {{cast(STRING
> AS INT)}}) and a small-side broadcast value column carried by a
> {{VectorMapJoinOuter*Operator}}. The intermediate writes a typed value (an
> {{int}}); the broadcast then writes a typed value of a different logical type
> (e.g. a {{boolean}}) to the same slot for matched rows.
> For *unmatched* rows the broadcast write is skipped — only {{isNull[i]}} is
> flipped — and the slot retains the stale intermediate value. Any downstream
> operator that reads {{vector[i]}} without first checking {{isNull[i]}} then
> propagates a value of the wrong logical type into the query result. The bug
> is silent.
> Reproduced on Apache Hive master, commit {{6f2355e22c}} ({{pom.xml}} declares
> version {{4.3.0-SNAPSHOT}}). Released 4.x versions sharing this
> {{Vectorizer}} / {{VectorMapJoinOuter*Operator}} / {{ColOrCol}} code are
> likely also exposed but were not separately verified by this filing.
> The repro below uses the projection-mode {{ColOrCol}} path; analogous
> projection-mode operators that read {{vector[i]}} directly ({{ColAndCol}},
> long-column arithmetic, equality/comparison expressions,
> {{VectorUDFAdaptor}}-routed UDFs) are exposed by the same root cause but are
> not separately verified by this filing.
> h2. When does the bug fire?
> All of the following co-occurring conditions are required:
> * Vectorized execution is on (default).
> * A vectorized OUTER-join MapJoin runs ({{VectorMapJoinOuter*Operator}}).
> INNER / LEFT SEMI / ANTI variants do not propagate stale slot data through
> unmatched rows.
> * The vectorizer aliases a scratch column with a {{smallTableValueMapping}}
> target — typical when a {{LongColumnVector}}-backed type (e.g. {{int}}) is
> computed for the join-key expression and a {{LongColumnVector}}-backed
> broadcast value (e.g. {{boolean}}) is also produced from the small side.
> * Some output row is unmatched on the join keys, going through
> {{generateOuterNulls}}.
> * A downstream operator reads {{vector[i]}} of the affected slot without
> consulting {{isNull[i]}}.
> * The reader's contract is violated by the dirty value (e.g. a strict {{==
> 1}} check fails on a long that is neither 0 nor 1).
> Removing any one condition masks the bug.
> h2. Reproduction
> {code:sql}
> SET hive.auto.convert.join=true; /* default true */
> SET hive.auto.convert.join.noconditionaltask=true; /* default true */
> CREATE TABLE t (k STRING, v STRING) STORED AS ORC;
> INSERT INTO t VALUES
> ('A','1'),('A','2'),('A','3'),
> ('B','2'),('B','3'),
> ('C','3'),
> ('D','1'),('D','3');
> WITH
> probe AS (
> SELECT k, v, (CAST(v AS INT) > 0) AS p_bool
> FROM t WHERE CAST(v AS INT) >= 3
> ),
> small_side AS (
> SELECT k, v, (CAST(v AS INT) > 9999) AS s_bool
> FROM t
> ),
> classified AS (
> SELECT p.k, p.v, CAST((s.s_bool OR p.p_bool) AS INT) AS observed_value
> FROM probe p
> LEFT JOIN small_side s
> ON p.k = s.k
> AND CAST(p.v AS INT) - 1 = CAST(s.v AS INT)
> ),
> diagnosed AS (
> /* Aggregate barrier: prevents Calcite from inlining the inner expression
> back into the outer WHERE and simplifying the bug surface away. */
> SELECT k, v, MAX(observed_value) AS observed_value
> FROM classified
> GROUP BY k, v
> )
> SELECT k, v,
> observed_value AS observed_value_returned_by_select,
> 1 AS required_value_per_sql_semantics
> FROM diagnosed
> WHERE observed_value = 0
> ORDER BY k, v;
> {code}
> *Expected:* zero rows. Every probe row's {{(s_bool OR p_bool)}} is {{TRUE}}
> per SQL semantics (matched: {{FALSE OR TRUE}}; unmatched: {{NULL OR TRUE}}),
> so {{cast(... AS INT) = 1}} for every row, and no row can satisfy {{WHERE
> observed_value = 0}}.
> *Actual:*
> {noformat}
> C 3 0 1
> D 3 0 1
> {noformat}
> Two rows are returned. The filter {{WHERE observed_value = 0}} accepts them
> because their stored {{observed_value}} is 0. The {{1 AS
> required_value_per_sql_semantics}} projected alongside makes the
> contradiction visible per row: SQL semantics demand value 1; the database
> returned 0.
> {{C}} and {{D}} are exactly the unmatched probe rows.
> h2. Workaround
> {code:sql}
> SET hive.vectorized.execution.enabled=false;
> {code}
> Same query then returns zero rows.
> h2. Mechanism (pointer to the slot reuse)
> {{EXPLAIN VECTORIZATION DETAIL}} of the repro shows physical column *4*
> allocated for two purposes:
> {noformat}
> bigTableKeyExpressions:
> LongColSubtractLongScalar(col 4:int, val 1)(children:
> CastStringToLong(col 1:string) -> 4:int) -> 6:int
> className: VectorMapJoinOuterMultiKeyOperator
> smallTableValueMapping: 4:boolean
> {noformat}
> For matched rows the broadcast writes a {{boolean}} (0/1) into {{vector[col
> 4][i]}}, overwriting the cast intermediate. For unmatched rows
> {{generateOuterNulls}}
> ({{ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinOuterGenerateResultOperator.java}})
> flips {{isNull[col 4][i] = true}} but leaves {{vector[col 4][i]}} holding
> the stale {{int}} from {{CastStringToLong}}. A subsequent projection of
> {{cast(s_bool OR p_bool AS INT)}} reads the slot through {{ColOrCol}}
> ({{ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/ColOrCol.java}})
> followed by {{IfExprLongScalarLongScalar}}, and the dirty value ({{stale_int
> | 1}}) fails the latter's strict {{== 1}} check, producing 0.
> Additional note: disabling CBO (which steers the plan away from this
> vectorized path) is also observed to produce correct results, consistent with
> the analysis above.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)