[ 
https://issues.apache.org/jira/browse/HIVE-29597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis resolved HIVE-29597.
----------------------------------------
    Fix Version/s: Not Applicable
       Resolution: Duplicate

> Vectorization: Scratch column reuse sometimes produces wrong results
> --------------------------------------------------------------------
>
>                 Key: HIVE-29597
>                 URL: https://issues.apache.org/jira/browse/HIVE-29597
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>             Fix For: Not Applicable
>
>
> ## Description
> The vectorizer can assign the same physical `LongColumnVector` slot to two 
> different logical columns: an intermediate scratch result (e.g. `cast(STRING 
> AS INT)`) and a small-side broadcast value column carried by a 
> `VectorMapJoinOuter*Operator`. The intermediate writes a typed value (an 
> `int`); the broadcast then writes a typed value of a different logical type 
> (e.g. a `boolean`) to the same slot for matched rows.
> For *unmatched* rows the broadcast write is skipped — only `isNull[i]` is 
> flipped — and the slot retains the stale intermediate value. Any downstream 
> operator that reads `vector[i]` without first checking `isNull[i]` then 
> propagates a value of the wrong logical type into the query result. The bug 
> is silent.
> Empirically reproduced on Apache Hive 4.x master at HEAD `6f2355e22c`. The 
> repro below uses the projection-mode `ColOrCol` path; analogous 
> projection-mode operators that read `vector[i]` directly (`ColAndCol`, 
> long-column arithmetic, equality/comparison expressions, 
> `VectorUDFAdaptor`-routed UDFs) are exposed by the same root cause but are 
> not separately verified by this filing.
> ## When does the bug fire?
> All of the following must hold:
> 1. Vectorized execution is on (default).
> 2. A vectorized OUTER-join MapJoin runs (`VectorMapJoinOuter*Operator`). 
> INNER / LEFT SEMI / ANTI variants do not propagate stale slot data through 
> unmatched rows.
> 3. The vectorizer aliases a scratch column with a `smallTableValueMapping` 
> target — typical when a `LongColumnVector`-backed type (e.g. `int`) is 
> computed for the join-key expression and a `LongColumnVector`-backed 
> broadcast value (e.g. `boolean`) is also produced from the small side.
> 4. Some output row is unmatched on the join keys, going through 
> `generateOuterNulls`.
> 5. A downstream operator reads `vector[i]` of the affected slot without 
> consulting `isNull[i]`.
> 6. The reader's contract is violated by the dirty value (e.g. a strict `== 1` 
> check fails on a non-{0,1} long).
> Removing any one condition masks the bug.
> ## Reproduction
> ```sql
> SET hive.auto.convert.join=true;                    -- default true
> SET hive.auto.convert.join.noconditionaltask=true;  -- default true
> CREATE TABLE t (k STRING, v STRING) STORED AS ORC;
> INSERT INTO t VALUES
>   ('A','1'),('A','2'),('A','3'),
>   ('B','2'),('B','3'),
>   ('C','3'),
>   ('D','1'),('D','3');
> WITH
>   probe AS (
>     SELECT k, v, (CAST(v AS INT) > 0) AS p_bool
>     FROM t WHERE CAST(v AS INT) >= 3
>   ),
>   small_side AS (
>     SELECT k, v, (CAST(v AS INT) > 9999) AS s_bool
>     FROM t
>   ),
>   classified AS (
>     SELECT p.k, p.v, CAST((s.s_bool OR p.p_bool) AS INT) AS observed_value
>     FROM probe p
>     LEFT JOIN small_side s
>       ON  p.k = s.k
>       AND CAST(p.v AS INT) - 1 = CAST(s.v AS INT)
>   ),
>   diagnosed AS (
>     -- Aggregate barrier: prevents Calcite from inlining the inner expression
>     -- back into the outer WHERE and simplifying the bug surface away.
>     SELECT k, v, MAX(observed_value) AS observed_value
>     FROM classified
>     GROUP BY k, v
>   )
> SELECT k, v,
>        observed_value AS observed_value_returned_by_select,
>        1              AS required_value_per_sql_semantics
> FROM diagnosed
> WHERE observed_value = 0
> ORDER BY k, v;
> ```
> **Expected**: zero rows. Every probe row's `(s_bool OR p_bool)` is `TRUE` per 
> SQL semantics (matched: `FALSE OR TRUE`; unmatched: `NULL OR TRUE`), so 
> `cast(... AS INT) = 1` for every row, and no row can satisfy `WHERE 
> observed_value = 0`.
> **Actual**:
> ```
> C    3    0    1
> D    3    0    1
> ```
> Two rows are returned. The filter `WHERE observed_value = 0` accepts them 
> because their stored `observed_value` is 0. The `1 AS 
> required_value_per_sql_semantics` projected alongside makes the contradiction 
> visible per row: SQL semantics demand value 1; the database returned 0.
> `C` and `D` are exactly the unmatched probe rows.
> ## Workaround
> ```sql
> SET hive.vectorized.execution.enabled=false;
> ```
> Same query then returns zero rows.
> ## Mechanism (pointer to the slot reuse)
> `EXPLAIN VECTORIZATION DETAIL` of the repro shows physical column **`4`** 
> allocated for two purposes:
> ```
> bigTableKeyExpressions:
>     LongColSubtractLongScalar(col 4:int, val 1)(children:
>         CastStringToLong(col 1:string) -> 4:int) -> 6:int
> className: VectorMapJoinOuterMultiKeyOperator
> smallTableValueMapping: 4:boolean
> ```
> For matched rows the broadcast writes a `boolean` (0/1) into `vector[col 
> 4][i]`, overwriting the cast intermediate. For unmatched rows 
> `generateOuterNulls` 
> (`ql/.../mapjoin/VectorMapJoinOuterGenerateResultOperator.java`) flips 
> `isNull[col 4][i] = true` but leaves `vector[col 4][i]` holding the stale 
> `int` from `CastStringToLong`. A subsequent projection of `cast(s_bool OR 
> p_bool AS INT)` reads the slot through `ColOrCol` 
> (`ql/.../expressions/ColOrCol.java`) followed by 
> `IfExprLongScalarLongScalar`, and the dirty value (`stale_int | 1`) fails the 
> latter's strict `== 1` check, producing 0.
> ## Suggested fix layers
> Listed root-first. Layers higher in this list cover broader classes of 
> downstream consumers; layers lower in the list narrow the symptom for one 
> specific chain.
> 1. **Vectorizer slot allocation**: don't reuse a scratch column for a 
> `smallTableValueMapping` target. Preserves the typed-slot invariant the rest 
> of the pipeline relies on; protects every downstream operator at once.
> 2. **`generateOuterNulls`**: when flipping `isNull[i]=true`, also write a 
> type-clean sentinel (e.g. `vector[i] = 0` or `LongColumnVector.NULL_VALUE`). 
> Defensive cleanup at the boundary; protects every downstream operator at once.
> 3. **Projection-mode operators that read `vector[i]` directly**: harden each 
> (e.g. `ColOrCol` masks `vector[i] & 1L` before OR'ing). Per-operator fix; 
> many operators to audit.
> 4. **Symptom-narrow fix in `IfExprLongScalarLongScalar`**: tolerate non-{0,1} 
> truthy values via `!= 0`. Hides this specific cast-to-int symptom; leaves 
> arithmetic, equality, and other dirty-input readers unaffected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to