erratic-pattern opened a new issue, #20937:
URL: https://github.com/apache/datafusion/issues/20937

   ### Describe the bug
   
   When a `VALUES` clause is joined against a Parquet table with 
`Dictionary(Int32, Utf8)` columns and `pushdown_filters = true`, the query 
fails with:
   
   ```
   Parquet error: External: Compute error: Error evaluating filter predicate:
   ArrowError(InvalidArgumentError("Can't compare arrays of different types"), 
Some(""))
   ```
   
   The dynamic filter pushdown from the `HashJoinExec` creates an `InListExpr` 
that is pushed into the Parquet scan. 
[`ArrayStaticFilter::contains()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L90)
 unwraps the needle (the Dictionary array from the Parquet batch) to its plain 
Utf8 values via 
[`downcast_dictionary_array!`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L102-L108),
 but does not unwrap `in_array` (the stored InList values, which are also 
Dictionary from the type-coerced join build side). This results in a 
[`make_comparator(Utf8, 
Dictionary)`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L116)
 call, which 
[`arrow_ord::ord::build_compare`](https://docs.rs/arrow-ord/57.3.0/src/arrow_ord/ord
 .rs.html#475-490) does not support for mixed Dictionary/non-Dictionary types.
   
   ### To Reproduce
   
   Using the DataFusion CLI (`datafusion-cli`):
   
   ```sql
   -- Enable row-level filter pushdown (required to trigger the bug)
   SET datafusion.execution.parquet.pushdown_filters = true;
   SET datafusion.execution.parquet.reorder_filters = true;
   
   -- Create a Parquet file with Dictionary-encoded string columns
   COPY (
     SELECT
       arrow_cast(chr(65 + (row_num % 26)), 'Dictionary(Int32, Utf8)') as tag1,
       row_num * 1.0 as value
     FROM (SELECT unnest(range(0, 10000)) as row_num)
   ) TO '/tmp/dict_filter_bug.parquet';
   
   -- This query fails
   SELECT t.tag1, t.value
   FROM '/tmp/dict_filter_bug.parquet' t
   JOIN (VALUES ('A'), ('B')) AS v(c1)
   ON t.tag1 = v.c1;
   ```
   
   ### Expected behavior
   
   The query should return matching rows (tag1 = 'A' or 'B').
   
   ### Actual behavior
   
   ```
   Parquet error: External: Compute error: Error evaluating filter predicate:
   ArrowError(InvalidArgumentError("Can't compare arrays of different types"), 
Some(""))
   ```
   
   ### Root cause
   
   The error path through the code:
   
   1. The VALUES clause produces `Utf8` strings. Type coercion casts them to 
`Dictionary(Int32, Utf8)` to match the Parquet column type for the join key.
   2. The `HashJoinExec` builds the dynamic filter InList from the cast 
build-side arrays → `in_array` is `Dictionary(Int32, Utf8)`.
   3. 
[`instantiate_static_filter()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L157-L159)
 receives the Dictionary array and falls through to 
`ArrayStaticFilter::try_new()`, which stores `in_array` as-is.
   4. 
[`InListExpr::try_new_from_array()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L647)
 is called from 
[`create_membership_predicate()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs#L124-L128)
 in the dynamic filter pushdown code.
   5. At runtime, `ArrayStaticFilter::contains(v)` is called with a Dictionary 
array from the Parquet batch.
   6. 
[`downcast_dictionary_array!`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L102-L108)
 matches and recursively calls `self.contains(v.values())` with the unwrapped 
Utf8 values.
   7. In the recursive call, `v` is now `Utf8` but `self.in_array` is still 
`Dictionary(Int32, Utf8)`.
   8. [`make_comparator(Utf8, 
Dictionary)`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L116)
 fails because `arrow_ord::ord::build_compare` has no arm for mixed 
Dictionary/non-Dictionary types.
   
   ### Possible fix
   
   The most straightforward fix would be to normalize `in_array` in 
[`ArrayStaticFilter::try_new()`](https://github.com/apache/arrow-datafusion/blob/d09ff92def5ad5f033e75bbfda1fd9d33a4c93c9/datafusion/physical-expr/src/expressions/in_list.rs#L187)
 — if the input is a Dictionary, unwrap it to its values array before storing. 
This is a one-time cost at filter construction, and ensures that after 
`contains()` unwraps the needle Dictionary, both sides are the same plain type 
and the existing arrow comparison kernels work correctly.
   
   ### Additional context
   
   - This only manifests when `pushdown_filters = true` (defaults to `false`), 
because the row-level filter evaluation inside the Parquet reader is what 
triggers `ArrayStaticFilter::contains()` with Dictionary arrays. With 
`pushdown_filters = false`, the InList filter is only used for statistics/bloom 
filter pruning (which works correctly), and the actual row filtering happens in 
the HashJoin.
   - The bug applies to single-column equi-joins where the VALUES side is cast 
to Dictionary by type coercion and the dynamic filter InList path is taken 
(small build side).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to