alamb opened a new issue, #20696:
URL: https://github.com/apache/datafusion/issues/20696

   ### Describe the bug
   
   A multi-column `INNER JOIN` with dictionary-encoded string keys fails at 
runtime when scanning Parquet tables with 
`datafusion.execution.parquet.pushdown_filters = true`.
   
   The failure is:
   
   `Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))`
   
   This appears related to hash join dynamic filter (`InList`) pushdown over 
dictionary columns.
   
   ### To Reproduce
   
   Environment:
   - `datafusion-cli 52.2.0`
   
   Steps:
   1. Save and run the attached SQL script with `datafusion-cli` from a clean 
directory:
   
   ```bash
   datafusion-cli -q -f repro_datafusion_cli_multi_column_dictionary_join.sql
   ```
   
   2. Script contents:
   
   ```sql
   -- Required to reproduce the failure path.
   SET datafusion.execution.parquet.pushdown_filters = true;
   
   CREATE TABLE h2o AS
   SELECT
     to_timestamp_nanos(time_ns) AS time,
     arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
     arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
     temp
   FROM (
     VALUES
       (200, 'CA', 'LA', 90.0),
       (250, 'MA', 'Boston', 72.4),
       (100, 'MA', 'Boston', 70.4),
       (350, 'CA', 'LA', 90.0)
   ) AS t(time_ns, state, city, temp);
   
   CREATE TABLE o2 AS
   SELECT
     to_timestamp_nanos(time_ns) AS time,
     arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
     arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
     temp,
     reading
   FROM (
     VALUES
       (250, 'MA', 'Boston', 53.4, 51.0),
       (100, 'MA', 'Boston', 50.4, 50.0)
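   ) AS t(time_ns, state, city, temp, reading);
   
   -- Assumed step, inferred from the two `count` results in the observed output:
   -- write both tables to the Parquet locations read below.
   COPY h2o TO 'h2o_parquet' STORED AS PARQUET;
   COPY o2 TO 'o2_parquet' STORED AS PARQUET;
   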
   CREATE EXTERNAL TABLE h2o_parquet_tbl STORED AS PARQUET LOCATION 'h2o_parquet';
   CREATE EXTERNAL TABLE o2_parquet_tbl STORED AS PARQUET LOCATION 'o2_parquet';
   
   SELECT h2o_parquet_tbl.temp AS h2o_temp, o2_parquet_tbl.temp AS o2_temp, o2_parquet_tbl.reading
   FROM h2o_parquet_tbl
   INNER JOIN o2_parquet_tbl ON h2o_parquet_tbl.time = o2_parquet_tbl.time
       AND h2o_parquet_tbl.state = o2_parquet_tbl.state
       AND h2o_parquet_tbl.city = o2_parquet_tbl.city
   WHERE h2o_parquet_tbl.time >= '1970-01-01T00:00:00.000000050Z'
       AND h2o_parquet_tbl.time <= '1970-01-01T00:00:00.000000300Z';
   ```
   
   3. Observed output:
   
   ```text
   +-------+
   | count |
   +-------+
   | 4     |
   +-------+
   +-------+
   | count |
   +-------+
   | 2     |
   +-------+
   Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))
   ```
   
   
   ### Expected behavior
   
   The query should succeed and return the rows below. It does so when run directly against the input tables (`h2o` and `o2`), or when `datafusion.execution.parquet.pushdown_filters` is set to `false`:
   
   ```text
   +----------+---------+---------+
   | h2o_temp | o2_temp | reading |
   +----------+---------+---------+
   | 70.4     | 50.4    | 50.0    |
   | 72.4     | 53.4    | 51.0    |
   +----------+---------+---------+
   ```
   
   ### Additional context
   
   - If `SET datafusion.execution.parquet.pushdown_filters = false`, the query 
succeeds and returns the expected 2 rows.
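   
   The error message points at a type mismatch, so one diagnostic step is to check how the join keys are typed when read back from Parquet, and then to confirm the workaround. The following is a minimal sketch, assuming the tables from the reproduction script already exist in the current session; `arrow_typeof` is used here only for inspection and is not part of the original script.
   
   ```sql
   -- Show the Arrow types of the join key columns as read from Parquet.
   SELECT arrow_typeof(state) AS state_type, arrow_typeof(city) AS city_type
   FROM h2o_parquet_tbl
   LIMIT 1;
   
   -- Workaround from this report: disable Parquet filter pushdown,
   -- then rerun the join query from the reproduction script.
   SET datafusion.execution.parquet.pushdown_filters = false;
   ```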


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

