alamb opened a new issue, #20696:
URL: https://github.com/apache/datafusion/issues/20696
### Describe the bug
A multi-column `INNER JOIN` with dictionary-encoded string keys fails at
runtime when scanning Parquet tables with
`datafusion.execution.parquet.pushdown_filters = true`.
The failure is:
`Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"),
Some(""))`
This appears related to hash join dynamic filter (`InList`) pushdown over
dictionary columns.
### To Reproduce
Environment:
- `datafusion-cli 52.2.0`
Steps:
1. Save and run the attached SQL script with `datafusion-cli` from a clean
directory:
```bash
datafusion-cli -q -f repro_datafusion_cli_multi_column_dictionary_join.sql
```
2. Script contents:
```sql
-- Required to reproduce the failure path.
SET datafusion.execution.parquet.pushdown_filters = true;
CREATE TABLE h2o AS
SELECT
to_timestamp_nanos(time_ns) AS time,
arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
temp
FROM (
VALUES
(200, 'CA', 'LA', 90.0),
(250, 'MA', 'Boston', 72.4),
(100, 'MA', 'Boston', 70.4),
(350, 'CA', 'LA', 90.0)
) AS t(time_ns, state, city, temp);
CREATE TABLE o2 AS
SELECT
to_timestamp_nanos(time_ns) AS time,
arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
temp,
reading
FROM (
VALUES
(250, 'MA', 'Boston', 53.4, 51.0),
(100, 'MA', 'Boston', 50.4, 50.0)
) AS t(time_ns, state, city, temp, reading);
CREATE EXTERNAL TABLE h2o_parquet_tbl STORED AS PARQUET LOCATION
'h2o_parquet';
CREATE EXTERNAL TABLE o2_parquet_tbl STORED AS PARQUET LOCATION 'o2_parquet';
SELECT h2o_parquet_tbl.temp AS h2o_temp, o2_parquet_tbl.temp AS o2_temp,
o2_parquet_tbl.reading
FROM h2o_parquet_tbl
INNER JOIN o2_parquet_tbl ON h2o_parquet_tbl.time = o2_parquet_tbl.time
AND h2o_parquet_tbl.state = o2_parquet_tbl.state
AND h2o_parquet_tbl.city = o2_parquet_tbl.city
WHERE h2o_parquet_tbl.time >= '1970-01-01T00:00:00.000000050Z'
AND h2o_parquet_tbl.time <= '1970-01-01T00:00:00.000000300Z';
```
3. Observed output:
```text
+-------+
| count |
+-------+
| 4 |
+-------+
+-------+
| count |
+-------+
| 2 |
+-------+
Parquet error: External: Compute error: Error evaluating filter predicate:
ArrowError(InvalidArgumentError("Can't compare arrays of different types"),
Some(""))
```
### Expected behavior
The query should succeed and return (which it does if run directly from the
input 0or if `SET datafusion.execution.parquet.pushdown_filters = false` is
removed
```text
+----------+---------+---------+
| h2o_temp | o2_temp | reading |
+----------+---------+---------+
| 70.4 | 50.4 | 50.0 |
| 72.4 | 53.4 | 51.0 |
+----------+---------+---------+
```
### Additional context
- If `SET datafusion.execution.parquet.pushdown_filters = false`, the query
succeeds and returns the expected 2 rows.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]