rroelke opened a new issue, #19290:
URL: https://github.com/apache/datafusion/issues/19290
### Describe the bug
I have a query predicate `left = right` where `left` is a column with type
`Utf8View` and `right` is a column with type `Dictionary(UInt8, LargeUtf8)`.
The SQL planner returns `Err` for this query because it cannot find a coercion
path for these two types. However, a coercion pathway exists: modifying the
query to use `left = arrow_cast(right, 'LargeUtf8')` allows it to succeed.
### To Reproduce
I am unable to attach a `.rs` file for some reason, so below is the contents
of my reproducer which can very simply be compiled with datafusion as the only
dependency.
```
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion::catalog::MemTable;
use datafusion::common::Result as DataFusionResult;
use datafusion::common::arrow::array::{
DictionaryArray, Int64Array, LargeStringArray, StringViewArray,
UInt8Array, UInt64Array,
};
use datafusion::common::arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::arrow::record_batch::RecordBatch;
use datafusion::prelude::SessionContext;
const STATES: [&str; 50] = [
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI",
"ID", "IL", "IN", "IA",
"KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT",
"NE", "NV", "NH", "NJ",
"NM", "MY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD",
"TN", "TX", "UT", "VT",
"VA", "WA", "WV", "WI", "WY",
];
async fn create_table_with_text() -> DataFusionResult<MemTable> {
let states_schema = Arc::new(Schema::new(vec![
Field::new("rank", DataType::UInt64, false),
Field::new("state", DataType::Utf8View, false),
]));
let (ranks, states) = STATES
.iter()
.enumerate()
.collect::<(Vec<usize>, Vec<&str>)>();
let states_batch = RecordBatch::try_new(
Arc::clone(&states_schema),
vec![
Arc::new(ranks.iter().map(|r| *r as
u64).collect::<UInt64Array>()),
Arc::new(states.iter().map(Some).collect::<StringViewArray>()),
],
)?;
MemTable::try_new(states_schema, vec![vec![states_batch]])
}
async fn create_table_with_dictionary() -> DataFusionResult<MemTable> {
let gub_schema = Arc::new(Schema::new(vec![
Field::new("row_id", DataType::UInt64, false),
Field::new(
"state",
DataType::Dictionary(DataType::UInt8.into(),
DataType::LargeUtf8.into()),
true,
),
]));
let (ids, state_keys) = (1..=1000)
.map(|i| (i as u64, ((i + 10 * i) % STATES.len()) as u8))
.collect::<(Vec<u64>, Vec<u8>)>();
let state_values =
STATES.iter().map(Some).collect::<LargeStringArray>();
let batch = RecordBatch::try_new(
Arc::clone(&gub_schema),
vec![
Arc::new(UInt64Array::from(ids)),
Arc::new(DictionaryArray::new(
UInt8Array::from(state_keys),
Arc::new(state_values),
)),
],
)?;
MemTable::try_new(gub_schema, vec![vec![batch]])
}
/// Test evaluating equality of the `states.states` string column
against the `gub.states` dictionary column.
///
/// We run the same query twice. First we provide the coercion pathway
ourselves using the
/// `arrow_cast` function to demonstrate that there is a straightforward
coercion.
/// Then we evaluate the query without the explicit coercion to
demonstrate that the SQL
/// planner does not find this coercion.
#[tokio::test]
async fn join() -> DataFusionResult<()> {
let df: SessionContext = Default::default();
df.register_table("states",
Arc::new(create_table_with_text().await?))?;
df.register_table("gub",
Arc::new(create_table_with_dictionary().await?))?;
let expect_batch = vec![RecordBatch::try_new(
Arc::new(Schema::new(vec![Field::new(
"count(*)",
DataType::Int64,
false,
)])),
vec![Arc::new(Int64Array::from(vec![1000]))],
)?];
let query_cast = format!(
"SELECT count(*) FROM gub JOIN states ON states.rank =
((gub.row_id + 10 * gub.row_id) % {}) WHERE states.state =
arrow_cast(gub.state, 'Utf8')",
STATES.len()
);
let frame_cast = df.sql(&query_cast).await.expect("query_cast
error");
let batch_cast = frame_cast
.collect()
.await
.expect("query_cast evaluation error");
assert_eq!(batch_cast, expect_batch);
let query_nocast = format!(
"SELECT count(*) FROM gub JOIN states ON states.rank =
((gub.row_id + 10 * gub.row_id) % {}) WHERE states.state = gub.state",
STATES.len()
);
let frame_nocast = df.sql(&query_nocast).await.expect("query_nocast
error");
let batch_nocast = frame_nocast
.collect()
.await
.expect("query_nocast evaluation error");
assert_eq!(batch_nocast, expect_batch);
Ok(())
}
}
```
### Expected behavior
The SQL planner should find and deploy the coercion pathway on its own - the
author of the query should not have to deploy the `arrow_cast` function.
### Additional context
Arguably this is a feature request but I am filing it as a bug because I
have experienced it as such.
The reproducer above is adapted from a test I have written for my
application which creates the same data but using SQL rather than the
`MemTable`. That is, we have `CREATE TABLE states (rank INT, state TEXT)` and
then the other table uses a custom table provider which results in the
dictionary type.
In DataFusion 44, the SQL created the `state` column with Arrow data type
Utf8 or LargeUtf8.
In DataFusion 50, the SQL created the `state` column with Arrow data type
Utf8View.
This has broken my test because there previously was a coercion pathway for
comparing the `state` column against my Dictionary type column, but now there
is no such coercion pathway.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]