[I] Coercion from `Dictionary(UInt8, LargeUtf8)` to `Utf8View` [datafusion]

via GitHub Thu, 11 Dec 2025 08:27:25 -0800


rroelke opened a new issue, #19290:
URL: https://github.com/apache/datafusion/issues/19290


   ### Describe the bug
   
   I have a query predicate `left = right` where `left` is a column with type 
`Utf8View` and `right` is a column with type `Dictionary(UInt8, LargeUtf8)`.  
The SQL planner returns `Err` for this query because it cannot find a coercion 
path for these two types. However, a coercion pathway exists: modifying the 
query to use `left = arrow_cast(right, 'LargeUtf8')` allows it to succeed.
   
   ### To Reproduce
   
   I am unable to attach a `.rs` file for some reason, so below is the contents 
of my reproducer which can very simply be compiled with datafusion as the only 
dependency.
   
   ```
   #[cfg(test)]
   mod tests {
       use std::sync::Arc;
   
       use datafusion::catalog::MemTable;
       use datafusion::common::Result as DataFusionResult;
       use datafusion::common::arrow::array::{
           DictionaryArray, Int64Array, LargeStringArray, StringViewArray, 
UInt8Array, UInt64Array,
       };
       use datafusion::common::arrow::datatypes::{DataType, Field, Schema};
       use datafusion::common::arrow::record_batch::RecordBatch;
       use datafusion::prelude::SessionContext;
   
       const STATES: [&str; 50] = [
           "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", 
"ID", "IL", "IN", "IA",
           "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", 
"NE", "NV", "NH", "NJ",
           "NM", "MY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", 
"TN", "TX", "UT", "VT",
           "VA", "WA", "WV", "WI", "WY",
       ];
   
       async fn create_table_with_text() -> DataFusionResult<MemTable> {
           let states_schema = Arc::new(Schema::new(vec![
               Field::new("rank", DataType::UInt64, false),
               Field::new("state", DataType::Utf8View, false),
           ]));
   
           let (ranks, states) = STATES
               .iter()
               .enumerate()
               .collect::<(Vec<usize>, Vec<&str>)>();
   
           let states_batch = RecordBatch::try_new(
               Arc::clone(&states_schema),
               vec![
                   Arc::new(ranks.iter().map(|r| *r as 
u64).collect::<UInt64Array>()),
                   
Arc::new(states.iter().map(Some).collect::<StringViewArray>()),
               ],
           )?;
   
           MemTable::try_new(states_schema, vec![vec![states_batch]])
       }
   
       async fn create_table_with_dictionary() -> DataFusionResult<MemTable> {
           let gub_schema = Arc::new(Schema::new(vec![
               Field::new("row_id", DataType::UInt64, false),
               Field::new(
                   "state",
                   DataType::Dictionary(DataType::UInt8.into(), 
DataType::LargeUtf8.into()),
                   true,
               ),
           ]));
   
           let (ids, state_keys) = (1..=1000)
               .map(|i| (i as u64, ((i + 10 * i) % STATES.len()) as u8))
               .collect::<(Vec<u64>, Vec<u8>)>();
           let state_values = 
STATES.iter().map(Some).collect::<LargeStringArray>();
   
           let batch = RecordBatch::try_new(
               Arc::clone(&gub_schema),
               vec![
                   Arc::new(UInt64Array::from(ids)),
                   Arc::new(DictionaryArray::new(
                       UInt8Array::from(state_keys),
                       Arc::new(state_values),
                   )),
               ],
           )?;
   
           MemTable::try_new(gub_schema, vec![vec![batch]])
       }   
           
       /// Test evaluating equality of the `states.states` string column 
against the `gub.states` dictionary column.
       ///
       /// We run the same query twice. First we provide the coercion pathway 
ourselves using the
       /// `arrow_cast` function to demonstrate that there is a straightforward 
coercion.
       /// Then we evaluate the query without the explicit coercion to 
demonstrate that the SQL
       /// planner does not find this coercion.
       #[tokio::test]
       async fn join() -> DataFusionResult<()> {
           let df: SessionContext = Default::default();
           df.register_table("states", 
Arc::new(create_table_with_text().await?))?;
           df.register_table("gub", 
Arc::new(create_table_with_dictionary().await?))?;
   
           let expect_batch = vec![RecordBatch::try_new(
               Arc::new(Schema::new(vec![Field::new(
                   "count(*)",
                   DataType::Int64,
                   false,
               )])),
               vec![Arc::new(Int64Array::from(vec![1000]))],
           )?];
   
           let query_cast = format!(
               "SELECT count(*) FROM gub JOIN states ON states.rank = 
((gub.row_id + 10 * gub.row_id) % {}) WHERE states.state = 
arrow_cast(gub.state, 'Utf8')",
               STATES.len()
           );
           let frame_cast = df.sql(&query_cast).await.expect("query_cast 
error");
           let batch_cast = frame_cast
               .collect()
               .await
               .expect("query_cast evaluation error");
           assert_eq!(batch_cast, expect_batch);
   
           let query_nocast = format!(
               "SELECT count(*) FROM gub JOIN states ON states.rank = 
((gub.row_id + 10 * gub.row_id) % {}) WHERE states.state = gub.state",
               STATES.len()
           );
           let frame_nocast = df.sql(&query_nocast).await.expect("query_nocast 
error");
           let batch_nocast = frame_nocast
               .collect()
               .await
               .expect("query_nocast evaluation error");
           assert_eq!(batch_nocast, expect_batch);
   
           Ok(())
       }
   }
   ```
   
   ### Expected behavior
   
   The SQL planner should find and deploy the coercion pathway on its own - the 
author of the query should not have to deploy the `arrow_cast` function.
   
   ### Additional context
   
   Arguably this is a feature request but I am filing it as a bug because I 
have experienced it as such.
   
   The reproducer above is adapted from a test I have written for my 
application which creates the same data but using SQL rather than the 
`MemTable`.  That is, we have `CREATE TABLE states (rank INT, state TEXT)` and 
then the other table uses a custom table provider which results in the 
dictionary type.
   
   In DataFusion 44, the SQL created the `state` column with Arrow data type 
Utf8 or LargeUtf8.
   In DataFusion 50, the SQL created the `state` column with Arrow data type 
Utf8View.
   
   This has broken my test because there previously was a coercion pathway for 
comparing the `state` column against my Dictionary type column, but now there 
is no such coercion pathway.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Coercion from `Dictionary(UInt8, LargeUtf8)` to `Utf8View` [datafusion]

Reply via email to