dispanser opened a new issue, #20011:
URL: https://github.com/apache/datafusion/issues/20011

   ### Describe the bug
   
   We're currently upgrading from datafusion 45 to datafusion 46 :see-no-evil: 
and came across an issue related to dictionary ids during protobuf serde.
   Here's a small reproducer, and this seems to still be a problem on current 
datafusion:
   
   The following test works on df-45 but fails on df-46 and all later releases, 
added to the bottom of 
`datafusion/proto/tests/cases/roundtrip_physical_plan.rs`:
   ```rust
   #[test]
   fn roundtrip_call_null_scalar_struct_dict() {
       let data_type = DataType::Struct(Fields::from(vec![Field::new(
           "item",
           DataType::Dictionary(Box::new(DataType::UInt32), 
Box::new(DataType::Utf8)),
           true,
       )]));
       let schema = Arc::new(Schema::new(Fields::from([Arc::new(Field::new(
           "a",
           data_type.clone(),
           true,
       ))])));
       let scan = Arc::new(EmptyExec::new(schema.clone()));
       let scalar = lit(ScalarValue::try_from(data_type).unwrap());
       let filter = Arc::new(
           FilterExec::try_new(
               Arc::new(BinaryExpr::new(
                   scalar,
                   datafusion::logical_expr::Operator::Eq,
                   col("a", &schema).unwrap(),
               )),
               scan,
           )
           .unwrap(),
       );
       roundtrip_test(filter).expect("roundtrip");
   }
   ```
   
   In df46, this fails during deserialization of protobuf back to the physical 
plan:
   ```
   thread 
'cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict' 
panicked at datafusion/proto/tests/cases/roundtrip_physical_plan.rs:140:10:
   from proto: Plan("DataFusion error: ArrowError(SchemaError(\"Invalid data 
for schema. Field { name: \\\"item\\\", data_type: Dictionary(UInt32, Utf8), 
nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } refers to 
node not found in schema\"), None)")
   ```
   
   In df52, it already fails during serialization, as there's no pre-allocated 
dict id available from the dictionary tracker
   ```
   thread 
'cases::roundtrip_physical_plan::roundtrip_call_null_scalar_struct_dict' 
(145817) panicked at 
datafusion/proto/tests/cases/roundtrip_physical_plan.rs:148:14:
   to proto: Plan("General error: Error encoding ScalarValue::List as IPC: Ipc 
error: no dict id for field item")
   ```
   
   I am a bit puzzled on how that is supposed to work. From what I can tell, 
the `DictionaryTracker` seems to be initialized empty and then never modified - 
I don't see a possible code path where a dict_id is even created.
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to