jecsand838 commented on issue #8928:
URL: https://github.com/apache/arrow-rs/issues/8928#issuecomment-3583779338
@EmilyMatt I carefully reviewed this issue + the code over the past few
hours, and I realized the child metadata required for `f1_1` was missing from
your reproduction description.
Including this metadata would look like this:
```rust
let mut md = HashMap::new();
md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns1".to_string());
md.insert(AVRO_NAME_METADATA_KEY.to_string(), "main".to_string());

// ======= Child Field Metadata
let mut f1_md = HashMap::new();
f1_md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns2".to_string());
f1_md.insert(AVRO_NAME_METADATA_KEY.to_string(), "record2".to_string());
// =======

let arrow_reader_schema = ArrowSchema::new_with_metadata(
    vec![
        ArrowField::new(
            "f1",
            DataType::Struct(Fields::from(vec![ArrowField::new(
                "f1_1",
                DataType::Utf8,
                false,
            )])),
            false,
        )
        .with_metadata(f1_md), // inject here
    ],
    md, // root schema-level metadata
);
```
I know this is burdensome and potentially heavy on the caller. The challenge
here has been interoperating Avro's type-naming system with the Arrow
`Schema`. I couldn't find another way to do it which:
1. Didn't involve using metadata.
2. Wouldn't cause explicit failures / extra complexity for callers just
wanting to encode Arrow data (not originally from an Avro source) into Avro.
The reason for the name-generation logic was to make the implementation less
rigid for callers going from pure Arrow (with the metadata missing) to Avro
(i.e. point 2). My thinking was that if there **wasn't** an original Avro
schema, then generating one in this manner wouldn't necessarily be incorrect,
so long as it was within spec and repeatable, with optional overrides (such as
via the metadata).
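To make the "repeatable with optional overrides" idea concrete, here is a minimal, std-only sketch of that kind of fallback: prefer an explicit metadata override, otherwise derive a deterministic, spec-valid name from the field name. The function name and the `"avro.name"` key literal are illustrative assumptions, not the actual `arrow-avro` implementation or constant.

```rust
use std::collections::HashMap;

/// Illustrative only: derive a deterministic Avro record type name for a
/// nested Arrow struct field, preferring an explicit metadata override.
/// The "avro.name" key literal is an assumption, not the real constant.
fn avro_record_name(field_name: &str, metadata: &HashMap<String, String>) -> String {
    // An explicit override (e.g. carried over from a prior Avro -> Arrow
    // conversion) always wins.
    if let Some(name) = metadata.get("avro.name") {
        return name.clone();
    }
    // Otherwise generate a spec-valid name from the field name: Avro names
    // must match [A-Za-z_][A-Za-z0-9_]*, so replace anything else with '_'.
    let mut generated: String = field_name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    if generated.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        generated.insert(0, '_');
    }
    generated
}

fn main() {
    let mut md = HashMap::new();
    md.insert("avro.name".to_string(), "record2".to_string());
    // With the override present, the original Avro type name survives.
    println!("{}", avro_record_name("f1", &md)); // prints "record2"
    // Without it, a deterministic, spec-valid name is generated.
    println!("{}", avro_record_name("2nd-field", &HashMap::new())); // prints "_2nd_field"
}
```

The key property is that the same Arrow input always yields the same Avro name, so a round-trip without metadata is stable even if it can't recover the original type names.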
Additionally, I included the Arrow `Schema`'s `SCHEMA_METADATA_KEY` logic as a
flexible alternative for callers who have the complete Avro schema, so they
can just inject it into the Arrow `Schema` for downstream use.
> Note: this is an approach that came in handy with the [DataFusion
work](https://github.com/apache/datafusion/pull/17861#pullrequestreview-3507689376).
Basically I was attempting to balance flexibility, practicality, and
correctness, which was somewhat challenging in this scenario.
I'm 100% open to (and hoping for) a better approach, and I had already planned
to refactor the Arrow `Schema` -> `AvroSchema` conversion logic in
`arrow-avro/src/schema.rs` anyway due to its brittleness. (Which is why I
didn't catch the missing metadata sooner lol. If there's a bug in `arrow-avro`,
I'm 98% confident it would pop up in that schema conversion logic, so I was
already expecting it.)
Anyway, when I run the reproducing test below in the
`arrow-avro/src/schema.rs` file, with the child metadata inserted, everything
passes and the reader and writer schemas match exactly.
```rust
#[test]
fn test_reader_schema_from_arrow_preserves_nested_field_name() {
    // Writer schema as described in the issue
    let writer_json = r#"{
        "namespace": "ns1",
        "name": "main",
        "type": "record",
        "fields": [
            {
                "name": "f1",
                "type": {
                    "type": "record",
                    "namespace": "ns2",
                    "name": "record2",
                    "fields": [
                        {
                            "name": "f1_1",
                            "type": "string"
                        }
                    ]
                }
            }
        ]
    }"#;
    // Parse the writer schema into the internal Avro representation
    let writer_schema: Schema =
        serde_json::from_str(writer_json).expect("writer schema JSON should parse");
    let mut md = HashMap::new();
    md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns1".to_string());
    md.insert(AVRO_NAME_METADATA_KEY.to_string(), "main".to_string());
    let mut f1_md = HashMap::new();
    f1_md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns2".to_string());
    f1_md.insert(AVRO_NAME_METADATA_KEY.to_string(), "record2".to_string());
    let arrow_reader_schema = ArrowSchema::new_with_metadata(
        vec![
            ArrowField::new(
                "f1",
                DataType::Struct(Fields::from(vec![ArrowField::new(
                    "f1_1",
                    DataType::Utf8,
                    false,
                )])),
                false,
            )
            .with_metadata(f1_md),
        ],
        md,
    );
    // Build the Avro reader schema from the Arrow schema
    let reader_avro = AvroSchema::try_from(&arrow_reader_schema)
        .expect("Arrow -> Avro schema conversion should succeed");
    let reader_schema: Schema = reader_avro
        .schema()
        .expect("generated Avro schema JSON should parse");
    // Both writer and reader should be top-level records
    let writer_record = match &writer_schema {
        Schema::Complex(ComplexType::Record(r)) => r,
        other => panic!("expected writer to be a record, got: {other:?}"),
    };
    let reader_record = match &reader_schema {
        Schema::Complex(ComplexType::Record(r)) => r,
        other => panic!("expected reader to be a record, got: {other:?}"),
    };
    // Sanity check: the writer field name is "f1"
    assert_eq!(writer_record.fields.len(), 1);
    assert_eq!(writer_record.fields[0].name, "f1");
    assert_eq!(
        reader_record.fields[0].name, "f1",
        "Reader schema must preserve the Arrow field name `f1` and \
         must not substitute the nested record's type name (`record2`) \
         as the field name"
    );
    println!("{reader_schema:#?}\n\n{writer_schema:#?}");
    // Critical assert: if the bug is present, the reader and writer schemas
    // will not match
    assert_eq!(reader_schema, writer_schema);
    let _resolved = AvroFieldBuilder::new(&writer_schema)
        .with_reader_schema(&reader_schema)
        .with_utf8view(false)
        .with_strict_mode(false)
        .build()
        .expect(
            "Writer and generated reader schemas should resolve without \
             field name mismatches",
        );
}
```
I do think more descriptive constant names in `schema.rs`, plus better docs
for them, would help here as well.
Let me know if you think this resolves the issue.
CC: @alamb @nathaniel-d-ef