jecsand838 commented on issue #8928:
URL: https://github.com/apache/arrow-rs/issues/8928#issuecomment-3583779338
@EmilyMatt I carefully reviewed this issue + the code over the past few
hours, and I realized the child metadata required for `f1_1` was missing from
your reproduction description.
Including this metadata would look like this:
```rust
let mut md = HashMap::new();
md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns1".to_string());
md.insert(AVRO_NAME_METADATA_KEY.to_string(), "main".to_string());

// ======= Child Field Metadata
let mut f1_md = HashMap::new();
f1_md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns2".to_string());
f1_md.insert(AVRO_NAME_METADATA_KEY.to_string(), "record2".to_string());
// =======

let arrow_reader_schema = ArrowSchema::new_with_metadata(
    vec![
        ArrowField::new(
            "f1",
            DataType::Struct(Fields::from(vec![ArrowField::new(
                "f1_1",
                DataType::Utf8,
                false,
            )])),
            false,
        )
        .with_metadata(f1_md), // inject here
    ],
    md, // root schema-level metadata
);
```
I know this is burdensome and potentially heavy on the caller. The challenge
here has been interoperating Avro's type-naming system with the Arrow
`Schema`. I couldn't find another way to do it which:
1. Didn't involve using metadata.
2. Wouldn't cause explicit failures / extra complexity for callers just
wanting to encode Arrow data (not originally from an Avro source) into Avro.
The reason for the name-generation logic was to make the implementation less
rigid for callers going from pure Arrow (with the metadata missing) to Avro
(i.e. point 2). My thinking was that if there **wasn't** an original Avro
schema, then generating one in this manner wouldn't necessarily be incorrect,
so long as it was within spec and repeatable, with optional overrides (such as
via the metadata).
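To make the "repeatable with optional overrides" idea concrete, here is a minimal, std-only sketch of that kind of fallback: prefer an explicit metadata override, otherwise derive a deterministic, spec-valid name from the field name. The function name and the `"avro.name"` key literal are illustrative assumptions, not the actual `arrow-avro` implementation or constant.

```rust
use std::collections::HashMap;

/// Illustrative only: derive a deterministic Avro record type name for a
/// nested Arrow struct field, preferring an explicit metadata override.
/// The "avro.name" key literal is an assumption, not the real constant.
fn avro_record_name(field_name: &str, metadata: &HashMap<String, String>) -> String {
    // An explicit override (e.g. carried over from a prior Avro -> Arrow
    // conversion) always wins.
    if let Some(name) = metadata.get("avro.name") {
        return name.clone();
    }
    // Otherwise generate a spec-valid name from the field name: Avro names
    // must match [A-Za-z_][A-Za-z0-9_]*, so replace anything else with '_'.
    let mut generated: String = field_name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    if generated.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        generated.insert(0, '_');
    }
    generated
}

fn main() {
    let mut md = HashMap::new();
    md.insert("avro.name".to_string(), "record2".to_string());
    // With the override present, the original Avro type name survives.
    println!("{}", avro_record_name("f1", &md)); // prints "record2"
    // Without it, a deterministic, spec-valid name is generated.
    println!("{}", avro_record_name("2nd-field", &HashMap::new())); // prints "_2nd_field"
}
```

The key property is that the same Arrow input always yields the same Avro name, so a round-trip without metadata is stable even if it can't recover the original type names.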
Additionally, I included the Arrow `Schema`'s `SCHEMA_METADATA_KEY` logic as a
flexible alternative for callers who have the complete Avro schema, so they
can just inject it into the Arrow `Schema` for downstream use.
> Note: this is an approach that came in handy with the [DataFusion
work](https://github.com/apache/datafusion/pull/17861#pullrequestreview-3507689376).
Basically I was attempting to balance flexibility, practicality, and
correctness, which was somewhat challenging in this scenario.
I'm 100% open to (and hoping for) a better approach, and I had already planned
to refactor the Arrow `Schema` -> `AvroSchema` conversion logic in
`arrow-avro/src/schema.rs` anyway due to its brittleness. (Which is why I
didn't catch the missing metadata sooner lol. If there's a bug in `arrow-avro`,
I'm 98% confident it would pop up in that schema conversion logic, so I was
already expecting it.)
Anyway, when I run the reproducing test below in the
`arrow-avro/src/schema.rs` file, with the child metadata inserted, everything
passes and the reader and writer schemas match exactly.
```rust
#[test]
fn test_reader_schema_from_arrow_preserves_nested_field_name() {
    // Writer schema as described in the issue
    let writer_json = r#"{
        "namespace": "ns1",
        "name": "main",
        "type": "record",
        "fields": [
            {
                "name": "f1",
                "type": {
                    "type": "record",
                    "namespace": "ns2",
                    "name": "record2",
                    "fields": [
                        {
                            "name": "f1_1",
                            "type": "string"
                        }
                    ]
                }
            }
        ]
    }"#;
    // Parse the writer schema into the internal Avro representation
    let writer_schema: Schema =
        serde_json::from_str(writer_json).expect("writer schema JSON should parse");
    let mut md = HashMap::new();
    md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns1".to_string());
    md.insert(AVRO_NAME_METADATA_KEY.to_string(), "main".to_string());
    let mut f1_md = HashMap::new();
    f1_md.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), "ns2".to_string());
    f1_md.insert(AVRO_NAME_METADATA_KEY.to_string(), "record2".to_string());
    let arrow_reader_schema = ArrowSchema::new_with_metadata(
        vec![
            ArrowField::new(
                "f1",
                DataType::Struct(Fields::from(vec![ArrowField::new(
                    "f1_1",
                    DataType::Utf8,
                    false,
                )])),
                false,
            )
            .with_metadata(f1_md),
        ],
        md,
    );
    // Build the Avro reader schema from the Arrow schema
    let reader_avro = AvroSchema::try_from(&arrow_reader_schema)
        .expect("Arrow -> Avro schema conversion should succeed");
    let reader_schema: Schema = reader_avro
        .schema()
        .expect("generated Avro schema JSON should parse");
    // Both writer and reader should be top-level records
    let writer_record = match &writer_schema {
        Schema::Complex(ComplexType::Record(r)) => r,
        other => panic!("expected writer to be a record, got: {other:?}"),
    };
    let reader_record = match &reader_schema {
        Schema::Complex(ComplexType::Record(r)) => r,
        other => panic!("expected reader to be a record, got: {other:?}"),
    };
    // Sanity check: the writer field name is "f1"
    assert_eq!(writer_record.fields.len(), 1);
    assert_eq!(writer_record.fields[0].name, "f1");
    assert_eq!(
        reader_record.fields[0].name, "f1",
        "Reader schema must preserve the Arrow field name `f1` and \
         must not substitute the nested record's type name (`record2`) \
         as the field name"
    );
    println!("{reader_schema:#?}\n\n{writer_schema:#?}");
    // Critical assert: if the bug is present, the reader and writer schemas
    // will not match
    assert_eq!(reader_schema, writer_schema);
    let _resolved = AvroFieldBuilder::new(&writer_schema)
        .with_reader_schema(&reader_schema)
        .with_utf8view(false)
        .with_strict_mode(false)
        .build()
        .expect(
            "Writer and generated reader schemas should resolve without \
             field name mismatches",
        );
}
```
I do think more descriptive constant names in `schema.rs`, plus better docs
for them, would help here as well.
Let me know if you think this resolves the issue.
CC: @alamb @nathaniel-d-ef