mzabaluev opened a new issue, #9575: URL: https://github.com/apache/arrow-rs/issues/9575
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

In applications implementing evolution of data schemas abstracted above Avro, such as Iceberg, there is a need to resolve the Arrow schema that is required for the output record batches against the writer schema in the Avro files. One example is datafusion-comet, where Spark passes down a SQL schema to read data from Avro files, which is converted to an Arrow schema for the "native" reader written in Rust, which streams RecordBatch items with that schema.

**Describe the solution you'd like**

A utility function to resolve the reader schema from these arguments:

1. The Avro writer schema, which can be read from e.g. an OCF file using the `HeaderInfo` API added in #9548,
2. The required Arrow schema for the output batches.

The resulting Avro schema can be used to construct an (async) Avro reader using the `build_with_header` method.

The behavior of the recursive schema resolution, where it differs from Avro resolution rules:

* Struct/record fields are matched by name. Reordering of fields is allowed and should be reported as a modification of the writer schema.
* Fields in the Arrow schema not present in the writer schema are added to the reader schema with the conventional nullable union type and the default value of `null`. It is an error if an added field is not nullable.
* As Arrow struct types do not have names, the resulting record type in the reader schema receives the name and namespace attributes of the corresponding record in the writer schema.
* The name of an Arrow list item field is not attested in the corresponding Avro array schema. The record batch will have the "item" list field as [currently hardcoded](https://github.com/apache/arrow-rs/blob/88422cbdcbfa8f4e2411d66578dd3582fafbf2a1/arrow-avro/src/reader/record.rs#L462). A metadata approach for round-tripping can be further proposed.
* Likewise, the names in the KV struct field making a map are not attested in Avro.

The resolution function should provide a way to determine if the resulting schema differs from the writer schema, so that the exact match could be processed in a fast path with no schema resolution.

**Describe alternatives you've considered**

The required Arrow schema could be passed to a builder method while constructing the reader, directly becoming the schema for the produced record batches. This would create duality with the reader schema option, and it's not clear how the two options would interact if used together.

**Additional context**

The wider topic of schema evolution and other adaptations, not specific to Avro, is discussed in #6735.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
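
For illustration, the record-level resolution rules listed in the issue (matching by name, reporting reordering as a modification, adding missing nullable fields as a `["null", T]` union with a `null` default, and inheriting the record name and namespace from the writer schema) could be sketched roughly as follows. This is only a toy model under stated assumptions: `ArrowType`, `ArrowField`, `AvroSchema`, `AvroField`, `resolve`, `field`, and `example_writer` are hypothetical stand-ins defined here, not the `arrow-schema` or `arrow-avro` types, and the proposed utility's actual signature is not specified in the issue. The sketch also assumes that dropping a writer field counts as a modification, which the issue does not state explicitly, and it leaves out lists, maps, and nullable unions in the writer schema.

```rust
// Toy stand-ins for Arrow and Avro schema types (hypothetical, std only).
#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Int64,
    Utf8,
    Struct(Vec<ArrowField>),
}

#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    ty: ArrowType,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
enum AvroSchema {
    Null,
    Long,
    String,
    Union(Vec<AvroSchema>),
    Record { name: String, namespace: Option<String>, fields: Vec<AvroField> },
}

#[derive(Debug, Clone, PartialEq)]
struct AvroField {
    name: String,
    schema: AvroSchema,
    default_null: bool, // true when the field carries `"default": null`
}

/// Returns the reader schema and a flag telling whether it differs from the
/// writer schema, so that callers can take a fast path on an exact match.
fn resolve(writer: &AvroSchema, arrow: &ArrowType) -> Result<(AvroSchema, bool), String> {
    match (writer, arrow) {
        (AvroSchema::Long, ArrowType::Int64) | (AvroSchema::String, ArrowType::Utf8) => {
            Ok((writer.clone(), false))
        }
        (AvroSchema::Record { name, namespace, fields }, ArrowType::Struct(wanted)) => {
            // Assumption: dropping writer fields also counts as a modification.
            let mut modified = wanted.len() != fields.len();
            let mut out = Vec::with_capacity(wanted.len());
            for (pos, af) in wanted.iter().enumerate() {
                match fields.iter().position(|wf| wf.name == af.name) {
                    // Matched by name: recurse, and flag any reordering.
                    Some(idx) => {
                        let (schema, changed) = resolve(&fields[idx].schema, &af.ty)?;
                        modified |= changed || idx != pos;
                        out.push(AvroField { schema, ..fields[idx].clone() });
                    }
                    // Missing in the writer: add as ["null", T] with default null.
                    None if af.nullable => {
                        let branch = match &af.ty {
                            ArrowType::Int64 => AvroSchema::Long,
                            ArrowType::Utf8 => AvroSchema::String,
                            ArrowType::Struct(_) => {
                                return Err(format!("added struct field `{}` is outside this sketch", af.name));
                            }
                        };
                        modified = true;
                        out.push(AvroField {
                            name: af.name.clone(),
                            schema: AvroSchema::Union(vec![AvroSchema::Null, branch]),
                            default_null: true,
                        });
                    }
                    None => return Err(format!("added field `{}` must be nullable", af.name)),
                }
            }
            // The record keeps the writer's name and namespace.
            let record = AvroSchema::Record {
                name: name.clone(),
                namespace: namespace.clone(),
                fields: out,
            };
            Ok((record, modified))
        }
        _ => Err(format!("cannot resolve {:?} against {:?}", writer, arrow)),
    }
}

/// Convenience constructor for the toy Arrow field.
fn field(name: &str, ty: ArrowType, nullable: bool) -> ArrowField {
    ArrowField { name: name.to_string(), ty, nullable }
}

/// A small writer schema used for illustration: record `example.User { id, name }`.
fn example_writer() -> AvroSchema {
    AvroSchema::Record {
        name: "User".to_string(),
        namespace: Some("example".to_string()),
        fields: vec![
            AvroField { name: "id".to_string(), schema: AvroSchema::Long, default_null: false },
            AvroField { name: "name".to_string(), schema: AvroSchema::String, default_null: false },
        ],
    }
}
```

In this model, resolving `example_writer()` against an Arrow struct with the same fields in the same order yields the writer schema unchanged with the flag set to `false`, while swapping the two fields or requesting an extra nullable field sets the flag to `true`. Requesting an extra non-nullable field returns an error, mirroring the rule stated in the issue.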
