alamb commented on issue #8840: URL: https://github.com/apache/arrow-rs/issues/8840#issuecomment-3536458954
So the topic (and pain) of schema merging comes up a bunch in DataFusion (and perhaps elsewhere).

For example, here is the code in DataFusion that handles schema merging for Parquet: https://github.com/apache/datafusion/blob/6ab4d216b768c9327982e59376a62a29c69ca436/datafusion/datasource-parquet/src/file_format.rs#L406-L421

We also have a similar challenge when comparing schemas (in some cases a field that is less nullable than another should still be considered compatible).

I am not quite sure what merging logic belongs in what crate (e.g. I don't have a sense for whether there is a broadly agreed upon definition of what schema merging means for schemas outside the schema evolution context of DataFusion).

Thus what I suggest is:
1. We start by moving schema merging logic into DataFusion, and iterate there until we get it right
2. Then consider whether the logic belongs upstream in arrow-rs, where we can commit to an API

Given all the various potential options, the API I would suggest is some sort of Merger structure. Something like

```rust
// set various options builder style
let mut schema_merger = SchemaMerger::new()
    .with_preserve_nulls(true);
// try to merge the schemas
schema_merger.try_merge(schemas)?;
// get the built schema
let merged_schema = schema_merger.build()?;
```
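To make the builder-style shape suggested above a bit more concrete, here is a minimal, self-contained sketch. It is not an existing arrow-rs or DataFusion API: `Field` and `Schema` are toy stand-ins for the real Arrow types, `merge_example` is a hypothetical helper, and the meaning given to `with_preserve_nulls` (widen a field to nullable when inputs disagree if `true`, reject the mismatch if `false`) is purely an assumption for illustration, since the semantics of that option are still to be decided.

```rust
/// Toy stand-in for an Arrow field; NOT the real `arrow_schema::Field`.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

impl Field {
    pub fn new(name: &str, data_type: &str, nullable: bool) -> Self {
        Self { name: name.to_string(), data_type: data_type.to_string(), nullable }
    }
}

/// Toy stand-in for an Arrow schema: an ordered list of fields.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Schema {
    pub fields: Vec<Field>,
}

/// Hypothetical merger following the builder shape from the comment.
#[derive(Debug, Default)]
pub struct SchemaMerger {
    // Assumed meaning: `true` widens to nullable on a mismatch, `false` errors.
    preserve_nulls: bool,
    fields: Vec<Field>,
}

impl SchemaMerger {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_preserve_nulls(mut self, preserve_nulls: bool) -> Self {
        self.preserve_nulls = preserve_nulls;
        self
    }

    /// Fold the given schemas into the merger, matching fields by name.
    pub fn try_merge<'a>(
        &mut self,
        schemas: impl IntoIterator<Item = &'a Schema>,
    ) -> Result<(), String> {
        for schema in schemas {
            for field in &schema.fields {
                match self.fields.iter().position(|f| f.name == field.name) {
                    // First time this name is seen: keep the field as-is.
                    None => self.fields.push(field.clone()),
                    Some(idx) => {
                        let existing = &mut self.fields[idx];
                        if existing.data_type != field.data_type {
                            return Err(format!(
                                "field '{}': incompatible types {} and {}",
                                field.name, existing.data_type, field.data_type
                            ));
                        }
                        if existing.nullable != field.nullable {
                            if !self.preserve_nulls {
                                return Err(format!(
                                    "field '{}': nullability mismatch",
                                    field.name
                                ));
                            }
                            // Widen: nullable if any input was nullable.
                            existing.nullable = true;
                        }
                    }
                }
            }
        }
        Ok(())
    }

    /// Consume the merger and return the merged schema.
    pub fn build(self) -> Result<Schema, String> {
        Ok(Schema { fields: self.fields })
    }
}

/// Hypothetical usage: merge the schemas of two files.
pub fn merge_example() -> Result<Schema, String> {
    let file_a = Schema {
        fields: vec![Field::new("id", "Int64", false), Field::new("name", "Utf8", true)],
    };
    let file_b = Schema {
        fields: vec![Field::new("id", "Int64", true), Field::new("score", "Float64", true)],
    };
    let mut schema_merger = SchemaMerger::new().with_preserve_nulls(true);
    schema_merger.try_merge([&file_a, &file_b])?;
    schema_merger.build()
}
```

Keeping the options on the merger rather than on a free function is what would let DataFusion iterate on the rules (nullability, type widening, metadata) without changing call sites, which seems like the main appeal of the builder shape.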
