rdblue commented on pull request #1046: URL: https://github.com/apache/iceberg/pull/1046#issuecomment-635442419
@teabot, I think the main problem with unions is support in processing engines. It is unlikely that Spark, Presto, Impala, etc. will add support for unions. Since we want Iceberg to be suitable as a common at-rest store, it doesn't make sense to have a type that requires workarounds in those engines but has only a small benefit (ensuring that only one option is non-null).

I think it is unlikely that processing engines will support unions because it isn't clear how users would interact with them in SQL. For example, how do I filter to just the records with a particular option of the union? That might seem easy, but it exposes underlying problems with unions and schema evolution, like [identifying union fields](https://github.com/apache/iceberg/pull/1046#pullrequestreview-416446251). If we generate names based on position, what happens when that position changes? If we do it based on ID, then we're exposing internal IDs to users. Also, what if a file is written with a version of the schema that has a new union option that isn't in the table schema? Do we choose another, incorrect branch (null or default), or do we throw an exception?

I think the right choice is to continue using the more standard and well-defined types rather than adding union, since adding union would make it much harder to integrate Iceberg into processing engines.
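
To make the naming concern concrete, here is a rough sketch in plain Python. It is not Iceberg API: `position_names`, the `member<N>` naming scheme, and the `as_int` / `as_string` field names are all hypothetical, chosen only to illustrate the two approaches discussed above.

```python
from typing import Any, Dict, List, Optional


def position_names(options: List[str]) -> Dict[str, str]:
    """Hypothetical scheme: name each union option after its position."""
    return {opt: f"member{i}" for i, opt in enumerate(options)}


# Schema v1 has the union (int, string); v2 adds a new option at the front.
v1 = position_names(["int", "string"])
v2 = position_names(["long", "int", "string"])

# A query written against v1 that filters on the "int" option uses this name...
query_field = v1["int"]

# ...but under v2 the same generated name points at a different option.
v2_by_name = {name: opt for opt, name in v2.items()}
option_now_selected = v2_by_name[query_field]


def exactly_one_set(record: Dict[str, Optional[Any]]) -> bool:
    """The one guarantee a union adds over a struct of optional fields."""
    return sum(value is not None for value in record.values()) == 1


# The alternative: a struct of optional fields with stable, user-chosen names.
records = [
    {"as_int": 1, "as_string": None},
    {"as_int": None, "as_string": "a"},
    {"as_int": 2, "as_string": None},
]

# Filtering to one "option" is an ordinary IS NOT NULL predicate.
ints = [r["as_int"] for r in records if r["as_int"] is not None]
```

With the struct form, the filter maps directly onto SQL that engines already support, and the only thing lost is the `exactly_one_set` check, which a writer could still enforce if needed.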
