ahmedabu98 commented on issue #38055: URL: https://github.com/apache/beam/issues/38055#issuecomment-4284684673
There are a few moving pieces to building out this feature. The IcebergIO sink, unlike BigQueryIO, does not work on JSON-like objects; it works on Beam Rows, which are fixed-Schema objects. This means all the Rows in the input PCollection need to share the same fixed schema, which is an obstacle if we're trying to express different schemas in the same PCollection.

To get around this, we will likely need to introduce a new semi-structured type in Beam Row that can contain variable columns depending on the destination. I'm thinking the Variant type is a good candidate for this (it's implemented in [Parquet](https://parquet.apache.org/docs/file-format/types/variantencoding/), [Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.VariantType.html), and other places). I'd consider that a prerequisite to this feature request (created #38251 to track it).

Once we have a semi-structured type, we can allow sources to express varying fields, and allow sinks to parse those fields dynamically and reconstruct them into different schemas.
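
To illustrate the fixed-schema constraint described above, here is a minimal sketch using Beam's existing `Schema`/`Row` API. The field names and the `destination` column are purely illustrative, and the Variant-style column mentioned in the comments is hypothetical until something like #38251 exists:

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class FixedSchemaSketch {
  public static void main(String[] args) {
    // Today, every Row in the PCollection fed to the IcebergIO sink must
    // conform to one Schema that is declared up front.
    Schema schema =
        Schema.builder()
            .addStringField("destination") // illustrative routing field
            .addInt64Field("id")
            .addStringField("name")
            .build();

    Row row = Row.withSchema(schema).addValues("db.table_a", 1L, "alice").build();
    System.out.println(row);

    // A record bound for another destination with an extra column (say "age")
    // cannot be represented with this Schema; it would need a different Schema,
    // which the same input PCollection cannot carry today. A semi-structured
    // Variant-style field would let such per-destination columns travel through
    // a single PCollection and be re-expanded by the sink.
  }
}
```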
