ahmedabu98 commented on issue #38055:
URL: https://github.com/apache/beam/issues/38055#issuecomment-4284684673

   There are a few moving pieces to building out this feature. The IcebergIO 
sink, unlike BigQueryIO, does not operate on JSON-like objects. It works on Beam 
Rows, which carry a fixed Schema. This means all the Rows in the input 
PCollection must have the same fixed schema.
   
   This is an obstacle if we're trying to express different schemas in the same 
PCollection.
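
   To make that constraint concrete, here's a plain-Python sketch (not the actual Beam API; the schema representation and names are illustrative) of what "fixed schema" means in practice, every row must carry exactly the same set of typed fields, so a row destined for a differently-shaped table is rejected up front:

   ```python
   # Illustrative only: a dict-of-types stands in for a Beam Schema.
   FIXED_SCHEMA = {"id": int, "name": str}

   def validate(row: dict, schema: dict) -> None:
       """Reject rows whose fields don't exactly match the fixed schema."""
       if set(row) != set(schema):
           raise ValueError(
               f"row fields {sorted(row)} != schema fields {sorted(schema)}")
       for field, typ in schema.items():
           if not isinstance(row[field], typ):
               raise TypeError(f"field {field!r} is not {typ.__name__}")

   validate({"id": 1, "name": "a"}, FIXED_SCHEMA)    # conforms, passes
   try:
       # extra/missing columns (a "different schema") are rejected
       validate({"id": 2, "extra": "x"}, FIXED_SCHEMA)
   except ValueError as e:
       print("rejected:", e)
   ```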
   To get around this, we will likely need to introduce a new semi-structured 
type in Beam Row that can hold variable columns depending on the 
destination. I think the Variant type is a good candidate for this (it's 
already implemented in 
[Parquet](https://parquet.apache.org/docs/file-format/types/variantencoding/), 
[Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.VariantType.html),
 and other places).
   
   I'd consider that a prerequisite to this feature request (created #38251 to 
track it). Once we have a semi-structured type, sources can express 
varying fields, and sinks can parse those fields dynamically and 
reconstruct them into different schemas.
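
   As a rough sketch of that last step (again plain Python, not Beam; the variant-as-JSON representation and all names here are hypothetical), a sink could parse each record's variant field, infer the schema it implies, and regroup records so each destination ends up with one fixed schema:

   ```python
   import json
   from collections import defaultdict

   def schema_key(variant: dict) -> tuple:
       """A hashable fingerprint of the schema implied by a variant payload."""
       return tuple(sorted((k, type(v).__name__) for k, v in variant.items()))

   def regroup(records):
       """Split records by inferred variant schema, flattening the variant
       columns back into the row (one fixed schema per group)."""
       groups = defaultdict(list)
       for rec in records:
           variant = json.loads(rec["variant"])
           fixed = {k: v for k, v in rec.items() if k != "variant"}
           groups[schema_key(variant)].append({**fixed, **variant})
       return dict(groups)

   records = [
       {"dest": "a", "variant": json.dumps({"x": 1})},
       {"dest": "b", "variant": json.dumps({"y": "s"})},
       {"dest": "a", "variant": json.dumps({"x": 2})},
   ]
   # Two distinct inferred schemas -> two groups with fixed schemas.
   groups = regroup(records)
   ```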


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
