kbendick commented on issue #5000: URL: https://github.com/apache/iceberg/issues/5000#issuecomment-1154123892
So far, I want to say that I like a lot about where this is going @wuwenchi. I think working on the UDF transforms would be a good first step, as those would be beneficial regardless of how we proceed.

> I think this has no effect on other engines, because watermarks and computed columns themselves do not actually store data; they just add some logical processing when querying, and that logic only takes effect in Flink. What is actually stored is the data of the original physical columns, and the related format has not changed.

My one concern here is that users often use one engine, say Flink, for writing and then other engines later for processing. It might be the case that people want these computed columns reflected in the data (possibly even stored, though in the case of partition transforms and partitions in general, a partition field's value isn't generally stored multiple times).

It might also be the case that we cannot do certain things with anything other than Flink; watermarks might be one of those. Still, there could be steps we take to make as much of this information as possible available to downstream consumers. For example, it has been discussed before to use the Iceberg sequence number as a form of watermark, as it's generally monotonically increasing. While other engines might not have native support for watermarks, at least having the data available would be beneficial.

TL;DR: Again, I think the UDFs you mentioned would be great to work on first, as those have value regardless of how we proceed next (for example, users might want to query with Iceberg's bucket function as a way to more narrowly specify a subset of data to perform an action on).

> Of course, maybe the above things can be implemented with Calcite, but since I am not particularly familiar with Calcite, the implementation may be more complicated, and it may also require the cooperation of Flink, so we prefer to use this simpler way.
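To make the "query with Iceberg's bucket function" idea concrete, here is a minimal sketch of what exposing the transform as a UDF buys you. Note the hedge: Iceberg's actual bucket transform hashes the serialized value with 32-bit Murmur3 and takes `(hash & Integer.MAX_VALUE) % N`; this sketch substitutes a standard-library hash purely so it runs without dependencies, so the bucket assignments will not match a real Iceberg table.

```python
import hashlib

def bucket(n: int, value: str) -> int:
    """Illustrative stand-in for Iceberg's bucket transform.

    The real transform uses 32-bit Murmur3 over the serialized value;
    SHA-256 is used here only so the sketch runs with the standard
    library. The shape is the same: mask to non-negative, then mod n.
    """
    h = int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:4], "big")
    return (h & 0x7FFFFFFF) % n

# If the query engine exposes the same function the table was
# partitioned with, a user can target a single bucket, e.g. "only the
# rows whose id falls in shard 3 of 16":
rows = [{"id": f"user-{i}"} for i in range(100)]
shard_3 = [r for r in rows if bucket(16, r["id"]) == 3]

# Every selected row maps to the same bucket, so an engine could prune
# all other partitions instead of scanning the full table.
assert all(bucket(16, r["id"]) == 3 for r in shard_3)
```

The point is that the predicate and the partition layout agree because both call the same transform, which is what makes partition pruning on the engine side possible.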
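The sequence-number-as-watermark idea above can also be sketched. The snapshot structure and field names here are simplified assumptions for illustration, not Iceberg's actual metadata model; the only property the sketch relies on is the one stated above, that sequence numbers are generally monotonically increasing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    # Simplified stand-in for a table snapshot: the sequence number is
    # assigned monotonically as commits happen (assumption per the
    # discussion above; not Iceberg's real metadata classes).
    sequence_number: int
    timestamp_ms: int

def new_snapshots(snapshots, watermark: int):
    """Treat the highest consumed sequence number as a watermark.

    Snapshots strictly above the watermark are unread; because
    sequence numbers increase monotonically, this is a safe progress
    marker even in engines without native watermark support.
    """
    return sorted(
        (s for s in snapshots if s.sequence_number > watermark),
        key=lambda s: s.sequence_number,
    )

log = [Snapshot(1, 1000), Snapshot(2, 2000), Snapshot(3, 3000)]
unread = new_snapshots(log, watermark=1)
# After processing, the consumer advances its watermark to the max
# sequence number it has seen (3 here) and repeats.
```

This is what "at least having the data available" means in practice: any engine that can read the sequence number can implement this loop, even without first-class watermarks.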
I don't have much knowledge of Calcite either, but in the medium to long term I think it would be good to reach out to the Flink community about supporting some of these concepts more natively.

My concern with using column names directly to infer the function is that many users might already have columns with those names (`_years`, etc.) in their data. With the UDF approach, that issue goes away, since users can already choose the partition column name, for example in Spark via `ALTER TABLE … ADD PARTITION FIELD bucket(16, id) AS shard`.

Would you be interested in doing a POC or PR of just the transformation functions at first? Then at the community sync-up we can bring this up and possibly form a working group to get input from others on this subject 🙂

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org