kbendick commented on issue #5000: URL: https://github.com/apache/iceberg/issues/5000#issuecomment-1154123892
So far, I want to say that I like a lot about where this is going @wuwenchi. I think working on the UDF transforms would be a good first step, as those would be beneficial regardless of how we proceed.

> I think this has no effect on other engines, because watermarks and computed columns themselves do not actually store data; they just add some logical processing when querying, and that logic only takes effect in Flink. What is actually stored is the data of the original physical columns, and the related format has not changed.

My one concern here is that users often use one engine, say Flink, for writing and then other engines later for processing. It might be the case that people want these computed columns reflected in the data (possibly even stored, though in the case of partition transforms and partitions in general, a partition field's value isn't generally stored multiple times).

It might also be the case that we cannot do certain things with anything other than Flink; watermarks might be one of those. Still, there could be steps we take to make as much of this information as possible available to downstream consumers. For example, it has been discussed before to use the Iceberg sequence number as a form of watermark, as it's generally monotonically increasing. While other engines might not have native support for watermarks, at least having the data available would be beneficial.

TL;DR: Again, I think the UDFs you mentioned would be great to work on first, as those have value regardless of how we proceed next (for example, users might want to query with Iceberg's bucket function as a way to more narrowly specify a subset of data to perform an action on).

> Of course, maybe the above things can be implemented with Calcite, but since I am not particularly familiar with Calcite, the implementation may be more complicated, and it may also require the cooperation of Flink, so we prefer to use this simpler way.
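To make the "query with Iceberg's bucket function" idea concrete, here is a minimal sketch of what exposing the transform as a UDF buys you. Note the hedge: Iceberg's actual bucket transform hashes the serialized value with 32-bit Murmur3 and takes `(hash & Integer.MAX_VALUE) % N`; this sketch substitutes a standard-library hash purely so it runs without dependencies, so the bucket assignments will not match a real Iceberg table.

```python
import hashlib

def bucket(n: int, value: str) -> int:
    """Illustrative stand-in for Iceberg's bucket transform.

    The real transform uses 32-bit Murmur3 over the serialized value;
    SHA-256 is used here only so the sketch runs with the standard
    library. The shape is the same: mask to non-negative, then mod n.
    """
    h = int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:4], "big")
    return (h & 0x7FFFFFFF) % n

# If the query engine exposes the same function the table was
# partitioned with, a user can target a single bucket, e.g. "only the
# rows whose id falls in shard 3 of 16":
rows = [{"id": f"user-{i}"} for i in range(100)]
shard_3 = [r for r in rows if bucket(16, r["id"]) == 3]

# Every selected row maps to the same bucket, so an engine could prune
# all other partitions instead of scanning the full table.
assert all(bucket(16, r["id"]) == 3 for r in shard_3)
```

The point is that the predicate and the partition layout agree because both call the same transform, which is what makes partition pruning on the engine side possible.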
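The sequence-number-as-watermark idea above can also be sketched. The snapshot structure and field names here are simplified assumptions for illustration, not Iceberg's actual metadata model; the only property the sketch relies on is the one stated above, that sequence numbers are generally monotonically increasing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    # Simplified stand-in for a table snapshot: the sequence number is
    # assigned monotonically as commits happen (assumption per the
    # discussion above; not Iceberg's real metadata classes).
    sequence_number: int
    timestamp_ms: int

def new_snapshots(snapshots, watermark: int):
    """Treat the highest consumed sequence number as a watermark.

    Snapshots strictly above the watermark are unread; because
    sequence numbers increase monotonically, this is a safe progress
    marker even in engines without native watermark support.
    """
    return sorted(
        (s for s in snapshots if s.sequence_number > watermark),
        key=lambda s: s.sequence_number,
    )

log = [Snapshot(1, 1000), Snapshot(2, 2000), Snapshot(3, 3000)]
unread = new_snapshots(log, watermark=1)
# After processing, the consumer advances its watermark to the max
# sequence number it has seen (3 here) and repeats.
```

This is what "at least having the data available" means in practice: any engine that can read the sequence number can implement this loop, even without first-class watermarks.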
I don't have much knowledge of Calcite either, but in the medium to long term I think it would be good to reach out to the Flink community about supporting some of these concepts more natively.

My concern with using column names directly to infer the function is that many users might already have columns with those names (`_years`, etc.) in their data. With the UDF approach, that issue goes away, since users can already choose the partition column name, for example in Spark via `ALTER TABLE … ADD PARTITION FIELD bucket(16, id) AS shard`.

Would you be interested in doing a POC or PR of just the transformation functions at first? Then at the community sync-up we can bring this up and possibly form a working group to get input from others on this subject 🙂

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org