GitHub user vinothchandar added a comment to the discussion: RFC-80: Design 
proposal discussions

In summary: B may already be achievable by the user:

- By splitting the columns into two tables that share a record key.
- If Hudi lands multi-table transactions 
(https://github.com/apache/hudi/blob/master/rfc/rfc-73/rfc-73.md), to keep the 
tables in sync.
- And a storage-partitioned join, or some mechanism for target engines like 
Spark/Ray to adapt the plans and scan the same record without shuffles 
(see the sketch after this list).
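
A minimal PySpark sketch of the "split into two tables sharing a record key" 
workaround. Table names, columns, and paths are hypothetical; whether the join 
actually avoids a shuffle depends on storage-partitioned-join support in the 
engine/table-format integration, which is not shown here.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rfc80-split-tables-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Structured/scalar columns go into one table ...
structured = spark.createDataFrame(
    [("doc-1", "example.com", 1024, 1)],
    ["record_key", "url_domain", "size_bytes", "ts"])

# ... and the large unstructured payload into a second table, same record key.
unstructured = spark.createDataFrame(
    [("doc-1", bytearray(b"<html>...</html>"), 1)],
    ["record_key", "raw_html", "ts"])

common_opts = {
    "hoodie.datasource.write.recordkey.field": "record_key",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

(structured.write.format("hudi")
    .options(**common_opts)
    .option("hoodie.table.name", "docs_meta")
    .mode("append")
    .save("/tmp/hudi/docs_meta"))

(unstructured.write.format("hudi")
    .options(**common_opts)
    .option("hoodie.table.name", "docs_blobs")
    .mode("append")
    .save("/tmp/hudi/docs_blobs"))

# Reads join the two tables back on the shared record key. Without a
# storage-partitioned join (or similar plan adaptation), this is a regular
# key join and will shuffle.
meta = spark.read.format("hudi").load("/tmp/hudi/docs_meta")
blobs = spark.read.format("hudi").load("/tmp/hudi/docs_blobs")
joined = meta.join(blobs, "record_key")
joined.show()
```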

I think we should deeply understand the ML/AI pipeline lifecycle that will use 
this table/data, and the access patterns. For example, it may in fact be 
desirable not to cluster the unstructured data (to keep its distribution 
randomized), or it may actually be preferable to cluster it (e.g. reorganize a 
table of HTML documents based on the URL domain). A hedged sketch of the latter 
follows.
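
To illustrate the "cluster by URL domain" end of that spectrum, a sketch using 
Hudi's inline clustering configs, sorting by a hypothetical `url_domain` 
column; omitting these options leaves the layout in arrival order instead. It 
reuses `structured` and `common_opts` from the sketch above.

```python
# Hypothetical column name and path; clustering cadence (max.commits) is an
# arbitrary example value.
clustering_opts = {
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "url_domain",
}

(structured.write.format("hudi")
    .options(**common_opts)       # record key / precombine opts from above
    .options(**clustering_opts)
    .option("hoodie.table.name", "docs_meta")
    .mode("append")
    .save("/tmp/hudi/docs_meta"))
```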

GitHub link: 
https://github.com/apache/hudi/discussions/14062#discussioncomment-14631949
