GitHub user vinothchandar added a comment to the discussion: RFC-80: Design proposal discussions
In summary, B may already be achievable by the user:

- By splitting the columns into two tables that share a record key.
- If Hudi lands multi-table transactions (https://github.com/apache/hudi/blob/master/rfc/rfc-73/rfc-73.md), by using them to keep the two tables in sync.
- With a storage-partitioned join, or some similar mechanism for target engines like Spark/Ray, to adapt the plans and scan the same record without shuffles.

I think we should deeply understand the ML/AI pipeline lifecycle that will use this table/data, and its access patterns. For example, it may in fact be desirable not to cluster the unstructured data (to keep the distribution randomized), or it may actually be preferable to cluster it (e.g. reorganize a table of HTML documents based on the URL domain).

GitHub link: https://github.com/apache/hudi/discussions/14062#discussioncomment-14631949
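To make the column-split and shuffle-free join idea concrete, here is a minimal sketch in plain Python. It does not use Hudi or Spark APIs; the names `bucket_of`, `split_columns`, `bucket_local_join`, and the bucket count are hypothetical, chosen only for illustration. The assumption is that both tables are bucketed by the same hash of the shared record key, so matching rows always sit in the same bucket and can be joined locally.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count, identical for both tables


def bucket_of(record_key, num_buckets=NUM_BUCKETS):
    """Deterministically map a record key to a bucket."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def split_columns(rows, key_field, unstructured_fields):
    """Split wide rows into two bucketed tables that share the record key."""
    structured, unstructured = {}, {}
    for row in rows:
        key = row[key_field]
        b = bucket_of(key)
        # Structured table: every column except the unstructured ones.
        structured.setdefault(b, {})[key] = {
            k: v for k, v in row.items() if k not in unstructured_fields
        }
        # Unstructured table: the record key plus the large payload columns.
        payload = {k: row[k] for k in unstructured_fields}
        unstructured.setdefault(b, {})[key] = {key_field: key, **payload}
    return structured, unstructured


def bucket_local_join(structured, unstructured):
    """Join the two tables bucket by bucket; no row ever leaves its bucket."""
    for b in sorted(structured.keys() & unstructured.keys()):
        left, right = structured[b], unstructured[b]
        for key in sorted(left.keys() & right.keys()):
            yield {**left[key], **right[key]}


rows = [
    {"url": "a.example/1", "domain": "a.example", "html": "<p>one</p>"},
    {"url": "b.example/2", "domain": "b.example", "html": "<p>two</p>"},
    {"url": "a.example/3", "domain": "a.example", "html": "<p>three</p>"},
]
structured, unstructured = split_columns(rows, "url", ["html"])
joined = list(bucket_local_join(structured, unstructured))
```

In this sketch the join reassembles the original rows. Keeping the two tables consistent under concurrent writes is the part that would rely on multi-table transactions, and it is not modeled here.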
