[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Kudinkin updated HUDI-4178:
----------------------------------
    Description: 
There are multiple issues with our current DataSource V2 integration:

Because we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This poses the following problems:
# It doesn't fully implement Spark's protocol: for example, this rule doesn't cache the produced `LogicalPlan`, making Spark re-create Hudi relations from scratch (including a full file-listing of the table) for every query reading the table. Adding caching inside that rule is not an option, however, since the V2 APIs manage their cache differently; for us to leverage that cache we would have to manage its whole lifecycle (adding, flushing) ourselves.
# Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from the commit's metadata or from a data file) every time.

  was:
Currently, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from the commit's metadata or from a data file) every time.
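A minimal, self-contained Scala sketch of the second problem. Note: `StructType`, `CatalogTable`, and `HudiRelation` below are simplified hypothetical stand-ins, not the real Spark or Hudi classes; the sketch only models why a resolution rule should forward the schema the catalog already holds instead of letting the relation re-read it from storage on every resolution.

```scala
// Hypothetical sketch: counts how often the "relation" falls back to an
// expensive schema fetch from storage (commit metadata / data files).

final case class StructType(fieldNames: Seq[String])
final case class CatalogTable(name: String, schema: StructType)

var storageReads = 0 // number of simulated round-trips to storage

def fetchSchemaFromStorage(table: String): StructType = {
  storageReads += 1 // each call stands in for a storage round-trip
  StructType(Seq("_hoodie_commit_time", "id", "value"))
}

// Touches storage only when no schema is supplied, mirroring the fix:
// the resolution rule should pass CatalogTable.schema down to the relation.
final case class HudiRelation(table: String, providedSchema: Option[StructType]) {
  val schema: StructType = providedSchema.getOrElse(fetchSchemaFromStorage(table))
}

val catalog = CatalogTable("hudi_tbl", StructType(Seq("_hoodie_commit_time", "id", "value")))

// Without schema pass-through: every resolution hits storage.
HudiRelation(catalog.name, None)
HudiRelation(catalog.name, None)
println(storageReads) // 2

// With schema pass-through: no additional storage read.
HudiRelation(catalog.name, Some(catalog.schema))
println(storageReads) // still 2
```

The same reasoning applies to the `LogicalPlan` caching issue: without caching, the whole resolution (including the table file-listing) is repeated per query.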
> HoodieSpark3Analysis does not pass schema from Spark Catalog
> ------------------------------------------------------------
>
>                 Key: HUDI-4178
>                 URL: https://issues.apache.org/jira/browse/HUDI-4178
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.11.0
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.1
>

-- 
This message was sent by Atlassian Jira
(v8.20.7#820007)