[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Kudinkin updated HUDI-4178:
----------------------------------
    Story Points: 4  (was: 1)
         Summary: Performance regressions in Spark DataSourceV2 Integration  (was: HoodieSpark3Analysis does not pass schema from Spark Catalog)

> Performance regressions in Spark DataSourceV2 Integration
> ---------------------------------------------------------
>
>                 Key: HUDI-4178
>                 URL: https://issues.apache.org/jira/browse/HUDI-4178
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.11.0
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.1
>
>
> There are multiple issues with our current DataSource V2 integration.
> Because we advertise Hudi tables as V2, Spark expects them to implement certain
> APIs that are not implemented at the moment; instead, we use a custom
> resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1
> APIs. This poses the following problems:
> # It doesn't fully implement Spark's protocol: for example, the rule doesn't
> cache the produced `LogicalPlan`, making Spark re-create Hudi relations from
> scratch (including a full listing of the table's files) for every query that
> reads the table. Adding caching at that point is not an option, however, since
> the V2 APIs manage their cache differently, so for us to leverage that cache we
> would have to manage its whole lifecycle (adding, flushing).
> # Additionally, the HoodieSpark3Analysis rule does not pass the table's schema
> from the Spark Catalog to Hudi's relations, making them fetch the schema from
> storage (either from a commit's metadata or a data file) every time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
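For illustration, the V2-to-V1 fallback described above could be sketched as a Catalyst resolution rule along the following lines. This is an untested sketch, not Hudi's actual code or fix: the rule name `HoodieDataSourceV2ToV1Fallback`, the match on `HoodieInternalV2Table`, and the exact shape of the Spark-internal `DataSourceV2Relation`/`DataSource` constructors are simplifying assumptions. The point it shows is problem #2's remedy: handing the schema the Spark Catalog already holds to the V1 relation (via `userSpecifiedSchema`), so the relation need not re-read it from storage on every query.

```scala
// Illustrative sketch only (assumed names, Spark-internal APIs simplified).
import scala.collection.JavaConverters._

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

case class HoodieDataSourceV2ToV1Fallback(spark: SparkSession)
  extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
    // Intercept the V2 relation Spark resolved for a Hudi table...
    case v2 @ DataSourceV2Relation(table: HoodieInternalV2Table, _, _, _, _) =>
      // ...and replace it with a V1 relation, passing the schema already
      // known to the Spark Catalog so Hudi does not have to fetch it from
      // commit metadata or a data file just to resolve the plan.
      val v1Relation = DataSource(
        sparkSession = spark,
        className = "hudi",
        userSpecifiedSchema = Some(table.schema()),
        options = table.properties().asScala.toMap
      ).resolveRelation()
      LogicalRelation(v1Relation, v2.output, catalogTable = None, isStreaming = false)
  }
}
```

Note that even with such a rule, problem #1 remains: because the fallback happens outside Spark's V2 code path, the produced plan is not registered in the V2 relation cache, so caching would have to be managed explicitly by the rule itself.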