[ 
https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4178:
----------------------------------
    Story Points: 4  (was: 1)
         Summary: Performance regressions in Spark DataSourceV2 Integration  
(was: HoodieSpark3Analysis does not pass schema from Spark Catalog)

> Performance regressions in Spark DataSourceV2 Integration
> ---------------------------------------------------------
>
>                 Key: HUDI-4178
>                 URL: https://issues.apache.org/jira/browse/HUDI-4178
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.11.0
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.1
>
>
> There are multiple issues with our current DataSource V2 integration:
> Because we advertise Hudi tables as V2, Spark expects them to implement certain 
> APIs that are not implemented at the moment; instead, we use a custom 
> resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 
> APIs. This poses the following problems:
>  # It doesn't fully implement Spark's protocol: for example, the rule doesn't 
> cache the produced `LogicalPlan`, making Spark re-create Hudi relations from 
> scratch (including a full file listing of the table) for every query that reads 
> the table. However, adding caching in that sequence is not an option, 
> since the V2 APIs manage the cache differently; for us to be able to 
> leverage that cache, we would have to manage its entire lifecycle (adding, 
> flushing).
>  # Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from 
> the Spark Catalog to Hudi's relations, making them fetch the schema from 
> storage (either from commit metadata or from a data file) every time.
>  
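To make the caching concern in point 1 concrete, here is a minimal, self-contained sketch (plain Java stand-ins, not actual Hudi or Spark classes; `buildRelation`, `resolve`, and the `listings` counter are hypothetical names for illustration) of why memoizing the resolved relation matters: without a cache, every query pays the relation-construction cost (file listing, schema resolution); with one, it is paid once per table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class RelationCacheSketch {
    // Counts how many times the expensive build (file listing + schema
    // resolution in the real integration) actually runs.
    static final AtomicInteger listings = new AtomicInteger();

    // Stand-in for building a Hudi relation from scratch.
    static String buildRelation(String table) {
        listings.incrementAndGet();
        return "relation:" + table;
    }

    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // With a cache (what the V2 cache lifecycle would provide), the
    // relation is built once per table instead of once per query.
    static String resolve(String table) {
        return cache.computeIfAbsent(table, RelationCacheSketch::buildRelation);
    }

    public static void main(String[] args) {
        resolve("hudi_tbl");
        resolve("hudi_tbl");
        resolve("hudi_tbl");
        System.out.println(listings.get()); // prints 1: built once, not per query
    }
}
```

The trade-off the issue describes is that Spark's V2 code path owns an equivalent cache itself, so a V1-fallback rule cannot simply bolt one on without also handling invalidation (adding and flushing entries) correctly.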



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
