[ https://issues.apache.org/jira/browse/HUDI-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Kudinkin updated HUDI-4178:
----------------------------------
    Description: 
There are multiple issues with our current DataSource V2 integration:

Because we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to the V1 APIs. This poses the following problems:
# It doesn't fully implement Spark's protocol: for example, this rule doesn't cache the produced `LogicalPlan`, making Spark re-create Hudi relations from scratch (including a full file-listing of the table) for every query reading the table. Adding caching inside that rule is not an option, however, since the V2 APIs manage their cache differently; for us to leverage that cache we would have to manage its whole lifecycle (adding, flushing) ourselves.
# Additionally, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from the commit's metadata or from a data file) every time.

  was:
Currently, the HoodieSpark3Analysis rule does not pass the table's schema from the Spark Catalog to Hudi's relations, making them fetch the schema from storage (either from the commit's metadata or from a data file) every time.
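A minimal, self-contained Scala sketch of the second problem. Note: `StructType`, `CatalogTable`, and `HudiRelation` below are simplified hypothetical stand-ins, not the real Spark or Hudi classes; the sketch only models why a resolution rule should forward the schema the catalog already holds instead of letting the relation re-read it from storage on every resolution.

```scala
// Hypothetical sketch: counts how often the "relation" falls back to an
// expensive schema fetch from storage (commit metadata / data files).

final case class StructType(fieldNames: Seq[String])
final case class CatalogTable(name: String, schema: StructType)

var storageReads = 0 // number of simulated round-trips to storage

def fetchSchemaFromStorage(table: String): StructType = {
  storageReads += 1 // each call stands in for a storage round-trip
  StructType(Seq("_hoodie_commit_time", "id", "value"))
}

// Touches storage only when no schema is supplied, mirroring the fix:
// the resolution rule should pass CatalogTable.schema down to the relation.
final case class HudiRelation(table: String, providedSchema: Option[StructType]) {
  val schema: StructType = providedSchema.getOrElse(fetchSchemaFromStorage(table))
}

val catalog = CatalogTable("hudi_tbl", StructType(Seq("_hoodie_commit_time", "id", "value")))

// Without schema pass-through: every resolution hits storage.
HudiRelation(catalog.name, None)
HudiRelation(catalog.name, None)
println(storageReads) // 2

// With schema pass-through: no additional storage read.
HudiRelation(catalog.name, Some(catalog.schema))
println(storageReads) // still 2
```

The same reasoning applies to the `LogicalPlan` caching issue: without caching, the whole resolution (including the table file-listing) is repeated per query.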
> HoodieSpark3Analysis does not pass schema from Spark Catalog
> ------------------------------------------------------------
>
>                 Key: HUDI-4178
>                 URL: https://issues.apache.org/jira/browse/HUDI-4178
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.11.0
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.1
>

-- 
This message was sent by Atlassian Jira
(v8.20.7#820007)