[ 
https://issues.apache.org/jira/browse/HUDI-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-1157:
----------------------------
    Sprint: 2022/09/19

> Optimization whether to query Bootstrapped table using 
> HoodieBootstrapRelation vs Spark's Parquet datasource
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-1157
>                 URL: https://issues.apache.org/jira/browse/HUDI-1157
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: bootstrap
>            Reporter: Udit Mehrotra
>            Priority: Major
>
> This has been discussed in 
> [https://github.com/apache/hudi/pull/1702#discussion_r466317612].
> As of now, while querying using *DataSource*, we check whether the table has 
> been bootstrapped by the presence of the *bootstrap base path* in the 
> *hoodie.properties* file, and based on that we query the table using 
> *HoodieBootstrapRelation* or the *Spark Parquet Data Source*. However, there 
> could be a scenario where all the files in the originally bootstrapped table 
> have either been *upserted* or *deleted*, and thus have been fully 
> bootstrapped, with their data moved over to the target hoodie table. For such 
> tables, we can start querying them using the *Spark Parquet Data Source*, 
> which will be faster because it benefits from all of Spark's optimizations.
> So, basically, we need a way to check whether all of the files have been fully 
> bootstrapped and moved over to the target location.
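
The decision described above can be sketched roughly as follows. This is only an illustration of the proposed check, not Hudi code: `BaseFile`, `bootstrap_source_path`, and `choose_reader` are hypothetical names, and the sketch assumes the latest base file of each file group can report whether it still depends on a file under the bootstrap base path.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BaseFile:
    """Hypothetical stand-in for the latest base file of one file group."""
    path: str
    # Source file this base file still depends on, or None once an
    # upsert/delete has rewritten the data into the target table.
    bootstrap_source_path: Optional[str] = None


def choose_reader(bootstrap_base_path: Optional[str],
                  latest_base_files: Iterable[BaseFile]) -> str:
    """Return which reader to use: 'bootstrap_relation' or 'parquet'."""
    # No bootstrap base path in the table properties: never bootstrapped.
    if not bootstrap_base_path:
        return "parquet"
    # Any file still pointing at the source table needs the bootstrap reader.
    if any(f.bootstrap_source_path is not None for f in latest_base_files):
        return "bootstrap_relation"
    # Every file has been rewritten in the target table: plain Parquet is safe.
    return "parquet"
```

The open question in the issue is where the per-file information would come from cheaply; the sketch simply takes it as input.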



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
