[ 
https://issues.apache.org/jira/browse/HUDI-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-7316:
----------------------------
    Fix Version/s: 0.15.0

> AbstractHoodieLogRecordReader should accept already-constructed 
> HoodieTableMetaClient in order to reduce occurrences of file listing calls 
> when reloading active timeline
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-7316
>                 URL: https://issues.apache.org/jira/browse/HUDI-7316
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Krishen Bhan
>            Priority: Trivial
>              Labels: pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
>
> Currently, some implementations of AbstractHoodieLogRecordReader create a 
> HoodieTableMetaClient on construction, which implicitly reloads the active 
> timeline and triggers a {{listStatus}} HDFS call. With the Spark engine, 
> these readers are created in Spark executors, so a job may have hundreds to 
> thousands of executors issuing a {{listStatus}} call at the same time 
> (during a Spark stage). To avoid these redundant calls to the HDFS NameNode 
> (or any distributed filesystem service in general), callers of 
> AbstractHoodieLogRecordReader and its implementations should pass in an 
> already-constructed HoodieTableMetaClient.
> As long as the caller passes in a HoodieTableMetaClient whose active 
> timeline is already loaded, and the implementation does not need to reload 
> the timeline (for example, to get a "fresher" timeline), these calls are 
> avoided in the executors without affecting correctness.
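
The proposal in the description can be illustrated with a minimal,
self-contained sketch. Every class below (FakeStorage, FakeMetaClient,
FakeLogRecordReader) is a hypothetical stand-in written for this example, not
Hudi's actual API. The sketch assumes only what the description states: that
constructing a meta client loads the active timeline and costs one listing
call. It ignores how the meta client would actually reach the executors.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class MetaClientReuseSketch {

    /** Hypothetical stand-in for the filesystem; counts listing calls. */
    static final class FakeStorage {
        final AtomicInteger listCalls = new AtomicInteger();

        List<String> listTimelineFiles(String basePath) {
            listCalls.incrementAndGet(); // models one listStatus-style call
            return Collections.singletonList(basePath + "/.hoodie/001.commit");
        }
    }

    /** Hypothetical stand-in for a meta client that loads the timeline on construction. */
    static final class FakeMetaClient {
        final List<String> activeTimeline;

        FakeMetaClient(FakeStorage storage, String basePath) {
            this.activeTimeline = storage.listTimelineFiles(basePath);
        }
    }

    /** Hypothetical stand-in for a log record reader. */
    static final class FakeLogRecordReader {
        private final FakeMetaClient metaClient;

        // Pattern described as the problem: the reader builds its own meta client,
        // so every reader instance triggers a listing.
        FakeLogRecordReader(FakeStorage storage, String basePath) {
            this(new FakeMetaClient(storage, basePath));
        }

        // Proposed pattern: the caller passes in an already-constructed meta client.
        FakeLogRecordReader(FakeMetaClient metaClient) {
            this.metaClient = metaClient;
        }

        int timelineSize() {
            return metaClient.activeTimeline.size();
        }
    }

    /** Each of the readers constructs its own meta client. */
    static int listingsWhenEachReaderBuildsClient(int readers) {
        FakeStorage storage = new FakeStorage();
        for (int i = 0; i < readers; i++) {
            new FakeLogRecordReader(storage, "/tmp/table").timelineSize();
        }
        return storage.listCalls.get();
    }

    /** One meta client is built up front and shared by all readers. */
    static int listingsWhenClientIsShared(int readers) {
        FakeStorage storage = new FakeStorage();
        FakeMetaClient shared = new FakeMetaClient(storage, "/tmp/table");
        for (int i = 0; i < readers; i++) {
            new FakeLogRecordReader(shared).timelineSize();
        }
        return storage.listCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("per-reader client: " + listingsWhenEachReaderBuildsClient(1000));
        System.out.println("shared client:     " + listingsWhenClientIsShared(1000));
    }
}
```

In this toy model, 1000 readers that each build their own client produce 1000
listing calls, while 1000 readers sharing one pre-built client produce a
single call. Whether a shared timeline is fresh enough is the caller's
decision, as the description notes.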



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
