[ https://issues.apache.org/jira/browse/HUDI-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Guo updated HUDI-7316:
----------------------------
    Fix Version/s: 0.15.0

> AbstractHoodieLogRecordReader should accept already-constructed HoodieTableMetaClient in order to reduce occurrences of file listing calls when reloading active timeline
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-7316
>                 URL: https://issues.apache.org/jira/browse/HUDI-7316
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Krishen Bhan
>            Priority: Trivial
>              Labels: pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
>
> Currently, some implementations of AbstractHoodieLogRecordReader create a
> HoodieTableMetaClient on construction. This implicitly reloads the active
> timeline, which triggers a {{listStatus}} call against HDFS. With the Spark
> engine, these readers are created in Spark executors, so a job with hundreds
> to thousands of executors may issue that many {{listStatus}} calls at the
> same time (during a single Spark stage). To avoid these redundant calls to
> the HDFS NameNode (or to any distributed filesystem service in general),
> callers of AbstractHoodieLogRecordReader and its implementations should pass
> in an already-constructed HoodieTableMetaClient.
> As long as the caller passes in a HoodieTableMetaClient whose active timeline
> is already loaded, and the implementation does not need to reload the
> timeline (for example, to get a "fresher" timeline), these calls are avoided
> in the executors without affecting correctness.
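
The proposed change can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in, not Hudi's actual API: TableMetaClient plays the role of HoodieTableMetaClient, LogRecordReader plays the role of an AbstractHoodieLogRecordReader implementation, and a counter simulates {{listStatus}} calls. It assumes that building a meta client costs exactly one listing call. It also runs in a single JVM, so it does not model how a meta client would reach Spark executors.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; these are NOT Hudi's real
// classes or constructor signatures.
class MetaClientReuseSketch {

    // Counts simulated listStatus calls against the filesystem.
    static final AtomicInteger LIST_STATUS_CALLS = new AtomicInteger();

    // Stand-in for HoodieTableMetaClient: constructing one loads the active
    // timeline, which is assumed here to cost one listing call.
    static final class TableMetaClient {
        final String basePath;

        TableMetaClient(String basePath) {
            this.basePath = basePath;
            LIST_STATUS_CALLS.incrementAndGet(); // simulated listStatus
        }
    }

    // Stand-in for an AbstractHoodieLogRecordReader implementation.
    static final class LogRecordReader {
        final TableMetaClient metaClient;

        // Before: each reader builds its own meta client from the base path.
        LogRecordReader(String basePath) {
            this(new TableMetaClient(basePath));
        }

        // After: the caller supplies an already-constructed meta client.
        LogRecordReader(TableMetaClient metaClient) {
            this.metaClient = metaClient;
        }
    }

    // Simulates N readers that each construct a meta client.
    static int listingsWhenEachReaderBuildsItsOwn(int readers) {
        LIST_STATUS_CALLS.set(0);
        for (int i = 0; i < readers; i++) {
            new LogRecordReader("/tmp/hypothetical_table");
        }
        return LIST_STATUS_CALLS.get();
    }

    // Simulates N readers that share one meta client built up front.
    static int listingsWhenMetaClientIsShared(int readers) {
        LIST_STATUS_CALLS.set(0);
        TableMetaClient shared = new TableMetaClient("/tmp/hypothetical_table");
        for (int i = 0; i < readers; i++) {
            new LogRecordReader(shared);
        }
        return LIST_STATUS_CALLS.get();
    }

    public static void main(String[] args) {
        System.out.println(listingsWhenEachReaderBuildsItsOwn(1000)); // prints 1000
        System.out.println(listingsWhenMetaClientIsShared(1000));     // prints 1
    }
}
```

Under these assumptions the number of listing calls drops from one per reader to one in total. The trade-off is the one the description already names: a shared meta client only sees the timeline as of the moment it was built, so this is appropriate only when the reader does not need a fresher timeline.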