Krishen Bhan created HUDI-7316:
----------------------------------

             Summary: AbstractHoodieLogRecordReader should accept already-constructed HoodieTableMetaClient in order to reduce occurrences of file listing calls when reloading active timeline
                 Key: HUDI-7316
                 URL: https://issues.apache.org/jira/browse/HUDI-7316
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Krishen Bhan

Currently, some implementations of AbstractHoodieLogRecordReader create a HoodieTableMetaClient on construction. This implicitly reloads the active timeline, which triggers a {{listStatus}} HDFS call. With the Spark engine, these readers are created in Spark executors, so a job with hundreds to thousands of executors may issue that many {{listStatus}} calls at the same time during a single Spark stage.

To avoid these redundant calls to the HDFS NameNode (or to any distributed filesystem service in general), callers and implementations of AbstractHoodieLogRecordReader should pass in an already-constructed HoodieTableMetaClient. As long as the caller passes in a HoodieTableMetaClient whose active timeline is already loaded, and the implementation does not need to reload the timeline (for example, to get a "fresher" timeline), these calls are avoided in the executors without making the logic incorrect.
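
To illustrate the idea, here is a minimal, self-contained Java sketch. It does not use Hudi's actual classes or signatures: {{CountingFileSystem}}, {{SketchMetaClient}}, and {{SketchLogRecordReader}} are hypothetical stand-ins, and the file names and table path are made up. The sketch assumes only what is described above, namely that constructing a meta client lists the timeline directory once. It runs in a single JVM and does not model how the meta client would reach Spark executors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; none of these types are Hudi classes.
public class MetaClientReuseSketch {

    /** Stand-in for a distributed filesystem that counts listing calls. */
    static final class CountingFileSystem {
        private final AtomicInteger listStatusCalls = new AtomicInteger();

        List<String> listStatus(String dir) {
            listStatusCalls.incrementAndGet();
            List<String> files = new ArrayList<>();
            files.add(dir + "/001.commit");        // made-up instant files
            files.add(dir + "/002.deltacommit");
            return files;
        }

        int listStatusCalls() {
            return listStatusCalls.get();
        }
    }

    /** Stand-in for a table meta client: loads the timeline eagerly on construction. */
    static final class SketchMetaClient {
        private final List<String> activeTimeline;

        SketchMetaClient(CountingFileSystem fs, String basePath) {
            // One listing call per constructed meta client.
            this.activeTimeline = fs.listStatus(basePath + "/.hoodie");
        }

        List<String> activeTimeline() {
            return activeTimeline;
        }
    }

    /** Stand-in for a log record reader with both construction styles. */
    static final class SketchLogRecordReader {
        private final SketchMetaClient metaClient;

        // Style described as the current behavior: the reader builds its own meta client.
        SketchLogRecordReader(CountingFileSystem fs, String basePath) {
            this(new SketchMetaClient(fs, basePath));
        }

        // Proposed style: the caller passes in an already-constructed meta client.
        SketchLogRecordReader(SketchMetaClient metaClient) {
            this.metaClient = metaClient;
        }

        int instantCount() {
            return metaClient.activeTimeline().size();
        }
    }

    /** Each reader constructs its own meta client, so listings grow with the reader count. */
    public static int listingsWhenEachReaderBuildsItsOwn(int readers) {
        CountingFileSystem fs = new CountingFileSystem();
        for (int i = 0; i < readers; i++) {
            new SketchLogRecordReader(fs, "/tmp/table").instantCount();
        }
        return fs.listStatusCalls();
    }

    /** One meta client is built up front and handed to every reader. */
    public static int listingsWhenReadersShareOne(int readers) {
        CountingFileSystem fs = new CountingFileSystem();
        SketchMetaClient shared = new SketchMetaClient(fs, "/tmp/table");
        for (int i = 0; i < readers; i++) {
            new SketchLogRecordReader(shared).instantCount();
        }
        return fs.listStatusCalls();
    }

    public static void main(String[] args) {
        System.out.println("own meta client per reader: "
                + listingsWhenEachReaderBuildsItsOwn(1000) + " listings");
        System.out.println("shared meta client: "
                + listingsWhenReadersShareOne(1000) + " listings");
    }
}
```

In the sketch, both constructors leave the reader with the same timeline contents; only the number of listing calls differs. A reader that genuinely needs a fresher timeline would still have to reload it, which this sketch does not cover.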