Krishen Bhan created HUDI-7316:
----------------------------------

             Summary: AbstractHoodieLogRecordReader should accept 
already-constructed HoodieTableMetaClient in order to reduce occurrences of 
file listing calls when reloading active timeline
                 Key: HUDI-7316
                 URL: https://issues.apache.org/jira/browse/HUDI-7316
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Krishen Bhan


Currently, some implementations of AbstractHoodieLogRecordReader create a 
HoodieTableMetaClient on construction, which implicitly reloads the active 
timeline and causes a {{listStatus}} HDFS call. With the Spark engine, these 
readers are created in Spark executors, so a Spark user may have hundreds to 
thousands of executors making a {{listStatus}} call at the same time 
(during a Spark stage). To avoid these redundant calls to the HDFS NameNode (or 
any distributed filesystem service in general), users and implementations of 
AbstractHoodieLogRecordReader should pass in an already-constructed 
HoodieTableMetaClient.

As long as the caller passes in a HoodieTableMetaClient with the active timeline 
already loaded, and the implementation doesn't need to reload the timeline 
(for example, to get a "fresher" timeline), these calls will be 
avoided in the executors without making the logic incorrect.
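
To illustrate the difference in call volume, here is a minimal, self-contained 
sketch. It is a toy model and not Hudi code: FakeFileSystem, FakeMetaClient, 
FakeLogRecordReader and simulate are hypothetical stand-ins, and the 
assumption that building a meta client costs exactly one {{listStatus}} call 
is a simplification of the behaviour described above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: every name here is a hypothetical stand-in, not a Hudi class. */
final class LogReaderSketch {

    /** Counts simulated listStatus calls made against the filesystem service. */
    static final class FakeFileSystem {
        final AtomicInteger listStatusCalls = new AtomicInteger();

        void listStatus(String path) {
            listStatusCalls.incrementAndGet();
        }
    }

    /** Stand-in for HoodieTableMetaClient: construction loads the timeline (one listing). */
    static final class FakeMetaClient {
        FakeMetaClient(FakeFileSystem fs, String basePath) {
            fs.listStatus(basePath + "/.hoodie");
        }
    }

    /** Stand-in for an AbstractHoodieLogRecordReader implementation. */
    static final class FakeLogRecordReader {
        final FakeMetaClient metaClient;

        // Behaviour described in the issue: each reader builds its own meta client.
        FakeLogRecordReader(FakeFileSystem fs, String basePath) {
            this.metaClient = new FakeMetaClient(fs, basePath);
        }

        // Proposed behaviour: reuse a meta client the caller already constructed.
        FakeLogRecordReader(FakeMetaClient metaClient) {
            this.metaClient = metaClient;
        }
    }

    /** Creates the given number of readers and returns how many listings were issued. */
    static int simulate(int readers, boolean reuseMetaClient) {
        FakeFileSystem fs = new FakeFileSystem();
        String basePath = "/tmp/hypothetical_table"; // placeholder path
        if (reuseMetaClient) {
            FakeMetaClient shared = new FakeMetaClient(fs, basePath); // one listing in total
            for (int i = 0; i < readers; i++) {
                new FakeLogRecordReader(shared);
            }
        } else {
            for (int i = 0; i < readers; i++) {
                new FakeLogRecordReader(fs, basePath); // one listing per reader
            }
        }
        return fs.listStatusCalls.get();
    }
}
```

In this model, simulate(1000, false) issues one listing per reader, while 
simulate(1000, true) issues a single listing when the shared meta client is 
built. How the already-constructed meta client reaches the executors is outside 
the scope of this sketch.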



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
