hudi-bot opened a new issue, #14904: URL: https://github.com/apache/hudi/issues/14904
There are currently three ways to fetch the incremental data files for a streaming read:

1. Read the incremental commit metadata and resolve the data files to construct the incremental filesystem view.
2. Scan the filesystem directly and filter the data files by start commit time, when consumption starts from the 'earliest' offset.
3. A more efficient variant of 2: look up the metadata table when it is enabled.

None of these is sufficient for production:

- For 1: when the start commit time is far in the past, the instants may have been archived, and loading those archived metadata files takes too long — in our production, more than 30 minutes, which is unacceptable.
- For 2 and 3: they are only suitable for reads that consume the full history plus the incremental data set.

We should propose a way to look up the incremental data files for an arbitrary interval of instants, so that the filesystem view can be constructed efficiently.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-2750
- Type: Task
- Epic: https://issues.apache.org/jira/browse/HUDI-2749
- Fix version(s): 1.1.0

---

## Comments

**vinoth — 17/Nov/21 13:32:**

+1 on this. Dumping my thoughts here. When the start commit is far in the past, 2/3 can be more performant, since they already filter out the files that have been cleaned, etc. Reading the entire timeline archive log can be time-consuming. I think we can index the timeline as well and support efficient range retrievals. But I'm wondering why you think 2/3 are only suitable for full-history reads. Is it because the log files don't carry the delta commit instant in their names today? With these (at least on object storage), we can figure out which files changed between any given interval, right? Is this the gap?

---

**danny0405 — 18/Nov/21 10:02:**

> but wondering why you think 2/3 is just only suitable for full history reads?

Yes — for 2 and 3, we can do a one-shot scan the first time the streaming reader lists the files. For object storage, we can figure out the incremental change files efficiently. My bad; if the file scanning is fast enough, the metadata should not be a bottleneck.

---

**linliu — 12/Sep/23 23:54:**

[~danny0405], [~vinoth], since it has been a while since the task was filed, can you please check the task description and see if anything needs to be updated?

---

**linliu — 13/Sep/23 07:14:**

[~danny0405] confirmed offline that this task is still valid. Will work on this task shortly.
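To make the trade-off among the three strategies concrete, here is a minimal sketch of the selection logic described above. All names (`FetchStrategy`, `StrategyPicker`, `needsArchivedTimeline`) are hypothetical and illustrative — this is not Hudi's actual API. It assumes Hudi-style commit times that compare lexicographically (e.g. `yyyyMMddHHmmss`):

```java
import java.util.Objects;

// Hypothetical sketch of choosing among the three file-fetching strategies.
// Strategy 1 = resolve files from incremental commit metadata;
// Strategy 2 = full filesystem scan filtered by start commit time;
// Strategy 3 = like 2, but served from the metadata table when enabled.
enum FetchStrategy { COMMIT_METADATA, FULL_SCAN, METADATA_TABLE }

class StrategyPicker {

  // startCommit == null models consuming from the 'earliest' offset.
  static FetchStrategy pick(String startCommit, boolean metadataTableEnabled) {
    if (startCommit == null) {
      // Ways 2 & 3: one-shot scan, preferring the metadata table when enabled.
      return metadataTableEnabled ? FetchStrategy.METADATA_TABLE
                                  : FetchStrategy.FULL_SCAN;
    }
    // Way 1: resolve files from incremental commit metadata.
    return FetchStrategy.COMMIT_METADATA;
  }

  // The slow path the issue complains about: if the start commit precedes
  // the earliest instant still on the active timeline, the archived
  // timeline must be loaded to resolve the interval.
  static boolean needsArchivedTimeline(String startCommit, String earliestActiveInstant) {
    Objects.requireNonNull(earliestActiveInstant);
    return startCommit != null && startCommit.compareTo(earliestActiveInstant) < 0;
  }
}
```

Under this sketch, a reader starting from 'earliest' never touches the archive, while a reader with an old explicit start commit falls into the archived-timeline path — exactly the >30-minute case the description calls out.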
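The "index the timeline and support efficient range retrievals" idea from the comments can be sketched as two binary searches over sorted instant times, answering an arbitrary `(startTs, endTs]` interval without replaying the whole archive. The `TimelineIndex` class below is an illustrative assumption, not Hudi code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an indexed timeline: instant times kept sorted,
// so an arbitrary interval (startTs, endTs] is answered in O(log n)
// instead of scanning the entire timeline archive log.
class TimelineIndex {
  private final List<String> instants; // sorted instant times, e.g. "20231101093000"

  TimelineIndex(List<String> sortedInstants) {
    this.instants = sortedInstants;
  }

  // Returns all instants t with startTs < t <= endTs.
  List<String> findInRange(String startTs, String endTs) {
    int lo = upperBound(startTs); // first instant strictly after startTs
    int hi = upperBound(endTs);   // first instant strictly after endTs
    return instants.subList(lo, hi);
  }

  // Index of the first instant strictly greater than ts.
  private int upperBound(String ts) {
    int lo = 0, hi = instants.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (instants.get(mid).compareTo(ts) <= 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}
```

With an index like this kept over both active and archived instants, the start commit being "far in the past" no longer forces a full load of the archived metadata files — only the instants inside the requested interval are touched.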
