hudi-bot opened a new issue, #14904:
URL: https://github.com/apache/hudi/issues/14904

   There are currently three ways to fetch the incremental data files for a 
streaming read:
   
   1. Read the incremental commit metadata and resolve the data files to 
construct the incremental filesystem view.
   2. Scan the filesystem directly and filter the data files by start commit 
time, when consumption starts from the 'earliest' offset.
   3. A more efficient variant of 2: look up the metadata table, if it is 
enabled.
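Approach 2 above amounts to a full file listing filtered by commit time. A minimal, language-agnostic sketch of that filtering step (the actual Hudi implementation is Java; the file names and the way the commit time is extracted here are illustrative assumptions, loosely modeled on Hudi base file naming):

```python
# Hypothetical sketch of approach 2: scan all data files and keep those
# whose commit time is at or after the start commit time. The file-name
# format below (<fileId>_<writeToken>_<commitTime>.parquet) is illustrative.

def filter_files_by_start_commit(all_files, start_commit_time):
    """Keep files whose embedded commit time >= start_commit_time."""
    result = []
    for name in all_files:
        # Assume the commit time is the third underscore-separated field.
        commit_time = name.split("_")[2].split(".")[0]
        # Fixed-width timestamp strings compare correctly lexicographically.
        if commit_time >= start_commit_time:
            result.append(name)
    return result

files = [
    "f1_0-1-0_20211101093000.parquet",
    "f2_0-1-0_20211115120000.parquet",
    "f3_0-1-0_20211118083000.parquet",
]
print(filter_files_by_start_commit(files, "20211110000000"))
# → ['f2_0-1-0_20211115120000.parquet', 'f3_0-1-0_20211118083000.parquet']
```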
   
   However, these three ways fall short for production use:
   
   For 1: there is a bottleneck when the start commit time is far in the past 
and the instants may already have been archived; loading those archived 
metadata files takes too long. In our production it took more than 30 minutes, 
which is unacceptable.
   
   For 2 & 3: they are only suitable for cases that read the full history plus 
the incremental data set.
   
   We should propose a way to look up the incremental data files for an 
arbitrary interval of instants, so that the filesystem view can be constructed 
efficiently.
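The proposed interval lookup can be sketched as follows: given commit metadata keyed by instant time, where each entry lists the data files that instant touched, collect the files changed within an arbitrary `[start, end]` range. The metadata layout here is a simplified assumption, not Hudi's actual commit metadata format:

```python
# Hypothetical sketch of the proposed arbitrary-interval lookup.
# commit_metadata maps instant time -> list of data files touched by it.

def files_in_interval(commit_metadata, start, end):
    """Union of data files touched by instants in [start, end], inclusive."""
    changed = set()
    for instant, files in commit_metadata.items():
        if start <= instant <= end:
            changed.update(files)
    return sorted(changed)

metadata = {
    "20211101": ["a.parquet"],
    "20211110": ["b.parquet", "c.parquet"],
    "20211120": ["c.parquet", "d.parquet"],
}
print(files_in_interval(metadata, "20211105", "20211118"))
# → ['b.parquet', 'c.parquet']
```

The point of the proposal is that this lookup should be served from an indexed store (e.g. the metadata table) rather than by replaying archived timeline files one by one.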
   
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-2750
   - Type: Task
   - Epic: https://issues.apache.org/jira/browse/HUDI-2749
   - Fix version(s):
     - 1.1.0
   
   
   ---
   
   
   ## Comments
   
   **17/Nov/21 13:32, vinoth:** +1 on this. Dumping my thoughts here. When the 
start commit is far away, 2/3 can be more performant, since they already filter 
out the files that have been cleaned, etc. Reading the entire timeline archive 
log can be time consuming.
   
   I think we can index the timeline as well and support efficient range 
retrievals. But I'm wondering why you think 2/3 are only suitable for full 
history reads. Is it because the log files don't have the delta commit instant 
in their names today? With these (at least on object storage), we can figure 
out what files changed between any given interval, right?
   
   Is this the gap?
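The "index the timeline" idea above can be sketched simply: keep the completed instant times sorted, so a `[start, end]` range retrieval becomes two binary searches instead of a scan over the full (possibly archived) timeline. This is a minimal illustration of the idea, not Hudi's timeline API:

```python
import bisect

# Hypothetical sketch of an indexed timeline supporting range retrieval.
# sorted_instants must be sorted ascending (instant times sort lexicographically).

def instants_in_range(sorted_instants, start, end):
    """Return instants within [start, end] inclusive, via binary search."""
    lo = bisect.bisect_left(sorted_instants, start)
    hi = bisect.bisect_right(sorted_instants, end)
    return sorted_instants[lo:hi]

timeline = ["20211101", "20211105", "20211110", "20211115", "20211120"]
print(instants_in_range(timeline, "20211104", "20211116"))
# → ['20211105', '20211110', '20211115']
```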
   
   ---
   
   **18/Nov/21 10:02, danny0405:**
   
   > but wondering why you think 2/3 is just only suitable for full history 
reads?
   
   Yes, for 2 & 3, we can do a one-shot scan the first time the streaming 
reader lists the files. For object storage, we can figure out the incrementally 
changed files efficiently. My bad; if the file scanning is fast enough, the 
metadata should not be a bottleneck.
   
   ---
   
   **12/Sep/23 23:54, linliu:** [~danny0405], [~vinoth], since it has been a 
while since the task was filed, can you please check the task description and 
see if anything needs to be updated?
   
   ---
   
   **13/Sep/23 07:14, linliu:** [~danny0405] confirmed offline that this task 
is still valid. Will work on this task shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]