[ 
https://issues.apache.org/jira/browse/HUDI-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313828#comment-17313828
 ] 

sivabalan narayanan commented on HUDI-1723:
-------------------------------------------

[~xushiyan]: I don't have much experience on the query side, so a few noob 
questions. 

What's the granularity of the modification time? If it's milliseconds, do you 
mean that we will have a lot of files with exactly the same modification time 
at ms granularity? 

Did you see this happen in a prod env, or are you speaking theoretically? 

I understand the problem, just trying to gauge the severity and probability of 
it occurring. 

> DFSPathSelector skips files with the same modify date when read up to source 
> limit
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1723
>                 URL: https://issues.apache.org/jira/browse/HUDI-1723
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Raymond Xu
>            Priority: Critical
>              Labels: sev:critical, user-support-issues
>             Fix For: 0.9.0
>
>         Attachments: Screen Shot 2021-03-26 at 1.42.42 AM.png
>
>
> org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles 
> filters the input files based on the last saved checkpoint, which is the 
> modification date of the last read file. However, multiple files can share 
> that modification date, so some of them are skipped when reading up to the 
> source limit. An illustration is shown in the attached picture.
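
The skipping behaviour described above can be reproduced with a small model. 
This is only a sketch under stated assumptions, not Hudi's actual code: the 
names FileStatus, select_files, run_rounds and the include_ties flag are 
hypothetical, and the strict "newer than checkpoint" filter, the sort by 
modification time, and the byte-based source limit are assumptions made for 
illustration. The include_ties branch shows one possible mitigation (keep 
reading files that share the last selected modification time), not necessarily 
the fix that was adopted.

```python
from typing import List, NamedTuple, Tuple


class FileStatus(NamedTuple):
    """Hypothetical stand-in for a listed file."""
    path: str
    mod_time: int  # modification time, e.g. epoch millis
    size: int      # size in bytes


def select_files(files: List[FileStatus], checkpoint: int, source_limit: int,
                 include_ties: bool = False) -> Tuple[List[FileStatus], int]:
    """Pick files newer than the checkpoint until the source limit is hit."""
    # Assumption: only files strictly newer than the checkpoint are eligible.
    eligible = sorted((f for f in files if f.mod_time > checkpoint),
                      key=lambda f: (f.mod_time, f.path))
    selected: List[FileStatus] = []
    total = 0
    for f in eligible:
        if total + f.size > source_limit:
            # Possible mitigation: still take files whose modification time
            # equals that of the last selected file, so none is left behind.
            is_tie = bool(selected) and f.mod_time == selected[-1].mod_time
            if not (include_ties and is_tie):
                break
        selected.append(f)
        total += f.size
    # The new checkpoint is the modification time of the last file read.
    new_checkpoint = selected[-1].mod_time if selected else checkpoint
    return selected, new_checkpoint


def run_rounds(files: List[FileStatus], source_limit: int,
               include_ties: bool = False) -> List[str]:
    """Run repeated sync rounds and return every path that was read."""
    checkpoint, read = 0, []
    while True:
        batch, checkpoint = select_files(files, checkpoint, source_limit,
                                         include_ties)
        if not batch:
            return read
        read.extend(f.path for f in batch)


files = [FileStatus("a", 1, 10), FileStatus("b", 2, 10),
         FileStatus("c", 2, 10), FileStatus("d", 3, 10)]
skipping = run_rounds(files, source_limit=20)
with_ties = run_rounds(files, source_limit=20, include_ties=True)
```

In this model, files "b" and "c" share a modification time. With a source 
limit of 20 bytes the first round stops after "b" and saves 2 as the 
checkpoint, so "c" is no longer newer than the checkpoint and is never read. 
With include_ties=True the first round also takes "c", at the cost of 
exceeding the source limit for that round.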



