Burak Yavuz created SPARK-19813:
-----------------------------------

             Summary: maxFilesPerTrigger combined with latestFirst may miss old 
files in combination with maxFileAge in FileStreamSource
                 Key: SPARK-19813
                 URL: https://issues.apache.org/jira/browse/SPARK-19813
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.1.0
            Reporter: Burak Yavuz
            Assignee: Burak Yavuz


There is a file stream source option called maxFileAge which limits how old the 
files can be, relative to the latest file that has been seen. This is used to 
limit the files that need to be remembered as "processed". Files older than this 
threshold (latest seen file timestamp - maxFileAge) are ignored. This value 
defaults to 7 days.
This causes a problem when both of the following hold (see the configuration 
sketch below):
 - latestFirst = true
 - maxFilesPerTrigger is set and is smaller than the total number of files to be 
processed.
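Below is a minimal configuration sketch of this combination. The schema, input 
path, and the value of maxFilesPerTrigger are hypothetical; they only illustrate 
how the three options interact.

{code:scala}
// Hypothetical reproduction sketch: latestFirst enabled, maxFilesPerTrigger set,
// and a source directory that already contains more files than fit in one trigger.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("SPARK-19813-repro").getOrCreate()
val schema = new StructType().add("value", StringType)

val df = spark.readStream
  .schema(schema)
  .option("latestFirst", "true")       // newest files are processed first
  .option("maxFilesPerTrigger", "10")  // fewer than the files already in the directory
  .option("maxFileAge", "7d")          // the default, shown explicitly
  .text("/path/to/input")              // hypothetical input directory

// After the first batch, files older than (latest seen file - maxFileAge)
// are never picked up by later batches.
val query = df.writeStream.format("console").start()
{code}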

Here is what happens in each combination:
 1) latestFirst = false - Since files are processed in order, there won't be any 
unprocessed file older than the latest processed file. All files will be 
processed.
 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
thresholding mechanism takes one batch to initialize. Since maxFilesPerTrigger 
is not set, all old files get processed in the first batch, and so no file is 
left behind.
 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
processes the latest X files. That sets the threshold to (latest file timestamp 
- maxFileAge), so files older than this threshold will never be considered for 
processing.

The bug is with case 3.
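For illustration, here is a simplified model (not the actual FileStreamSource 
code) of the case-3 behavior: once the first batch of the latest X files 
establishes the age threshold, older unprocessed files are filtered out of every 
later listing. All names below are made up for the sketch.

{code:scala}
// Simplified model of the case-3 behavior; not the real FileStreamSource logic.
case class FileEntry(path: String, timestampMs: Long)

def nextBatch(
    files: Seq[FileEntry],
    maxFilesPerTrigger: Int,
    maxFileAgeMs: Long,
    latestSeenMs: Long): (Seq[FileEntry], Long) = {
  // Threshold derived from the latest file seen so far.
  val threshold = latestSeenMs - maxFileAgeMs
  // Files older than the threshold are dropped and never reconsidered.
  val candidates = files.filter(_.timestampMs >= threshold)
  // latestFirst = true: take the newest X files.
  val batch = candidates.sortBy(-_.timestampMs).take(maxFilesPerTrigger)
  val newLatestSeen = (latestSeenMs +: batch.map(_.timestampMs)).max
  (batch, newLatestSeen)
}

// After the first call returns the latest X files, newLatestSeen jumps to the
// newest timestamp, so files older than (newLatestSeen - maxFileAgeMs) are
// excluded from every later batch even though they were never processed.
{code}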


