Jack Hu created SPARK-6061:
------------------------------

             Summary: File source dstream cannot include old files whose 
timestamp is before the system time
                 Key: SPARK-6061
                 URL: https://issues.apache.org/jira/browse/SPARK-6061
             Project: Spark
          Issue Type: Bug
          Components: Streaming
    Affects Versions: 1.2.1
            Reporter: Jack Hu


The file source dstream (StreamingContext.fileStream) has a parameter named 
"newFilesOnly" that controls whether old files are included. It worked fine in 
1.1.0 but is broken in 1.2.1: older files are always ignored, no matter what 
value is set.

Here is simple code to reproduce the issue:
https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb

The reason is that "modTimeIgnoreThreshold" in 
FileInputDStream::findNewFiles is set to a time close to the system time (the 
Spark Streaming clock time), so files older than this time are ignored.
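
The effect described above can be illustrated with a small standalone model. 
This is only a sketch in plain Python, not the Spark source: the function 
names, the 60-second remember window, and the way the threshold is combined 
with max() are all assumptions made for illustration. It shows how a threshold 
that tracks the clock drops an old file even when newFilesOnly is false.

```python
# Simplified model of the file-selection behaviour described in this report.
# All names and the 60 s window are hypothetical; this is not Spark code.

def mod_time_ignore_threshold(batch_time_ms, new_files_only,
                              start_time_ms, remember_window_ms):
    """Return the modification time at or below which files are ignored."""
    # Assumption: newFilesOnly only influences the initial threshold.
    initial = start_time_ms if new_files_only else 0
    # Assumption: the threshold also trails the clock by a fixed window,
    # which keeps it close to the current time in either case.
    return max(initial, batch_time_ms - remember_window_ms)


def find_new_files(files, batch_time_ms, new_files_only,
                   start_time_ms, remember_window_ms=60_000):
    """Select the paths whose modification time is above the threshold."""
    threshold = mod_time_ignore_threshold(
        batch_time_ms, new_files_only, start_time_ms, remember_window_ms)
    return sorted(path for path, mtime in files.items() if mtime > threshold)


now_ms = 1_000_000_000
files = {
    "old.txt": now_ms - 3_600_000,  # modified one hour before the batch
    "fresh.txt": now_ms - 1_000,    # modified one second before the batch
}

# Even with new_files_only=False, the hour-old file is filtered out.
selected = find_new_files(files, now_ms, new_files_only=False,
                          start_time_ms=now_ms - 5_000)
print(selected)
```

In this model the old file would only be picked up if the threshold stayed at 
0 when newFilesOnly is false, which is the behaviour the report expects.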



