[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346894#comment-14346894 ]
Sean Owen commented on SPARK-6061:
----------------------------------

[~jhu] {{newFilesOnly}} means old files are *not* included. It's a way to reduce, not increase, the number of files processed. Can you clarify the issue by summarizing your example: what happened, and what did you expect?

> File source dstream can not include the old file which timestamp is before the system time
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6061
>                 URL: https://issues.apache.org/jira/browse/SPARK-6061
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.2.1
>            Reporter: Jack Hu
>              Labels: FileSourceDStream, OlderFiles, Streaming
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The file source dstream (StreamingContext.fileStream) has a property named
> "newFilesOnly" that controls whether old files are included. It worked fine
> with 1.1.0 and is broken in 1.2.1: the older files are always ignored, no
> matter what value is set.
> Here is simple code to reproduce it:
> https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
> The reason is that the "modTimeIgnoreThreshold" in
> FileInputDStream::findNewFiles is set to a time close to the system time (Spark
> Streaming Clock time), so files older than this time are ignored.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
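
To make the reported behavior concrete, here is a minimal sketch in plain Python of how a modification-time threshold near the current clock time would exclude old files regardless of the flag. It is a simplified model, not Spark's code: the function names, the "remember window", and the way the threshold is derived (the later of an initial threshold and the current time minus that window) are assumptions chosen to illustrate the description above.

```python
def mod_time_ignore_threshold(current_time_ms, remember_window_ms,
                              new_files_only, start_time_ms):
    """Hypothetical model of the threshold described in the report.

    Assumption: the initial threshold is the stream start time when only
    new files are wanted, and 0 otherwise; the effective threshold is the
    later of that value and (current time - remember window).
    """
    initial = start_time_ms if new_files_only else 0
    return max(initial, current_time_ms - remember_window_ms)


def select_files(mod_times_ms, threshold_ms):
    """Return the names of files modified strictly after the threshold."""
    return [name for name, mtime in sorted(mod_times_ms.items())
            if mtime > threshold_ms]


# Illustrative values only (milliseconds on an arbitrary clock).
now_ms = 1_000_000
window_ms = 60_000
files = {"old.txt": 100_000, "recent.txt": 990_000}

# Even with new_files_only=False, the threshold stays close to "now",
# so old.txt is filtered out.
threshold_all = mod_time_ignore_threshold(now_ms, window_ms, False, 995_000)
picked_all = select_files(files, threshold_all)

# With new_files_only=True, the start time becomes the threshold.
threshold_new = mod_time_ignore_threshold(now_ms, window_ms, True, 995_000)
picked_new = select_files(files, threshold_new)
```

In this model, setting the flag to false lowers only the initial threshold; the clock-based term still dominates, which matches the symptom that older files are ignored whatever value is set.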