[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386774#comment-14386774 ]

Emre Sevinç commented on SPARK-6061:

Are there any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} configurable via some API? It appears to be hard-coded as 1 minute in https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325, which means that files older than 1 minute are not processed.

> File source dstream can not include the old file which timestamp is before the system time
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6061
>                 URL: https://issues.apache.org/jira/browse/SPARK-6061
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.2.1
>            Reporter: Jack Hu
>              Labels: FileSourceDStream, OlderFiles, Streaming
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> The file source dstream (StreamingContext.fileStream) has a property named "newFilesOnly" to include old files. It worked fine in 1.1.0 but is broken in 1.2.1: older files are always ignored no matter what value is set.
> Here is simple code to reproduce the issue:
> https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
> The reason is that "modTimeIgnoreThreshold" in FileInputDStream::findNewFiles is set to a time close to the system time (the Spark Streaming clock time), so files older than this time are ignored.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348253#comment-14348253 ]

Yi Tian commented on SPARK-6061:

[~tdas] Could you explain {{FileInputDStream.MIN_REMEMBER_DURATION}}?
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348230#comment-14348230 ]

Jack Hu commented on SPARK-6061:

[~tianyi] Do you know why {{FileInputDStream.MIN_REMEMBER_DURATION}} was introduced in 1.2.1? (Actually, it was introduced in 1.1.1/1.2.0.)
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348160#comment-14348160 ]

Yi Tian commented on SPARK-6061:

In Spark 1.2.1, setting {{newFilesOnly}} to {{false}} means that this {{FileInputDStream}} handles not only incoming files but also files that arrived within the past 1 minute (not all old files). This length of time is defined in {{FileInputDStream.MIN_REMEMBER_DURATION}}.

I think we should make this length of time configurable.
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348093#comment-14348093 ]

Jack Hu commented on SPARK-6061:

[~srowen] The issue is: I want to process the old files in the file dstream, but the old files are ignored even when {{newFilesOnly}} is set to {{false}}.
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346894#comment-14346894 ]

Sean Owen commented on SPARK-6061:

[~jhu] {{newFilesOnly}} means old files are *not* included. It's a way to reduce, not increase, the number of files processed. Can you clarify the issue by summarizing your example -- what happened, and what did you expect?
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346640#comment-14346640 ]

maji2014 commented on SPARK-6061:

Different rules apply in 1.1.0 and 1.2.1. In the current code, modTimeIgnoreThreshold is the maximum of "initialModTimeIgnoreThreshold" and "currentTime - durationToRemember.milliseconds", so "currentTime - durationToRemember.milliseconds" is the effective ignore threshold when the property "newFilesOnly" is set to false. The default value of MIN_REMEMBER_DURATION is 1 minute.
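
The threshold rule described in this comment can be modeled in a few lines. The following is a minimal standalone Python sketch, not Spark's actual code: the function names are hypothetical, and it assumes that the initial threshold is the stream start time when "newFilesOnly" is true and 0 otherwise, that the remember duration equals the 1-minute default, and that files whose modification time is at or below the threshold are skipped.

```python
# Standalone model of the ignore-threshold rule discussed in this thread.
# All names are illustrative; this does not call or reproduce Spark itself.

MIN_REMEMBER_DURATION_MS = 60_000  # 1 minute, the default quoted in the thread


def initial_mod_time_ignore_threshold(new_files_only, start_time_ms):
    # Assumption: with newFilesOnly=true the initial cutoff is the stream's
    # start time; with newFilesOnly=false there is no initial cutoff (0).
    return start_time_ms if new_files_only else 0


def mod_time_ignore_threshold(initial_threshold_ms, current_time_ms,
                              duration_to_remember_ms=MIN_REMEMBER_DURATION_MS):
    # The rule from the thread: the maximum of the initial threshold and
    # (currentTime - durationToRemember).
    return max(initial_threshold_ms, current_time_ms - duration_to_remember_ms)


def select_files(mod_times_ms, new_files_only, start_time_ms, current_time_ms):
    # Assumption: a file is skipped when its modification time is at or
    # below the threshold, and selected otherwise.
    initial = initial_mod_time_ignore_threshold(new_files_only, start_time_ms)
    threshold = mod_time_ignore_threshold(initial, current_time_ms)
    return [t for t in mod_times_ms if t > threshold]
```

Under these assumptions, a stream started at time 1,000,000 ms with "newFilesOnly" set to false gets a threshold of 940,000 ms on its first batch, so a file modified 30 seconds earlier is selected while a file modified 15 minutes earlier is skipped. This matches the behavior reported in the issue: the remember window, not "newFilesOnly", decides the fate of older files.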