[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386774#comment-14386774 ]

Emre Sevinç commented on SPARK-6061:

Are there any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} configurable via some API? It seems to be hard-coded as 1 minute in https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325, and as a result files older than 1 minute are not processed.

File source dstream can not include the old file which timestamp is before the system time

Key: SPARK-6061
URL: https://issues.apache.org/jira/browse/SPARK-6061
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.2.1
Reporter: Jack Hu
Labels: FileSourceDStream, OlderFiles, Streaming
Original Estimate: 1m
Remaining Estimate: 1m

The file source dstream (StreamingContext.fileStream) has a property named newFilesOnly that controls whether old files are included. It worked fine in 1.1.0 and broke in 1.2.1: the older files are always ignored, no matter what value is set.

Here is simple code to reproduce the problem: https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb

The reason is that the modTimeIgnoreThreshold in FileInputDStream::findNewFiles is set to a time close to the system time (the Spark Streaming clock time), so files older than this time are ignored.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348093#comment-14348093 ]

Jack Hu commented on SPARK-6061:

[~srowen] The issue is: I want to process the old files in the file dstream, but the old files are still ignored when {{newFilesOnly}} is set to {{false}}.
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348160#comment-14348160 ]

Yi Tian commented on SPARK-6061:

In Spark 1.2.1, setting {{newFilesOnly}} to {{false}} means that this {{FileInputDStream}} handles not only incoming files but also files that arrived in the past 1 minute (not all the old files). This length of time is defined in {{FileInputDStream.MIN_REMEMBER_DURATION}}. I think we should make it configurable.
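To illustrate what a configurable minimum would mean, here is a minimal Python sketch of how the remember window could be derived from the batch interval. It is a model for discussion, not the Spark code: the rounding up to a whole number of batches is an assumption about how FileInputDStream sizes its window, and the names duration_to_remember_ms, batch_ms and min_remember_ms are hypothetical.

```python
import math

# 1 minute, the hard-coded value discussed in this thread.
MIN_REMEMBER_DURATION_MS = 60_000


def duration_to_remember_ms(batch_ms, min_remember_ms=MIN_REMEMBER_DURATION_MS):
    """Length of the window of past files that is still considered.

    Assumption: the window is the smallest whole number of batches
    that covers at least min_remember_ms.
    """
    num_batches = math.ceil(min_remember_ms / batch_ms)
    return num_batches * batch_ms


# With the hard-coded minimum, a 10 s batch remembers 6 batches.
print(duration_to_remember_ms(10_000))
# If the minimum were configurable, an hour of old files could be kept.
print(duration_to_remember_ms(10_000, min_remember_ms=3_600_000))
```

Passing the minimum as a parameter is the whole point of the sketch: the window can only grow past 1 minute if that value stops being a constant.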
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348230#comment-14348230 ]

Jack Hu commented on SPARK-6061:

[~tianyi] Do you know why {{FileInputDStream.MIN_REMEMBER_DURATION}} was introduced in 1.2.1 (actually, it was introduced in 1.1.1/1.2.0)?
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348253#comment-14348253 ]

Yi Tian commented on SPARK-6061:

[~tdas] Could you explain {{FileInputDStream.MIN_REMEMBER_DURATION}}?
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346640#comment-14346640 ]

maji2014 commented on SPARK-6061:

The rules differ between 1.1.0 and 1.2.1. Currently, the modTimeIgnoreThreshold is the max of initialModTimeIgnoreThreshold and currentTime - durationToRemember.milliseconds, so when the newFilesOnly property is set to false, the value of currentTime - durationToRemember.milliseconds is the ignore threshold. The default value of MIN_REMEMBER_DURATION is 1 minute.
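The rule described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the Spark implementation: it assumes the initial threshold is the stream start time when newFilesOnly is true and 0 otherwise, that the remember duration equals the 1 minute minimum, and that a file is skipped when its modification time is at or before the threshold. The function names are hypothetical.

```python
# 1 minute, the value of MIN_REMEMBER_DURATION reported in this thread.
MIN_REMEMBER_DURATION_MS = 60_000


def mod_time_ignore_threshold(current_time_ms, start_time_ms, new_files_only,
                              duration_to_remember_ms=MIN_REMEMBER_DURATION_MS):
    """Max of the initial threshold and the trailing edge of the remember window."""
    # Assumption: the initial threshold is the start time only when newFilesOnly is true.
    initial_threshold = start_time_ms if new_files_only else 0
    return max(initial_threshold, current_time_ms - duration_to_remember_ms)


def is_selected(file_mod_time_ms, threshold_ms):
    """Assumption: files modified at or before the threshold are ignored."""
    return file_mod_time_ms > threshold_ms


now = 1_000_000_000
threshold = mod_time_ignore_threshold(now, start_time_ms=now, new_files_only=False)
# A file modified 2 minutes ago is dropped even though newFilesOnly is false.
print(is_selected(now - 120_000, threshold))
# A file modified 30 seconds ago is still picked up.
print(is_selected(now - 30_000, threshold))
```

Under this model, setting newFilesOnly to false only lowers the first argument of the max; the second argument still tracks the current time, which matches the behaviour reported in the issue.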
[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time
[ https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346894#comment-14346894 ]

Sean Owen commented on SPARK-6061:

[~jhu] {{newFilesOnly}} means old files are *not* included. It's a way to reduce, not increase, the number of files processed. Can you clarify the issue by summarizing your example: what happened, and what did you expect?