[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-30 Thread Emre Sevinç (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386774#comment-14386774
 ] 

Emre Sevinç commented on SPARK-6061:


Are there any plans to make the private val 
{{FileInputDStream.MIN_REMEMBER_DURATION}} configurable via some API?

It seems to be hard-coded as 1 minute in 
https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325,
 and this prevents files older than 1 minute from being processed.
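
One way such a setting could look is sketched below. This is only an illustration of the idea, not Spark's API: the key name `example.fileStream.minRememberSeconds` and the helper `min_remember_seconds` are hypothetical, and the only value taken from this thread is the 1-minute default.

```python
# Hypothetical sketch: read the remember duration from a settings map,
# falling back to the 1-minute value that is currently hard-coded.
DEFAULT_MIN_REMEMBER_SECONDS = 60

# Hypothetical key name, chosen for this example only.
CONF_KEY = "example.fileStream.minRememberSeconds"


def min_remember_seconds(conf):
    """Return the configured duration in seconds, or the default if unset."""
    raw = conf.get(CONF_KEY)
    if raw is None:
        return DEFAULT_MIN_REMEMBER_SECONDS
    value = int(raw)
    if value <= 0:
        raise ValueError(CONF_KEY + " must be a positive number of seconds")
    return value


print(min_remember_seconds({}))                  # 60
print(min_remember_seconds({CONF_KEY: "3600"}))  # 3600
```

With a setting like this, a job that needs to pick up files from the last hour would pass a larger value instead of being limited to the hard-coded minute.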

> File source dstream can not include the old file which timestamp is before 
> the system time
> --
>
> Key: SPARK-6061
> URL: https://issues.apache.org/jira/browse/SPARK-6061
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.2.1
> Reporter: Jack Hu
> Labels: FileSourceDStream, OlderFiles, Streaming
> Original Estimate: 1m
> Remaining Estimate: 1m
>
> The file source dstream (StreamingContext.fileStream) has a property named 
> "newFilesOnly" to include the old files. It worked fine with 1.1.0 but is 
> broken in 1.2.1: the older files are always ignored no matter what value is 
> set.
> Here is simple code to reproduce it:
> https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
> The reason is that the "modTimeIgnoreThreshold" in 
> FileInputDStream::findNewFiles is set to a time close to the system time 
> (Spark Streaming Clock time), so files older than this time are ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348253#comment-14348253
 ] 

Yi Tian commented on SPARK-6061:


[~tdas] Could you explain {{FileInputDStream.MIN_REMEMBER_DURATION}}?




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348230#comment-14348230
 ] 

Jack Hu commented on SPARK-6061:


[~tianyi]

Do you know why {{FileInputDStream.MIN_REMEMBER_DURATION}} was introduced in 
1.2.1 (actually, it was introduced in 1.1.1/1.2.0)?




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348160#comment-14348160
 ] 

Yi Tian commented on SPARK-6061:


In Spark 1.2.1, setting {{newFilesOnly}} to {{false}} means that this 
{{FileInputDStream}} handles not only incoming files but also files that 
arrived in the past 1 minute (not all the old files). This length of time is 
defined by {{FileInputDStream.MIN_REMEMBER_DURATION}}.
I think we should make this length of time configurable.
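
The effect can be illustrated with a small model of the selection rule. This is a sketch, not the Spark source: `select_files` is a hypothetical helper, the strict "newer than the threshold" comparison is an assumption, and the remember window is taken to be exactly the 1 minute mentioned above.

```python
# Model of the behaviour described above with newFilesOnly = false:
# only files modified within the remember window are picked up.
def select_files(files, current_time_ms, remember_ms=60_000):
    """Return names of files whose modification time is newer than the threshold."""
    threshold_ms = current_time_ms - remember_ms
    return [name for name, mod_time_ms in files if mod_time_ms > threshold_ms]


now_ms = 10 * 60 * 1000  # pretend the stream clock is at the 10-minute mark
files = [
    ("ten-seconds-old.txt", now_ms - 10_000),
    ("five-minutes-old.txt", now_ms - 300_000),
]

selected = select_files(files, now_ms)
print(selected)  # ['ten-seconds-old.txt']
```

The five-minute-old file is skipped even though old files were requested, which matches the behaviour reported in this issue.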





[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348093#comment-14348093
 ] 

Jack Hu commented on SPARK-6061:


[~srowen] 
The issue is: I want to process the old files in the file dstream, but the old 
files are ignored even when {{newFilesOnly}} is set to {{false}}.




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346894#comment-14346894
 ] 

Sean Owen commented on SPARK-6061:
--

[~jhu] {{newFilesOnly}} means old files are *not* included. It's a way to 
reduce, not increase, the number of files processed. Can you clarify the issue 
by summarizing your example: what happened, and what did you expect?




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346640#comment-14346640
 ] 

maji2014 commented on SPARK-6061:
-

The rules differ between 1.1.0 and 1.2.1. 
Currently, modTimeIgnoreThreshold is the max of 
"initialModTimeIgnoreThreshold" and "currentTime - 
durationToRemember.milliseconds", and "currentTime - 
durationToRemember.milliseconds" becomes the ignore threshold if the property 
"newFilesOnly" is set to false. The default value of MIN_REMEMBER_DURATION is 
1 minute.
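
As a rough model of this rule (a sketch, not the Spark source), the threshold can be computed as below. The value of the initial threshold is an assumption: here it is the stream start time when newFilesOnly is true and 0 otherwise, which is consistent with the second term winning when newFilesOnly is false. The function and parameter names are illustrative.

```python
# Model of: modTimeIgnoreThreshold =
#   max(initialModTimeIgnoreThreshold, currentTime - durationToRemember)
def mod_time_ignore_threshold(current_time_ms, duration_to_remember_ms,
                              new_files_only, stream_start_ms):
    # Assumption: the initial threshold is the stream start time when
    # new_files_only is true, and 0 otherwise.
    initial_threshold_ms = stream_start_ms if new_files_only else 0
    return max(initial_threshold_ms, current_time_ms - duration_to_remember_ms)


MIN_REMEMBER_DURATION_MS = 60 * 1000  # the 1-minute default mentioned above

now_ms = 1_000_000
threshold = mod_time_ignore_threshold(
    now_ms, MIN_REMEMBER_DURATION_MS, new_files_only=False, stream_start_ms=now_ms
)
print(threshold)  # 940000
```

Under these assumptions, files modified more than one minute before the current time fall below the threshold regardless of the newFilesOnly value.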

