[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386774#comment-14386774
 ] 

Emre Sevinç commented on SPARK-6061:


Are there any plans to make the private val {{FileInputDStream.MIN_REMEMBER_DURATION}} 
configurable via some API?

It seems to be hard-coded as 1 minute in 
https://github.com/apache/spark/blob/branch-1.2/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L325,
 and as a result files older than 1 minute are not processed.

 File source dstream can not include the old file which timestamp is before 
 the system time
 --

                Key: SPARK-6061
                URL: https://issues.apache.org/jira/browse/SPARK-6061
            Project: Spark
         Issue Type: Bug
         Components: Streaming
   Affects Versions: 1.2.1
           Reporter: Jack Hu
             Labels: FileSourceDStream, OlderFiles, Streaming
  Original Estimate: 1m
 Remaining Estimate: 1m

 The file source dstream (StreamingContext.fileStream) has a parameter named 
 newFilesOnly that controls whether old files are included. It worked fine 
 with 1.1.0 but is broken in 1.2.1: older files are always ignored no matter 
 what value is set.
 Here is simple code to reproduce it:
 https://gist.github.com/jhu-chang/1ee5b0788c7479414eeb
 The reason is that the modTimeIgnoreThreshold in 
 FileInputDStream::findNewFiles is set to a time close to the system time 
 (Spark Streaming Clock time), so files older than this time are ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348093#comment-14348093
 ] 

Jack Hu commented on SPARK-6061:


[~srowen] 
The issue is that I want to process old files in the file dstream, but the old 
files are ignored even when {{newFilesOnly}} is set to {{false}}.




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348160#comment-14348160
 ] 

Yi Tian commented on SPARK-6061:


In Spark 1.2.1, setting {{newFilesOnly}} to {{false}} means this 
{{FileInputDStream}} handles not only incoming files but also files that 
arrived in the past 1 minute (not all old files). That length of time is 
defined in {{FileInputDStream.MIN_REMEMBER_DURATION}}.
I think we should make this length of time configurable.
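
To illustrate the effect described above, here is a minimal sketch in Python. It is a simplified model, not Spark's code: the function name, the file names, and the exact boundary handling (strictly newer than the cutoff, not newer than the current time) are assumptions made for the example.

```python
from typing import Dict, List

# The 1-minute window mentioned in the comment (MIN_REMEMBER_DURATION).
REMEMBER_WINDOW_MS = 60_000


def files_picked_up(mod_times_ms: Dict[str, int], now_ms: int,
                    remember_window_ms: int = REMEMBER_WINDOW_MS) -> List[str]:
    """Hypothetical helper: names of files whose modification time falls
    inside the remember window that ends at now_ms."""
    cutoff = now_ms - remember_window_ms
    # Assumption: a file qualifies if it is strictly newer than the cutoff
    # and not newer than the current time.
    return sorted(name for name, t in mod_times_ms.items() if cutoff < t <= now_ms)


now = 10 * 60_000  # arbitrary "current time" in milliseconds
files = {
    "fresh.txt": now - 5_000,        # 5 seconds old
    "recent.txt": now - 45_000,      # 45 seconds old
    "old.txt": now - 5 * 60_000,     # 5 minutes old
}
picked = files_picked_up(files, now)
```

Under this model the 5-minute-old file is skipped even though the caller asked for old files, which matches the behaviour reported in this issue.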





[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Jack Hu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348230#comment-14348230
 ] 

Jack Hu commented on SPARK-6061:


[~tianyi]

Do you know why {{FileInputDStream.MIN_REMEMBER_DURATION}} was introduced in 
1.2.1 (actually, it was introduced in 1.1.1/1.2.0)? 




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348253#comment-14348253
 ] 

Yi Tian commented on SPARK-6061:


[~tdas] Could you explain {{FileInputDStream.MIN_REMEMBER_DURATION}}?




[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread maji2014 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346640#comment-14346640
 ] 

maji2014 commented on SPARK-6061:
-

The rules differ between 1.1.0 and 1.2.1. 
Currently, the modTimeIgnoreThreshold is the max of 
initialModTimeIgnoreThreshold and currentTime - 
durationToRemember.milliseconds, so currentTime - 
durationToRemember.milliseconds is the ignore threshold when the 
newFilesOnly property is set to false. The default value of 
MIN_REMEMBER_DURATION is 1 minute.
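
The threshold rule described above can be sketched as follows. This is a simplified Python model, not the Scala source: the function names are hypothetical, and the assumption that the initial threshold is the stream start time when newFilesOnly is true and 0 when it is false is mine.

```python
def initial_threshold(new_files_only: bool, start_time_ms: int) -> int:
    # Assumption: newFilesOnly = true pins the initial threshold to the stream
    # start time; newFilesOnly = false gives 0, i.e. no lower bound of its own.
    return start_time_ms if new_files_only else 0


def mod_time_ignore_threshold(initial_ms: int, current_time_ms: int,
                              duration_to_remember_ms: int) -> int:
    # The max of the two candidates, as described in the comment above.
    return max(initial_ms, current_time_ms - duration_to_remember_ms)


ONE_MINUTE_MS = 60_000      # default MIN_REMEMBER_DURATION per the comment
now = 1_000_000_000         # arbitrary "current time" in milliseconds

# newFilesOnly = false: the remember window alone decides the threshold.
threshold_all = mod_time_ignore_threshold(
    initial_threshold(False, now), now, ONE_MINUTE_MS)

# newFilesOnly = true: the start time dominates in the first batch.
threshold_new = mod_time_ignore_threshold(
    initial_threshold(True, now), now, ONE_MINUTE_MS)
```

With newFilesOnly set to false the initial threshold contributes nothing, so the result is always one remember window behind the current time, which is why files older than about 1 minute never qualify.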





[jira] [Commented] (SPARK-6061) File source dstream can not include the old file which timestamp is before the system time

2015-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346894#comment-14346894
 ] 

Sean Owen commented on SPARK-6061:
--

[~jhu] {{newFilesOnly}} means old files are *not* included. It's a way to 
reduce, not increase, the number of files processed. Can you clarify the issue 
by summarizing your example: what happened, and what did you expect?
