All,

While looking into this StackOverflow question <https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469>, I found what appears to be a bug in how FileInputDStream handles the newFilesOnly parameter. Before creating a ticket, I wanted to verify it here. The gist is that this code is wrong:

    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds  // trailing end of the remember window
    )

The problem is that if you set newFilesOnly to false, initialModTimeIgnoreThreshold is always 0, so it always drops out of the max operation. The best you can get is files that were put in the directory within the remember duration of the start.

Is this a bug or expected behavior? It seems like a bug to me. If I am correct, the fix appears to be bigger than just switching to min, as that would break other functionality.
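
To make the effect of the quoted max expression concrete, here is a minimal stand-alone sketch of it. It is not the Spark source: the object name, parameter names, and sample timestamps are made up for illustration, plain Long millisecond values stand in for the Spark types, and the rule that a file is skipped when its modification time is at or below the threshold is my assumption.

```scala
object ThresholdSketch {
  // Stand-alone version of the quoted expression. All arguments are plain
  // millisecond values; the parameter names are illustrative, not Spark's API.
  def modTimeIgnoreThreshold(initialThreshold: Long,
                             currentTime: Long,
                             rememberMillis: Long): Long =
    math.max(initialThreshold, currentTime - rememberMillis)

  // Assumption: a file is skipped when its modification time is at or
  // below the threshold.
  def isIgnored(modTime: Long, threshold: Long): Boolean =
    modTime <= threshold

  def main(args: Array[String]): Unit = {
    val currentTime = 1000000L   // made-up "now"
    val rememberMillis = 60000L  // made-up remember window (60 s)

    // newFilesOnly = false, so the initial threshold is 0 (per the post).
    val threshold = modTimeIgnoreThreshold(0L, currentTime, rememberMillis)
    println(threshold)                      // 940000
    println(isIgnored(500000L, threshold))  // true: the older file is still skipped
    println(isIgnored(990000L, threshold))  // false: the recent file is picked up
  }
}
```

With an initial threshold of 0, the result is always currentTime - rememberMillis, so under these assumptions any file older than the remember window is skipped even though newFilesOnly is false.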