Oh yes, this was a bug and it has been fixed. Check out the master branch!
https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC

TD

On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez <langel.gro...@gmail.com> wrote:

> I have a basic Spark Streaming job that watches a folder, processes any
> new file, and updates a column family in Cassandra using the new
> cassandra-spark-driver.
>
> I think there is a problem with StreamingContext.textFileStream... if I
> start my job in local mode with no files in the watched folder and then
> copy a bunch of files into it, Spark sometimes processes those files
> again and again, continually.
>
> I have noticed that this usually happens when Spark doesn't detect all
> the new files in one go: i.e. I copied 6 files and Spark detected 3 of
> them as new and processed them; then it detected the other 3 as new and
> processed them. After it finished processing all 6 files, it detected
> the first 3 files as new again and processed them... then the other
> 3... and again... and again... and again.
>
> Should I raise a JIRA issue?
>
> Regards,
>
> Luis
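For anyone hitting the same symptom, a minimal sketch of the kind of job Luis describes looks roughly like this. This assumes Spark Streaming's 1.x API; the directory path is a placeholder, and the `saveToCassandra` call in the comment stands in for whatever the cassandra-spark-driver actually exposes, so treat both as assumptions rather than a working pipeline:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FolderWatchJob {
  def main(args: Array[String]): Unit = {
    // Local mode, as in the report; two threads so the receiver and
    // the processing can run concurrently.
    val conf = new SparkConf().setMaster("local[2]").setAppName("folder-watch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // textFileStream monitors the directory and is supposed to emit
    // each new file exactly once; the SPARK-2362 bug discussed above
    // caused already-seen files to be re-emitted in later batches.
    val lines = ssc.textFileStream("/path/to/watched/folder")

    lines.foreachRDD { rdd =>
      // Transform and persist here; the Cassandra write below is a
      // hypothetical illustration, not the driver's confirmed API:
      // rdd.map(parse).saveToCassandra("keyspace", "column_family")
      println(s"batch record count: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the bug present, the batch record count printed above keeps repeating for files that were already processed, which matches the 3-then-3-then-repeat pattern Luis observed.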