This might not be the most straightforward approach, but one way would be to use *PairRDDFunctions*: it gives you a few methods to access the partitions and, from those, the filenames. Once you have a filename, you can delete the file after your operations are done. I'm not sure whether Spark has updated the API since, but you can give it a try.
Here's a snippet:

    UnionPartition upp = (UnionPartition) ds.values().getPartitions()[0];
    NewHadoopPartition npp = (NewHadoopPartition) upp.split();
    String fPath = npp.serializableHadoopSplit().value().toString();

Here fPath would be the name of the first file in the stream, and ds is a PairRDDFunctions.

Thanks
Best Regards

On Fri, Jan 30, 2015 at 11:37 PM, ganterm <gant...@gmail.com> wrote:

> We are running a Spark streaming job that retrieves files from a directory
> (using textFileStream).
> One concern we have is the case where the job is down but files are still
> being added to the directory. Once the job starts up again, those files
> are not picked up (since they are not new or changed while the job is
> running), but we would like them to be processed.
> Is there a solution for that? Is there a way to keep track of which files
> have been processed, and can we "force" older files to be picked up? Is
> there a way to delete the processed files?
>
> Thanks!
> Markus
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tracking-deleting-processed-files-tp21444.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
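For the "keep track of which files have been processed" part of the quoted question, a simple option outside the Spark API is an append-only ledger file of processed filenames that the driver consults on startup. This is only a sketch in plain Java (the class and method names are my own invention, not anything from Spark); in a streaming job you would call something like this from your output action once a batch's files are known:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers which input files were already handled,
// survives job restarts via a ledger file, and deletes handled inputs.
public class ProcessedFileLedger {
    private final Path ledger;                      // append-only list of processed names
    private final Set<String> seen = new HashSet<>();

    public ProcessedFileLedger(Path ledger) throws IOException {
        this.ledger = ledger;
        if (Files.exists(ledger)) {
            // Reload state on startup, so files processed before a crash are skipped.
            seen.addAll(Files.readAllLines(ledger, StandardCharsets.UTF_8));
        }
    }

    // True if this file has not been recorded as processed yet.
    public boolean isNew(Path file) {
        return !seen.contains(file.getFileName().toString());
    }

    // Record the file in the ledger, then delete it from the input directory.
    public void markProcessedAndDelete(Path file) throws IOException {
        String name = file.getFileName().toString();
        Files.write(ledger, Collections.singletonList(name), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        seen.add(name);
        Files.deleteIfExists(file);
    }
}
```

On restart, files still sitting in the directory but absent from the ledger are the ones that were "missed" while the job was down; you can feed those to the job yourself (e.g. by moving them back into the watched directory so they appear as new) before starting the stream.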