I'll answer in the context of Structured Streaming (the new streaming API built on DataFrames). When reading from files, the FileSource records which files are included in each batch inside the given checkpointLocation. If a failure occurs in the middle of a batch, the streaming engine will retry that batch the next time the query is restarted.
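As a minimal sketch of this setup (paths here are hypothetical placeholders; a running Spark cluster is assumed), a file-based streaming query with a checkpointLocation looks something like:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-stream-example")
  .getOrCreate()

// Read files as they appear in the input directory (text avoids
// needing an explicit schema for this sketch).
val input = spark.readStream.text("hdfs:///data/incoming")

// The checkpointLocation is where the FileSource records which files
// were assigned to each batch; on restart, an incomplete batch is retried.
val query = input.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output")
  .option("checkpointLocation", "hdfs:///data/checkpoints/file-stream")
  .start()

query.awaitTermination()
```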
If you are concerned about exactly-once semantics, you can get those too. The FileSink (i.e., using writeStream) writing out to something like Parquet does this automatically. If you are writing to something like a transactional database yourself, you can implement similar functionality. Specifically, you can record the partitionId and version that are provided by the open method <https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/ForeachWriter.html#open(long,%20long)> into the database in the same transaction that writes the data. That way, when you recover, you can avoid writing the same updates more than once.

On Wed, Oct 26, 2016 at 9:20 AM, Scott W <defy...@gmail.com> wrote:
> Hello,
>
> I'm planning to use the fileStream Spark Streaming API to stream data from
> HDFS. My Spark job would essentially process these files and post the
> results to an external endpoint.
>
> *How does the fileStream API handle checkpointing of the files it processed?*
> In other words, if my Spark job failed while posting the results to an
> external endpoint, I want that same original file to be picked up again and
> reprocessed.
>
> Thanks much!
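A rough sketch of that idempotent-writer pattern (the `Db` object and its methods are hypothetical stand-ins for your own transactional database layer, not a Spark API):

```scala
import org.apache.spark.sql.ForeachWriter

// Buffers a partition's rows and commits them together with a
// (partitionId, version) marker in one transaction, so a replayed
// batch can be detected and skipped on recovery.
class IdempotentWriter extends ForeachWriter[String] {
  private var partitionId: Long = _
  private var version: Long = _
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[String]

  // Returning false tells Spark to skip this (partition, version) pair,
  // e.g. because a previous attempt already committed it.
  override def open(partitionId: Long, version: Long): Boolean = {
    this.partitionId = partitionId
    this.version = version
    buffer.clear()
    !Db.alreadyCommitted(partitionId, version) // hypothetical lookup
  }

  override def process(value: String): Unit = buffer += value

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      // Write the rows AND the (partitionId, version) marker atomically.
      Db.commitBatch(partitionId, version, buffer.toSeq) // hypothetical
    }
  }
}
```

You'd attach it with something like `df.writeStream.foreach(new IdempotentWriter).start()`; the key design point is that the marker and the data share one transaction, so either both survive a failure or neither does.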