Our system works with RDDs generated from Hadoop files. It processes each record in a Hadoop file and, for a subset of those records, generates output that is written to an external system via RDD.foreach. There are no dependencies between the records being processed.
If writing to the external system fails (due to something specific to the record being written) and throws an exception, I see the following behavior:

1. Spark retries the entire partition (wasting time and effort), reaches the problem record, and fails again.
2. It repeats step 1 up to the default 4 attempts and then gives up. As a result, the rest of the records from that Hadoop file are never processed.
3. The executor where the 4th failure occurred is marked as failed and told to shut down, so I lose a core for processing the remaining Hadoop files, which slows down the entire job.

For this particular problem I know how to prevent the underlying exception, but I'd still like to get a handle on error handling for future situations that I haven't yet encountered. My goal is this: retry only the problem record (rather than starting over at the beginning of the partition) up to N times, then give up and move on to process the rest of the partition.

As far as I can tell, I need to supply my own retry behavior, and if I want to process the records after the problem record I have to swallow exceptions inside the foreach block (see the sketch at the end of this message).

My 2 questions are:

1. Is there anything I can do to prevent the executor from being shut down when a failure occurs?
2. Are there ways Spark can help me get closer to my goal of retrying only the problem record, without writing my own retry code and swallowing exceptions?

Regards,
Art
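
P.S. For concreteness, here is roughly the kind of per-record retry I imagine I'd have to write myself. This is only a sketch: writeToExternalSystem is a stand-in for our real write call, and maxRetries is the per-record limit I'd like; neither is anything Spark provides.

    import org.apache.spark.rdd.RDD
    import scala.util.control.NonFatal

    // Stand-in for our real call that writes one record to the external system.
    def writeToExternalSystem(record: String): Unit = {
      // ... actual write goes here ...
    }

    // Hand-rolled per-record retry: retry only the failing record up to
    // maxRetries times, then swallow the exception and keep going so the
    // rest of the partition is still processed and the task succeeds.
    def writeAllWithRetry(rdd: RDD[String], maxRetries: Int = 3): Unit = {
      rdd.foreach { record =>
        var attempts = 0
        var done = false
        while (!done) {
          try {
            writeToExternalSystem(record)
            done = true
          } catch {
            case NonFatal(e) if attempts < maxRetries =>
              attempts += 1   // retry just this record
            case NonFatal(e) =>
              // Give up on this record but don't rethrow, so Spark never
              // sees a task failure and never retries the whole partition.
              println(s"Skipping record after $maxRetries retries: ${e.getMessage}")
              done = true
          }
        }
      }
    }

This feels like boilerplate I'd rather not own, which is what prompted question 2 above.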