Our system works with RDDs generated from Hadoop files. It processes each
record in a Hadoop file and, for a subset of those records, writes output
to an external system via RDD.foreach. There are no dependencies between
the records being processed.
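
For context, our write path looks roughly like this (simplified sketch;
shouldBeWritten and writeToExternalSystem are stand-ins for our real code,
and the latter can throw for certain records):

    import org.apache.spark.{SparkConf, SparkContext}

    // Stand-ins for our real predicate and external client call.
    def shouldBeWritten(record: String): Boolean = record.nonEmpty
    def writeToExternalSystem(record: String): Unit = {
      // external client call; may throw for certain records
    }

    val sc  = new SparkContext(new SparkConf().setAppName("external-writer"))
    val rdd = sc.textFile("hdfs:///path/to/input")  // placeholder path

    rdd.foreach { record =>
      if (shouldBeWritten(record)) {
        writeToExternalSystem(record)  // an exception here fails the whole task
      }
    }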

If writing a record to the external system fails (because of the content of
that particular record) and throws an exception, I see the following behavior:

1. Spark retries the entire partition (wasting time and effort), reaches the
problem record, and fails again.
2. It repeats step 1 up to the default of 4 attempts and then gives up (see
the config snippet after this list). As a result, the rest of the records
from that Hadoop file are not processed.
3. The executor where the 4th failure occurred is marked as failed and told
to shut down, so I lose a core for processing the remaining Hadoop files,
which slows down the entire process.
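
As far as I can tell, the 4 attempts in step 2 come from
spark.task.maxFailures (default 4). I can tune that when building the
context, but it only changes how many times the whole task is retried, not
anything per record:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("external-writer")
      // give up immediately instead of re-running the partition 4 times
      .set("spark.task.maxFailures", "1")
    val sc = new SparkContext(conf)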

For this particular problem, I know how to prevent the underlying
exception, but I'd still like to get a handle on error handling for future
situations that I haven't yet encountered.

My goal is this:
Retry the problem record only (rather than starting over at the beginning
of the partition) up to N times, then give up and move on to process the
rest of the partition.
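
Something like this hypothetical helper is what I have in mind for the
per-record retry (try N times, return the last failure instead of throwing):

    import scala.util.{Failure, Try}

    // Run op up to maxAttempts times; return the first success or the last failure.
    def withRetries[T](maxAttempts: Int)(op: => T): Try[T] =
      Try(op) match {
        case Failure(_) if maxAttempts > 1 => withRetries(maxAttempts - 1)(op)
        case result                        => result
      }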

As far as I can tell, I need to supply my own retry behavior, and if I want
to process the records that come after the problem record, I have to swallow
exceptions inside the foreach block.
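
Concretely, I think I would end up with something like this (continuing the
sketches above, with withRetries, shouldBeWritten and writeToExternalSystem
as placeholders):

    import scala.util.Failure

    rdd.foreach { record =>
      if (shouldBeWritten(record)) {
        withRetries(maxAttempts = 3)(writeToExternalSystem(record)) match {
          case Failure(e) =>
            // Swallow the final failure so one bad record doesn't fail the task.
            System.err.println(s"Giving up on record after 3 attempts: $e")
          case _ => // written successfully
        }
      }
    }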

My 2 questions are:
1. Is there anything I can do to prevent the executor from being shut down
when a failure occurs?

2. Are there ways Spark can help me get closer to my goal of retrying only
the problem record, without writing my own retry code and swallowing
exceptions?

Regards,
Art
