> On 16 Feb 2017, at 18:34, Ji Yan <ji...@drive.ai> wrote:
> 
> Dear spark users,
> 
> Is there any mechanism in Spark that does not guarantee idempotent 
> behaviour? For example, for stragglers, the framework might start another 
> task, assuming the straggler is slow, while the straggler is still running. 
> This would be annoying when, say, the task is writing to a file: having the 
> same task running twice at the same time may corrupt the file. From the 
> documentation, I know that Spark's speculative execution mode is turned off 
> by default. Does anyone know of any other mechanism in Spark that may cause 
> a problem in a scenario like this?

 It's not so much "two tasks writing to the same file" as "two tasks writing 
to different places, with the work renamed into place at the end".

Speculation is the key case where there is more than one writer, though the 
writers do write to different directories; the Spark commit protocol 
guarantees that only the committed task gets its work into the final output.
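Speculative execution is controlled by a handful of Spark properties. A minimal spark-defaults.conf fragment to turn it on might look like this (the quantile and multiplier values shown are Spark's documented defaults, included only for illustration):

```
spark.speculation             true
spark.speculation.quantile    0.75
spark.speculation.multiplier  1.5
```

With these settings, once 75% of the tasks in a stage have finished, any task running 1.5x slower than the median is eligible for a speculative second attempt.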

Some failure modes *may* have more than one executor running the same work, 
right up to the point where the task commit operation is started. More 
specifically, a network partition may cause an executor to lose touch with 
the driver, and the driver to pass the same task on to another executor, 
while the existing executor keeps going. It's when that first executor tries 
to commit the data that you get a guarantee that its work doesn't get 
committed (no connectivity => no commit; connectivity resumed => the driver 
will tell the executor it has been aborted).

If you are working with files outside the task's working directory, then the 
outcome of a failure will be "undefined". The FileCommitProtocol lets you ask 
for a temp file which is rename()d to the destination in the commit. Use this 
and the files will only appear once the task is committed. Even then, there 
is a small but non-zero chance that the commit may fail partway through, in 
which case the outcome is, as they say, "undefined". Avoid that today by not 
manually adding custom partitions to data sources in your Hive metastore.
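The write-to-a-temp-file-then-rename-on-commit pattern described above can be sketched outside Spark. This is a minimal illustration of the idea only, assuming a POSIX filesystem where rename is atomic; `write_task_output` is a hypothetical helper, not the FileCommitProtocol API itself:

```python
import os
import tempfile

def write_task_output(final_path: str, data: str, commit: bool) -> bool:
    """Write to a temp file in the destination directory, then rename it
    into place only if the task is allowed to commit. Hypothetical sketch
    of the temp-file-then-rename pattern, not Spark's actual API."""
    dir_name = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        if commit:
            # Atomic on POSIX filesystems: final_path either holds the
            # complete output or does not exist at all.
            os.replace(tmp_path, final_path)
            return True
        # Task was aborted: discard the uncommitted work.
        os.remove(tmp_path)
        return False
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

An aborted (e.g. speculative) attempt simply deletes its temp file, so at most one attempt's output ever appears at the final path.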

Steve



