> On 16 Feb 2017, at 18:34, Ji Yan <ji...@drive.ai> wrote:
>
> Dear spark users,
>
> Is there any mechanism in Spark that does not guarantee idempotency? For
> example, with stragglers, the framework might start another copy of a task,
> assuming the straggler is slow, while the straggler is still running. This
> can be a problem when, say, the task is writing to a file: having two copies
> of the same task running at the same time may corrupt the file. From the
> documentation page, I know that Spark's speculative execution mode is turned
> off by default. Does anyone know of any other mechanism in Spark that may
> cause problems in a scenario like this?
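For context, speculation is governed by a small set of Spark configuration properties. The settings below are from the Spark configuration documentation of that era; treat the default values shown as an approximation, not gospel:

```
# spark-defaults.conf fragment (illustrative; check your Spark version's docs)
spark.speculation            false   # off by default
spark.speculation.multiplier 1.5     # how many times slower than the median a task must be
spark.speculation.quantile   0.75    # fraction of tasks that must finish before speculating
```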
It's not so much "two tasks writing to the same file" as "two tasks writing to different places, with the work renamed into place at the end". Speculation is the key case where there is more than one writer, though the writers do write to different directories; the Spark commit protocol guarantees that only the committed task gets its work into the final output.

Some failure modes *may* have more than one executor running the same work, right up to the point where the task commit operation is started. More specifically, a network partition may cause an executor to lose touch with the driver, and the driver to pass the same task on to another executor while the existing executor keeps going. It's when that first executor tries to commit its data that you get the guarantee that the work doesn't get committed (no connectivity => no commit; connectivity resumed => the driver will tell the executor it has been aborted).

If you are working with files outside of the task's working directory, then the outcome of a failure is undefined. The FileCommitProtocol lets you ask for a temp file which is rename()d to the destination when the task commits. Use this, and the files will only appear once the task is committed. Even then, there is a small but non-zero chance that the commit may fail partway through, in which case the outcome is, as they say, "undefined". Avoid that today by not manually adding custom partitions to data sources in your Hive metastore.

Steve

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
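The write-to-a-temp-file-then-rename pattern described above can be sketched in a framework-independent way. This is a minimal illustration of the idea, not Spark's actual FileCommitProtocol API; all class and method names here are invented for the example:

```python
# Sketch of the "write to a temp file, rename into place on commit" pattern.
# Two copies of a task can run concurrently against different temp files;
# only the one that commits gets its output renamed to the final path.
import os
import tempfile


class RenameCommitTask:
    """Writes task output to a private temp file; only commit() publishes it."""

    def __init__(self, final_path: str):
        self.final_path = final_path
        # Temp file lives in the destination directory so the final
        # rename stays within one filesystem (and so stays atomic).
        fd, self.temp_path = tempfile.mkstemp(
            dir=os.path.dirname(final_path) or ".")
        self._file = os.fdopen(fd, "w")

    def write(self, data: str) -> None:
        self._file.write(data)

    def commit(self) -> None:
        # Close, then atomically rename into the final location. A task
        # that fails or is aborted never reaches this point, so its
        # partial output never appears at final_path.
        self._file.close()
        os.replace(self.temp_path, self.final_path)

    def abort(self) -> None:
        # Discard the temp file; the final path is untouched.
        self._file.close()
        os.remove(self.temp_path)
```

With this shape, a speculative duplicate and the original can both run to completion; whichever one the coordinator tells to commit wins, and the other's abort leaves no trace at the destination.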