On 28 Apr 2016, at 22:59, Alexander Pivovarov <apivova...@gmail.com> wrote:

Spark works well with S3 (read and write). However, it's recommended to set 
spark.speculation to true (some tasks are expected to fail when you read a 
large S3 folder, so speculation should help).


I must disagree.


  1.  Speculative execution has more than one executor running the same task, 
with whoever finishes first winning.
  2.  However, "finishes first" is implemented in the output committer, by 
renaming the attempt's output directory to the final output directory: whoever 
renames first wins.
  3.  This relies on rename() being implemented in the filesystem client as an 
atomic transaction.
  4.  Unfortunately, S3 doesn't do renames. Instead, every file gets copied to 
one with the new name, then the old file is deleted; an operation that takes 
time O(data * files).

If you have more than one executor trying to commit the work simultaneously, 
your output will be a mess of both executions, without anything detecting and 
reporting it.
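
To make the failure mode concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark or Hadoop code: the object store is a dict, and the names copy_then_delete_rename, commit_interleaved, the _temporary/attempt_N paths and the schedule are all made up for illustration. It only assumes what is described above: "rename" is a per-file copy followed by a delete, and nothing stops two attempts from committing at the same time.

```python
def copy_then_delete_rename(store, src_dir, dest_dir):
    """Emulated directory rename: copy each object to its new key, delete the old one.

    Yields after every file, so another committer can run in between.
    This is the gap that an atomic rename() would not have.
    """
    for key in sorted(k for k in store if k.startswith(src_dir + "/")):
        # The copy silently overwrites whatever is already at the destination.
        store[dest_dir + key[len(src_dir):]] = store[key]
        del store[key]
        yield


def commit_interleaved(store, schedule, dest_dir="output"):
    """Run one committer per attempt, advancing them in the order given by schedule."""
    committers = {
        name: copy_then_delete_rename(store, "_temporary/" + name, dest_dir)
        for name in set(schedule)
    }
    for name in schedule:
        next(committers[name], None)  # move this attempt forward by one file


# Two speculative attempts of the same work, each with its own output files.
store = {
    "_temporary/attempt_0/part-0000": "attempt_0",
    "_temporary/attempt_0/part-0001": "attempt_0",
    "_temporary/attempt_1/part-0000": "attempt_1",
    "_temporary/attempt_1/part-0001": "attempt_1",
}

# Hypothetical timing: attempt_0 starts committing, attempt_1 overtakes it.
commit_interleaved(store, ["attempt_0", "attempt_1", "attempt_1", "attempt_0"])

for path in sorted(store):
    print(path, "<-", store[path])
# output/part-0000 <- attempt_1
# output/part-0001 <- attempt_0
```

The final directory holds one file from each attempt, and neither committer saw an error. With a truly atomic directory rename, the second commit would either fail or replace the whole directory, never half of it.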

Where did you find this recommendation to set speculation=true?

-Steve

see also: https://issues.apache.org/jira/browse/SPARK-10063
