[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2014-09-16 Thread Gregory Phillips (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135806#comment-14135806 ]

Gregory Phillips commented on SPARK-2984:
-----------------------------------------

[~aash] --

Thanks for bringing this to my attention. I was beginning to suspect something 
along these lines: my buckets are in US Standard, but my mental model of S3 was 
read-after-write consistency, so this incarnation of the issue is probably user 
error on my part.

I was actually able to accomplish what I was trying to do by retrying the 
failed tasks (they succeeded on subsequent attempts), which points to 
eventual consistency as the culprit here.
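
For illustration, here is a minimal sketch (hypothetical paths and settings, not 
taken from this report) of raising the per-task retry count so that a task 
hitting a transient missing-file error gets another attempt:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: allow more task attempts so a transient S3
// eventual-consistency miss can succeed on a later retry.
val conf = new SparkConf()
  .setAppName("retry-sketch")
  .set("spark.task.maxFailures", "8") // default is 4 attempts per task

val sc = new SparkContext(conf)
sc.textFile("s3n://example-bucket/input/*")
  .map(_.toUpperCase)
  .saveAsTextFile("s3n://example-bucket/output/run-1")
sc.stop()
{code}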

 FileNotFoundException on _temporary directory
 ----------------------------------------------

 Key: SPARK-2984
 URL: https://issues.apache.org/jira/browse/SPARK-2984
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Ash
Priority: Critical

 We've seen several stacktraces and threads on the user mailing list where 
 people are having issues with a {{FileNotFoundException}} stemming from an 
 HDFS path containing {{_temporary}}.
 I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
 error condition might manifest in this circumstance:
 1) task T starts on an executor E1
 2) it takes a long time, so task T' is started on another executor E2
 3) T finishes on E1, so it moves its data from {{_temporary}} to the final destination and deletes the {{_temporary}} directory during cleanup
 4) T' finishes on E2 and attempts to move its data from {{_temporary}}, but those files no longer exist, so an exception is thrown
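
 For illustration, a minimal sketch of the setup this theory describes: speculation enabled and output committed through {{saveAsTextFile}}. The settings and paths here are hypothetical, not taken from any of the reports below.
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 // Hypothetical sketch: speculative execution launches a backup copy T' of a
 // slow task T; both attempts later try to commit output out of _temporary.
 val conf = new SparkConf()
   .setAppName("speculation-sketch")
   .set("spark.speculation", "true")           // launch backup copies of slow tasks
   .set("spark.speculation.multiplier", "1.5") // how much slower than the median before speculating

 val sc = new SparkContext(conf)
 sc.textFile("hdfs:///example/input")
   .saveAsTextFile("hdfs:///example/output")   // the commit step moves files out of _temporary
 sc.stop()
 {code}
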
 Some samples:
 {noformat}
 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0
 java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist.
 at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
 at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
 at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
 at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
 at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
 at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
 at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
 at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
 at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
 at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
 at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
 at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
 at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
 at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 -- Chen Song at http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
 {noformat}
 I am running a Spark Streaming job that uses saveAsTextFiles to save results 
 into HDFS files. However, it has an exception after 20 batches:
 result-140631234/_temporary/0/task_201407251119__m_03 does not exist.
 {noformat}
 and
 {noformat}
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on 

[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory

2014-09-15 Thread Gregory Phillips (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134073#comment-14134073 ]

Gregory Phillips edited comment on SPARK-2984 at 9/15/14 4:43 PM:
------------------------------------------------------------------

I'm running into this as well. But to respond to this theory:

{quote}
I think this may be related to spark.speculation. I think the error condition 
might manifest in this circumstance:
1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes on E1, so it moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes on E2 and attempts to move its data from _temporary, but those 
files no longer exist, so an exception is thrown
{quote}

Speculation is not necessary for this to occur. I am consistently running into 
it while testing some code locally, without speculation, where I download, 
manipulate, and merge two sets of data from S3 and then serialize the resulting 
RDD back to S3 using saveAsTextFile (a sketch of the job shape follows the 
stack trace):

{code}
Job aborted due to stage failure: Task 3.0:754 failed 1 times, most recent failure: Exception failure in TID 762 on host localhost: java.io.FileNotFoundException: s3n://bucket/_temporary/_attempt_201409151537__m_000754_762/part-00754.deflate: No such file or directory.
        org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
        org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
        org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
        org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
        org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:786)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:769)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
{code}
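
For reference, a minimal sketch of the shape of the job described above. The 
bucket names, paths, and merge logic are hypothetical stand-ins, not the actual 
code from this report:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: read two S3 datasets, merge them, and write the result
// back to S3 with saveAsTextFile. The final write is where the _temporary
// commit step runs, which is where the exception above is raised.
val sc = new SparkContext(new SparkConf().setAppName("s3-merge-sketch"))

val left  = sc.textFile("s3n://example-bucket/dataset-a/*")
val right = sc.textFile("s3n://example-bucket/dataset-b/*")

val merged = left.union(right)   // stand-in for the real manipulate/merge logic
  .map(_.trim)
  .distinct()

merged.saveAsTextFile("s3n://example-bucket/merged-output")
sc.stop()
{code}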

I'm happy to provide more information or help investigate further to figure 
this one out.

Edit: I forgot to mention that the file in question actually did exist on S3 
when I checked after receiving this exception.

