Hi,

My team is using Spark 1.0.1, and the project we're working on needs to
compute exact numbers, which are then saved to S3 and reused later by
other Spark jobs to compute other numbers. The problem we noticed yesterday:
one of the output partition files in S3 (part-00218) was missing.
The problem occurred only once and cannot be reproduced. However, because of
this incident, our numbers may not be reliable.

From the Spark logs (from the cluster that generated the files with the
missing partition), we noticed some errors appearing multiple times:
- Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: s3://xxxxxxx/_temporary/_attempt_201501142002_0000_m_000368_12139/part-00368: No such file or directory.
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
        at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
        at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
        at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
        at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
        at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:785)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

And:
- WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3, ip-10-152-30-234.ec2.internal, 48973, 0) with no recent heart beats: 72614ms exceeds 45000ms

Questions:
- Do those errors explain why the output partition file was missing?
(We still see those errors in our logs, although the missing file occurred
only once.)
- Is there a way to detect data loss during runtime, and then stop our Spark
job completely ASAP if it happens?
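
To make the second question concrete, the kind of check we have in mind is a
post-write verification on the driver, sketched below in Python. This is only
a sketch under assumptions: `missing_partitions` and `check_output` are
hypothetical helpers (not Spark or Hadoop APIs), the list of keys is assumed
to come from whatever S3 listing call is available, and output files are
assumed to follow the part-NNNNN naming seen above.

```python
import re

# Assumption: output files are named like "part-00218", optionally with an
# extension such as ".gz". Other naming schemes would need a different pattern.
PART_RE = re.compile(r"(?:^|/)part-(\d+)(?:\.\w+)?$")


def missing_partitions(listed_keys, num_partitions):
    """Return the sorted partition indices that have no part file in listed_keys.

    Hypothetical helper: listed_keys is any iterable of object keys obtained
    from a listing of the output directory.
    """
    found = set()
    for key in listed_keys:
        # Files still under _temporary were never committed, so skip them.
        if "_temporary" in key:
            continue
        match = PART_RE.search(key)
        if match:
            found.add(int(match.group(1)))
    return [i for i in range(num_partitions) if i not in found]


def check_output(listed_keys, num_partitions):
    """Raise if any partition is missing, so the calling job can stop early."""
    missing = missing_partitions(listed_keys, num_partitions)
    if missing:
        raise RuntimeError("missing output partitions: %s" % missing)
```

Calling `check_output(keys, n)` right after the save, with `n` equal to the
number of partitions that were written, would fail the job before any
downstream job reads incomplete output. If the listing can lag behind the
writes, a missing key may need to be re-checked before it is treated as real
data loss.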

Thanks,
Nicolas



