Hi,

My team is using Spark 1.0.1, and our project needs to compute exact numbers that are saved to S3 and later reused by other Spark jobs to compute further numbers. Yesterday we noticed a problem: one of the output partition files in S3 was missing (a part-00218 file). The problem occurred only once and cannot be reproduced, but because of this incident our numbers may not be reliable.
From the Spark logs (on the cluster that generated the files with the missing partition), we noticed some errors appearing multiple times:

- Loss was due to java.io.FileNotFoundException
  java.io.FileNotFoundException: s3://xxxxxxx/_temporary/_attempt_201501142002_0000_m_000368_12139/part-00368: No such file or directory.
          at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:340)
          at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
          at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
          at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
          at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
          at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:785)
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
          at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:788)
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
          at org.apache.spark.scheduler.Task.run(Task.scala:51)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)

And:

- WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(3, ip-10-152-30-234.ec2.internal, 48973, 0) with no recent heart beats: 72614ms exceeds 45000ms

Questions:
- Do those errors explain why the output partition file was missing? (We still see these errors in our logs.)
- Is there a way to detect data loss at runtime and stop the Spark job as soon as possible when it happens?
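[Editor's note: one workaround, regardless of the root cause, is to verify the output after each save and abort before any downstream job reads it. Below is a minimal, Spark-independent sketch of that check; it assumes you can obtain a listing of the keys under the job's output prefix (e.g. via the AWS SDK) and that you know the partition count used for the save. The function names are illustrative, not from any Spark API.]

```python
import re

def missing_partitions(listed_keys, num_partitions):
    """Return the partition indices whose part-NNNNN file is absent.

    listed_keys: filenames found under the job's S3 output prefix.
    num_partitions: the RDD's partition count at the time of the save.
    """
    present = set()
    for key in listed_keys:
        # Hadoop's FileOutputCommitter names outputs part-00000, part-00001, ...
        m = re.search(r"part-(\d{5})$", key)
        if m:
            present.add(int(m.group(1)))
    return sorted(set(range(num_partitions)) - present)

# Example: a save with 400 partitions where part-00218 never appeared.
keys = ["part-%05d" % i for i in range(400) if i != 218]
gaps = missing_partitions(keys, 400)
print(gaps)  # [218] -> raise/abort here so no downstream job consumes the data
```

A stricter variant could also compare per-file sizes or record counts against counters from the job, but even this presence check would have caught the missing part-00218.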
Thanks,
Nicolas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Missing-output-partition-file-in-S3-tp21326.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.