Hi,

Our Spark Streaming app pulls data from Kafka with a 1-hour batch duration. Each batch aggregates the data by specific keys and stores the resulting RDDs to HDFS in the transform phase. We have tried a checkpoint of 7 days on the Kafka DStream to ensure that the generated stream does not expire or get lost.

The first hour completes, but every succeeding hour fails with this exception:

Job aborted due to stage failure: Task 39.0:1 failed 64 times, most recent failure: Exception failure in TID 27578 on host X.ec2.internal: java.io.FileNotFoundException: /data/run/spark/work/spark-local-20140911175744-4ddf/0d/shuffle_3_1_311 (No such file or directory)
    java.io.FileOutputStream.open(Native Method)
    java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
    org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
    org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
    org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)

Environment:
    CDH version: 2.3.0-cdh5.1.0
    Spark version: 1.0.0-cdh5.1.0

Spark settings:
    spark.io.compression.codec : org.apache.spark.io.SnappyCompressionCodec
    spark.serializer : org.apache.spark.serializer.KryoSerializer
    spark.kryoserializer.buffer.mb : 2
    spark.local.dir : /data/run/spark/work/
    spark.scheduler.mode : FAIR
    spark.rdd.compress : false
    spark.task.maxFailures : 64
    spark.shuffle.use.netty : false
    spark.shuffle.spill : true
    spark.streaming.checkpoint.dir : hdfs://X.ec2.internal:8020/user/spark/checkpoints/event-storage
    spark.akka.threads : 4
    spark.cores.max : 4
    spark.executor.memory : 3g
    spark.shuffle.consolidateFiles : false
    spark.streaming.unpersist : true
    spark.logConf : true
    spark.shuffle.spill.compress : true

Thanks,
JL

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-tp14027.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.