Constantin created SPARK-21288:
----------------------------------

Summary: Several files are missing in the results of the execution of the spark application.
Key: SPARK-21288
URL: https://issues.apache.org/jira/browse/SPARK-21288
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 1.6.0
Environment:
    cloudera: Cloudera Express 5.10.0
    java: HotSpot 1.8.0_77
    spark: spark-core_2.10-1.6.0-cdh5.7.0.jar
    hadoop: 2.6.0-cdh5.7.0 from c00978c67b0d3fe9f3b896b5030741bd40bf541
    hdfs: 2.6.0-cdh5.7.0 from rc00978c67b0d3fe9f3b896b5030741bd40bf541a
    yarn: 2.6.0-cdh5.7.0 from c00978c67b0d3fe9f3b896b5030741bd40bf541a
Reporter: Constantin

The Spark application does not save all files to the output folder. For example, only the files from 'part-r-00101.avro' to 'part-r-00127.avro' are present, but there should be files from 'part-r-00000.avro' to 'part-r-00127.avro'.

It looks like all files were stored in _temporary/..., but when the time came to move the results to the output folder, the files had disappeared from _temporary. In the execution logs I saw that all tasks were committed with FileOutputCommitter. There were no task preemptions and no speculation.

Saving to HDFS is done like this:

{code:scala}
rdd
  .map(v => new AvroKey[V](v) -> null)
  .saveAsNewAPIHadoopFile(
    directory,
    classOf[AvroKey[V]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[V]],
    createJob().getConfiguration
  )
{code}
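
To make the symptom easier to quantify, here is a minimal sketch that reports which partition indices have no part file in a listing of the output folder. It is not part of the failing job. The object `MissingParts`, the helper `missingParts`, the `part-r-NNNNN.avro` naming pattern and the expected count of 128 are assumptions inferred from the file names quoted above. The file names would come from a listing of the output directory (for example via `hdfs dfs -ls`); here they are simulated.

```scala
// Hypothetical diagnostic helper; names and the expected partition count are assumptions.
object MissingParts {
  // Matches names such as "part-r-00101.avro" and captures the partition index.
  private val PartFile = """part-r-(\d+)\.avro""".r

  /** Returns the partition indices in [0, expected) that have no matching file name. */
  def missingParts(fileNames: Seq[String], expected: Int): Seq[Int] = {
    // Non-matching entries such as "_SUCCESS" are ignored by the pattern match.
    val present = fileNames.collect { case PartFile(idx) => idx.toInt }.toSet
    (0 until expected).filterNot(present.contains)
  }

  def main(args: Array[String]): Unit = {
    // Simulates the reported outcome: only parts 101 to 127 reached the output folder.
    val listed = (101 to 127).map(i => f"part-r-$i%05d.avro") :+ "_SUCCESS"
    val missing = missingParts(listed, expected = 128)
    println(s"${missing.size} part files missing, first: ${missing.headOption}")
  }
}
```

Running the same check against a listing of the _temporary directory just before job commit could show whether the files are lost before or during the move to the output folder.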