I've observed a number of cases where Spark does not clean up HDFS side effects on errors, especially out-of-memory conditions. Here is an example from the following code snippet, executed in spark-shell:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.SaveMode

    val ctx = sqlContext.asInstanceOf[HiveContext]
    import ctx.implicits._

    ctx.
      jsonFile("file:///test_data/*/*/*/*.gz").
      saveAsTable("test_data", SaveMode.Overwrite)

First run: saveAsTable terminates with an out-of-memory exception.

Second run (with more RAM for the driver & executor): fails with many variations of

    java.lang.RuntimeException: hdfs://localhost:54310/user/hive/warehouse/test_data/_temporary/0/_temporary/attempt_201507171538_0008_r_000021_0/part-r-00022.parquet is not a Parquet file (too small)

Third run (after hdfs dfs -rm -r hdfs:///user/hive/warehouse/test_data): succeeds.

What are the best practices for dealing with these types of cleanup failures? Do they tend to come in known varieties?

Thanks,
Sim
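In case it helps frame the question: the manual workaround from the third run can also be done programmatically before a retry, using Hadoop's FileSystem API from spark-shell. This is just a sketch of that workaround, not a real fix — the warehouse path is the one from the error message above and will differ per deployment, and it assumes the stale table directory is safe to drop:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Pre-flight cleanup before retrying saveAsTable: recursively delete
    // leftovers (including the _temporary dir) from the failed attempt.
    // `sc` is the SparkContext provided by spark-shell.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val stale = new Path("hdfs:///user/hive/warehouse/test_data")
    if (fs.exists(stale)) {
      fs.delete(stale, true) // true = recursive
    }

This obviously doesn't answer whether Spark should be doing this itself on task failure, which is the real question.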
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cleanup-when-tasks-generate-errors-tp23890.html Sent from the Apache Spark User List mailing list archive at Nabble.com.