[ https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simeon Simeonov updated SPARK-9345:
-----------------------------------
Description:

When using spark-shell in local mode, I've observed the following behavior on a number of nodes:

# Some operation generates an exception related to Spark SQL processing via {{HiveContext}}.
# From that point on, nothing can be written to Hive with {{saveAsTable}}.
# Another identically-configured version of Spark on the same machine may not exhibit the problem initially but, after enough exceptions, it starts exhibiting the problem as well.
# A new identically-configured installation of the same version on the same machine exhibits the problem.

The error is always an inability to create a temporary folder on HDFS:

{code}
15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_000001_0 (exists=false, cwd=file:/home/ubuntu)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
    at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
    at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
...
{code}

The behavior does not seem related to HDFS itself, as it persists even after the HDFS volume is reformatted. It is difficult to reproduce reliably but consistently observable with sufficient Spark SQL experimentation (dozens of exceptions arising from Spark SQL processing). The likelihood of it happening goes up substantially if a Spark SQL operation runs out of memory, which suggests that the problem is related to cleanup.

In this gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) you can see how, on the same machine, identically configured 1.3.1 and 1.4.1 installations sharing the same HDFS system and Hive metastore behave differently: 1.3.1 can write to Hive; 1.4.1 cannot. The behavior started on 1.4.1 after an out-of-memory exception on a large job.
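For readers unfamiliar with the flow described in the list above, here is a minimal sketch of the kind of spark-shell sequence involved. Only the table name {{test}} comes from the stack trace; the sample data and column names are invented for illustration, and the final write is the step that fails once a prior exception has occurred in the session.

{code}
// Illustrative sketch only (spark-shell, Spark 1.4.x, local mode, built with Hive support,
// so the shell's sqlContext is a HiveContext). The table name "test" matches the stack
// trace above; the sample data and column names are made up for illustration.
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")

// Step 2 of the list above: once some earlier Spark SQL operation has thrown,
// writes like this one fail with
// "Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/...".
df.write.saveAsTable("test")
{code}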
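One detail visible in the stack trace: the failed path and the task's working directory both carry the {{file:}} scheme ({{file:/user/hive/warehouse/test/...}}, {{cwd=file:/home/ubuntu}}) rather than {{hdfs:}}. A hedged diagnostic sketch, not part of the original report, for checking which filesystem and warehouse location the running shell actually resolves:

{code}
// Diagnostic sketch (assumption: run in a spark-shell 1.4.x session where sqlContext
// is a HiveContext). Prints the default Hadoop filesystem and the Hive warehouse
// directory the session resolves, since the trace shows file:/ paths instead of hdfs:/.
import org.apache.hadoop.fs.FileSystem

val hadoopConf = sc.hadoopConfiguration
println("default filesystem: " + FileSystem.get(hadoopConf).getUri)
println("hive.metastore.warehouse.dir: " +
  sqlContext.getConf("hive.metastore.warehouse.dir", "<unset>"))
{code}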
> Failure to cleanup on exceptions causes persistent I/O problems later on
> ------------------------------------------------------------------------
>
>                 Key: SPARK-9345
>                 URL: https://issues.apache.org/jira/browse/SPARK-9345
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.4.0, 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: Simeon Simeonov