[ https://issues.apache.org/jira/browse/SPARK-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simeon Simeonov updated SPARK-9345:
-----------------------------------
    Description: 
When using spark-shell in local mode, I've observed the following behavior on a 
number of nodes:

# Some operation generates an exception related to Spark SQL processing via 
{{HiveContext}}.
# From that point on, nothing can be written to Hive with {{saveAsTable}} (see the sketch below this list).
# Another identically-configured version of Spark on the same machine may not exhibit the problem initially, but after enough exceptions it starts exhibiting the problem as well.
# A new, identically-configured installation of the same version on the same machine also exhibits the problem.
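
For context, a minimal sketch of the write path involved. This is not the exact code from the gist referenced below: the sample data is purely illustrative, the table name {{test}} is taken from the path in the stack trace, and the 1.4-style writer API is assumed.

{code}
// Minimal sketch of the failing flow in spark-shell (1.4-style writer API).
// The data is illustrative; only the saveAsTable call matters here.
import sqlContext.implicits._

case class Rec(id: Int, value: String)
val df = sc.parallelize(1 to 1000).map(i => Rec(i, "v" + i)).toDF()

// After some earlier Spark SQL operation has failed with an exception
// (e.g. an OOM on a large job), this write starts failing with
// "Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/...".
df.write.mode("overwrite").saveAsTable("test")
{code}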

The error is always related to the inability to create a temporary folder on HDFS:

{code}
15/07/25 16:03:35 ERROR InsertIntoHadoopFsRelation: Aborting task.
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test/_temporary/0/_temporary/attempt_201507251603_0001_m_000001_0 (exists=false, cwd=file:/home/ubuntu)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
        at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
        at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:83)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anon$4.newInstance(newParquet.scala:229)
        at org.apache.spark.sql.sources.DefaultWriterContainer.initWriters(commands.scala:470)
        at org.apache.spark.sql.sources.BaseWriterContainer.executorSideSetup(commands.scala:360)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:172)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
        at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:160)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
        ...
{code}

The behavior does not seem to be related to HDFS itself, as it persists even if the HDFS volume is reformatted.
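
Notably, the paths in the exception use the {{file:}} scheme ({{cwd=file:/home/ubuntu}}), so it may be worth confirming which filesystem and warehouse directory the shell actually resolves to. A diagnostic sketch for spark-shell, using standard Hadoop/Hive configuration keys (nothing here is specific to this installation):

{code}
// Diagnostic sketch: print the default filesystem and warehouse directory
// the current session resolves to. Standard Hadoop/Hive configuration keys.
import org.apache.hadoop.fs.FileSystem

val hadoopConf = sc.hadoopConfiguration
println("fs.defaultFS        = " + hadoopConf.get("fs.defaultFS", "<unset>"))
println("resolved default FS = " + FileSystem.get(hadoopConf).getUri)
println("hive warehouse dir  = " +
  sqlContext.getConf("hive.metastore.warehouse.dir", "<unset>"))
{code}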

The behavior is difficult to reproduce reliably, but it shows up consistently given a sufficient volume of Spark SQL experimentation. The likelihood of it happening goes up substantially if a Spark SQL operation runs out of memory, which suggests that the problem is related to cleanup after exceptions.
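
If cleanup is indeed the issue, a stale {{_temporary}} directory would presumably be left under the target table path after a failed write. A hypothetical way to check for (and, if desired, remove) it from the shell, using the path from the stack trace above; this is a sketch under that assumption, not a verified workaround:

{code}
// Hypothetical check under the cleanup hypothesis: look for (and optionally
// remove) a stale _temporary directory under the target table path. The path
// below is the one from the stack trace; adjust for your warehouse location.
import org.apache.hadoop.fs.{FileSystem, Path}

val tablePath = new Path("file:/user/hive/warehouse/test")
val fs = tablePath.getFileSystem(sc.hadoopConfiguration)
val tmp = new Path(tablePath, "_temporary")

if (fs.exists(tmp)) {
  fs.listStatus(tmp).foreach(s => println(s.getPath))
  // fs.delete(tmp, true)  // uncomment to remove the stale directory
}
{code}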

This gist ([https://gist.github.com/ssimeonov/72a64947bc33628d2d11]) shows how, on the same machine, identically configured 1.3.1 and 1.4.1 installations sharing the same HDFS file system and Hive metastore behave differently: 1.3.1 can write to Hive, while 1.4.1 cannot. The behavior started on 1.4.1 after an out-of-memory exception on a large job.



> Failure to cleanup on exceptions causes persistent I/O problems later on
> ------------------------------------------------------------------------
>
>                 Key: SPARK-9345
>                 URL: https://issues.apache.org/jira/browse/SPARK-9345
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.1, 1.4.0, 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: Simeon Simeonov
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
