[ https://issues.apache.org/jira/browse/SPARK-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175280#comment-15175280 ]
Constantijn commented on SPARK-8413:
------------------------------------

I'm having the same issue when writing data to gs:// (Google Cloud's S3 equivalent) from Spark using the DirectParquetOutputCommitter. https://groups.google.com/forum/#!topic/cloud-dataproc-discuss/jNP7fkJdD5A

If there's any extra data I can provide to help solve this issue, let me know.

> DirectParquetOutputCommitter doesn't clean up the file on task failure
> ----------------------------------------------------------------------
>
>                 Key: SPARK-8413
>                 URL: https://issues.apache.org/jira/browse/SPARK-8413
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Mingyu Kim
>            Priority: Critical
>
> Here are the steps that lead to the failure.
> 1. Write a DataFrame using DirectParquetOutputCommitter.
> 2. The 1st attempt fails during the writes, e.g. at
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L355
> 3. There is no clean-up logic on task failure, so the parquet part file
> written by the failed task is left half-written.
> 4. The 2nd attempt fails with the following exception because the target file
> already exists.
> {noformat}
> 2015-06-15T15:37:32.703 WARN [task-result-getter-2] org.apache.spark.scheduler.TaskSetManager - Lost task 56.1 in stage 7.0 (TID 73125, <REDACTED>): java.io.IOException: File already exists:s3://<REDACTED>
> 	at <REDACTED>
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731)
> 	at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
> 	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
> 	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:350)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:371)
> 	at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:371)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)