[ https://issues.apache.org/jira/browse/SPARK-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-8014:
-----------------------------------

    Assignee: Cheng Lian  (was: Apache Spark)

> DataFrame.write.mode("error").save(...) should not scan the output folder
> -------------------------------------------------------------------------
>
>                 Key: SPARK-8014
>                 URL: https://issues.apache.org/jira/browse/SPARK-8014
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: Jianshi Huang
>            Assignee: Cheng Lian
>
> When saving a DataFrame with {{ErrorIfExists}} as the save mode, we shouldn't do metadata discovery if the destination folder already exists. The same applies to {{SaveMode.Overwrite}} and {{SaveMode.Ignore}}.
> To reproduce this issue, create an empty directory {{/tmp/foo}}, leave an empty file {{bar}} in it, and then execute the following code in the Spark shell:
> {code}
> import sqlContext._
> import sqlContext.implicits._
> Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
> {code}
> From the exception stack trace we can see that the metadata discovery code path is executed:
> {noformat}
> java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
> 	at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
> 	at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
> 	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
> 	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
> 	at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
> 	at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
> 	at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
> 	at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
> 	...
> Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
> 	at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
> 	at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
> 	at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
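The short-circuit the description asks for can be sketched as a save-mode dispatch that consults the destination *before* any footer scan. This is a minimal illustrative sketch only, not Spark's actual {{DataFrameWriter}} code: the {{Mode}} objects, {{discoverMetadata}}, and {{save}} below are hypothetical stand-ins for the real save path.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical stand-ins for the org.apache.spark.sql.SaveMode values.
sealed trait Mode
case object ErrorIfExists extends Mode
case object Overwrite extends Mode
case object Ignore extends Mode
case object Append extends Mode

var discoveryRan = false

// Stand-in for the expensive footer scan
// (parquet.hadoop.ParquetFileReader.readAllFootersInParallel in the trace above).
def discoverMetadata(path: String): Unit = discoveryRan = true

def save(path: String, mode: Mode): Unit = {
  val exists = Files.exists(Paths.get(path))
  (mode, exists) match {
    case (ErrorIfExists, true) => sys.error(s"path $path already exists") // fail fast, no scan
    case (Ignore, true)        => ()                     // no-op, nothing to scan
    case (Overwrite, true)     => ()                     // destination is replaced wholesale, no scan
    case (Append, true)        => discoverMetadata(path) // only appending needs the existing schema
    case (_, false)            => ()                     // fresh write, nothing to discover
  }
}
```

The point of the dispatch is that only {{Append}} over an existing path genuinely needs the old files' schema; every other (mode, exists) combination can be decided from a single existence check.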