[jira] [Comment Edited] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027987#comment-16027987 ]

Hyukjin Kwon edited comment on SPARK-19809 at 5/29/17 3:21 AM:
---------------------------------------------------------------

Yea, I agree that whether it is malformed or not should depend on the format specification/implementation. I think Parquet itself treats 0-byte files as malformed, because it has to read the footer and, as far as I know, throws an exception instead. In the former case, the whole partitions appear to be filtered out in {{FileSourceScanExec}}.

Parquet requires reading the footers and throws an exception. For example, I manually updated the code path to not skip the partitions, so that the Parquet reader is actually called, as below:

{code}
java.lang.RuntimeException: file:/.../tmp.abc is not a Parquet file (too small)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:466)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:568)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:492)
	at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:166)
	at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:147)
{code}

If we don't specify the schema, it also throws an exception, as below:

{code}
spark.read.parquet(".../tmp.abc").show()
{code}

{code}
java.io.IOException: Could not read footer for file: FileStatus{path=file:/.../tmp.abc; isDirectory=false; length=0; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:498)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:485)
	at scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
	at scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
	at scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
{code}

Assuming it is treated as a malformed file for the current status (per the ORC JIRA you pointed out above), it sounds like we should be able to skip it on the client side, whether or not it is dealt with via {{spark.sql.files.ignoreCorruptFiles}}. For example, I found related JIRAs - https://issues.apache.org/jira/browse/AVRO-1530 and https://issues.apache.org/jira/browse/HIVE-11977. _If I read these correctly_, Avro decided not to change the behaviour, but Hive deals with it.

Only for this issue, I also agree that this could be a subset of the issues you pointed out.
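As a minimal sketch of the client-side skipping discussed above, assuming a hypothetical directory {{/tmp/parquet_dir}} into which an external tool drops a zero-byte file, the {{spark.sql.files.ignoreCorruptFiles}} option can be exercised from the shell like this (exact behaviour may differ across Spark versions):

{code}
// Hypothetical setup: one valid Parquet output plus a zero-byte file
// created externally, e.g. `touch /tmp/parquet_dir/zero.parquet`.
spark.range(10).write.mode("overwrite").parquet("/tmp/parquet_dir")

// Ask Spark to skip files it cannot read instead of failing the query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// The zero-byte file is skipped; only rows from readable files are returned.
spark.read.parquet("/tmp/parquet_dir").count()
{code}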
[jira] [Comment Edited] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027476#comment-16027476 ]

Dongjoon Hyun edited comment on SPARK-19809 at 5/27/17 4:18 PM:
---------------------------------------------------------------

[~hyukjin.kwon]. I don't think so. Parquet does not need the `spark.sql.files.ignoreCorruptFiles` option.

{code}
scala> sql("create table empty_parquet(a int) stored as parquet location '/tmp/empty_parquet'").show
++
||
++
++

$ touch /tmp/empty_parquet/zero.parquet

scala> sql("select * from empty_parquet").show
+---+
|  a|
+---+
+---+
{code}

You can test this in Spark with SPARK-20728.

{code}
scala> sql("create table empty_orc2(a int) using orc location '/tmp/empty_orc'").show
++
||
++
++

scala> sql("select * from empty_orc2").show
+---+
|  a|
+---+
+---+
{code}

I think this is a part of SPARK-20901, and the ORC community will handle it; what we need is just to use the latest ORC. One thing I'm wondering about is that this is tracked in https://issues.apache.org/jira/browse/ORC-162, which is still Open.
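For reference, SPARK-20728 makes the ORC implementation configurable; a sketch of testing it, assuming the {{spark.sql.orc.impl}} configuration proposed there (the exact config name, values, and default may vary by Spark version):

{code}
// Assumed configuration from SPARK-20728: "native" selects the newer
// ORC reader, "hive" the old Hive-based one.
spark.conf.set("spark.sql.orc.impl", "native")

// With the native reader, a location containing a zero-byte file is
// expected to yield an empty result rather than an exception.
sql("create table empty_orc2(a int) using orc location '/tmp/empty_orc'")
sql("select * from empty_orc2").show()
{code}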
> NullPointerException on empty ORC file
> ---------------------------------------
>
>                 Key: SPARK-19809
>                 URL: https://issues.apache.org/jira/browse/SPARK-19809
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.6.3, 2.0.2, 2.1.1
>            Reporter: Michał Dawid
>
> When reading from a Hive ORC table, if there are some 0-byte files we get a NullPointerException:
> {code}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
> 	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
> 	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> 	at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
> {code}
[jira] [Comment Edited] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026515#comment-16026515 ]

Dongjoon Hyun edited comment on SPARK-19809 at 5/26/17 5:09 PM:
---------------------------------------------------------------

IMO, we had better be more robust about this. Third-party tools (Pig and Sqoop have been reported) sometimes introduce these issues.

{code}
scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
++
||
++
++

$ touch /tmp/empty_orc/zero.orc

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
	at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
{code}
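Until a fixed ORC is picked up, one possible client-side workaround (a sketch only; the directory and table come from the reproduction above) is to delete zero-length files with the Hadoop FileSystem API before querying, so {{OrcInputFormat}} never sees them:

{code}
// Sketch of a client-side workaround, assuming the /tmp/empty_orc
// table directory above has been polluted with zero-byte files.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dir = new Path("/tmp/empty_orc")

// Remove every zero-length file before the split computation runs.
fs.listStatus(dir)
  .filter(s => s.isFile && s.getLen == 0)
  .foreach(s => fs.delete(s.getPath, false))

sql("select * from empty_orc").show()
{code}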