Cheng Lian created PARQUET-893:
----------------------------------

             Summary: GroupColumnIO.getFirst() doesn't check for empty groups
                 Key: PARQUET-893
                 URL: https://issues.apache.org/jira/browse/PARQUET-893
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.8.1
            Reporter: Cheng Lian


The following Spark 2.1 snippet reproduces this issue:

{code}
import org.apache.spark.sql.types._
import spark.implicits._  // pre-imported in spark-shell; needed for .toDF() elsewhere

val path = "/tmp/parquet-test"

case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)

val df = Seq(Outer(Inner(1), 1)).toDF()

df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)

df.write.mode("overwrite").parquet(path)

val requestedSchema =
  new StructType().
    add("f0", new StructType().
      // This nested field name differs from the original one
      add("f01", IntegerType)).
    add("f1", IntegerType)

println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)

spark.read.schema(requestedSchema).parquet(path).show()
{code}

In the above snippet, {{requestedSchema}} is compatible with the schema of the 
written Parquet file, but the following exception is thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/parquet-test/part-00007-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
        at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
        at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
        at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
        at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
        at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
        at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
        at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
        at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
        ... 21 more
{noformat}

Judging from this stack trace, {{GroupColumnIO.getFirst()}} [doesn't properly check for empty groups|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/io/GroupColumnIO.java#L103]: because the requested nested field ({{f01}}) doesn't exist in the file, the projected group {{f0}} ends up with no children, and calling {{get(0)}} on the empty child list throws the {{IndexOutOfBoundsException}} shown above.
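To make the failure mode concrete, here is a simplified, self-contained sketch (not the actual parquet-mr code; the class and method names are hypothetical). It contrasts an unguarded {{get(0)}} lookup, as in the linked {{GroupColumnIO.getFirst()}}, with a guarded variant that surfaces an empty group explicitly instead of failing deep inside record-reader construction:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class GetFirstSketch {
    // Unguarded lookup, modeled on GroupColumnIO.getFirst(): on an empty
    // projected group, children.get(0) throws IndexOutOfBoundsException
    // (reported as "Index: 0, Size: 0" on Java 8).
    static <T> T unguardedGetFirst(List<T> children) {
        return children.get(0);
    }

    // Hypothetical guarded variant: make the empty-group case explicit so
    // callers can fail with a meaningful error (or skip the column) instead.
    static <T> Optional<T> guardedGetFirst(List<T> children) {
        return children.isEmpty() ? Optional.empty() : Optional.of(children.get(0));
    }

    public static void main(String[] args) {
        List<String> emptyGroup = new ArrayList<>();
        List<String> group = Arrays.asList("leaf");

        System.out.println(guardedGetFirst(emptyGroup).isPresent()); // false
        System.out.println(guardedGetFirst(group).get());            // leaf

        try {
            unguardedGetFirst(emptyGroup);
        } catch (IndexOutOfBoundsException e) {
            // Same exception type as in the reported stack trace.
            System.out.println("empty group blows up, as in PARQUET-893");
        }
    }
}
```

An actual fix would likely live in parquet-mr itself, either by rejecting empty projected groups when the column IO tree is built, or by having {{getFirst()}} handle the empty case with a clear error message.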




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
