GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/14388
[SPARK-16362][SQL] Support ArrayType and StructType in vectorized Parquet reader

## What changes were proposed in this pull request?

The vectorized Parquet reader currently doesn't support complex types such as ArrayType, MapType and StructType. We should support them to extend the coverage of the performance improvement the vectorized Parquet reader brings. This patch adds ArrayType and StructType support first.

### Main changes

* Obtain repetition and definition level information for the Parquet schema

  To support complex types in the vectorized Parquet reader, we need the repetition and definition levels of the Parquet schema, which encode the structure of complex types. This PR introduces a class to capture this encoding: `RepetitionDefinitionInfo`. It also introduces a few classes to capture the Parquet schema structure: `ParquetField`, `ParquetStruct`, `ParquetArray` and `ParquetMap`. A new method `getParquetStruct` is added to `ParquetSchemaConverter`; it creates a `ParquetStruct` object that captures the structure and metadata. The `ParquetStruct` has the same schema structure as the required schema used to guide Parquet reading, and it provides the corresponding repetition and definition levels for the fields in the required schema.

* Attach `VectorizedColumnReader` to `ColumnVector`

  In a flat schema, each `ColumnVector` is a data column, so previously the relation between `VectorizedColumnReader` and `ColumnVector` was one-to-one. Now only a `ColumnVector` representing a data column has a corresponding `VectorizedColumnReader`; when it is time to read a batch, a `ColumnVector` with a complex type delegates to its child `ColumnVector`s.

* Implement constructing complex records in `VectorizedColumnReader`

  `readBatch` in `VectorizedColumnReader` is the main method that reads data into a `ColumnVector`.
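To make the repetition/definition-level encoding concrete, here is a minimal, self-contained sketch of the general record-assembly idea. This is not the PR's code: the names `LevelAssembly`, `Leaf` and `assemble` are made up, and the schema is simplified to a single repeated int32 leaf (so maxRepetitionLevel = 1 and maxDefinitionLevel = 1). A repetition level of 0 marks the start of a new record, and a definition level below the maximum marks an empty list, which mirrors the counting logic described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reassembling a repeated int32 leaf column into records
// using Dremel-style repetition/definition levels. Assumes a schema like
//   message m { repeated int32 v; }   (maxRep = 1, maxDef = 1)
public class LevelAssembly {
    // One decoded leaf value together with its repetition and definition levels.
    record Leaf(int repLevel, int defLevel, Integer value) {}

    static List<List<Integer>> assemble(List<Leaf> leaves) {
        List<List<Integer>> records = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < leaves.size(); i++) {
            Leaf leaf = leaves.get(i);
            if (leaf.repLevel() == 0 && i > 0) { // repLevel == 0 starts a new record
                records.add(current);
                current = new ArrayList<>();
            }
            if (leaf.defLevel() == 1) {          // defLevel == maxDef: a real value
                current.add(leaf.value());
            }                                    // defLevel < maxDef: empty list here
        }
        records.add(current);                    // flush the last record
        return records;
    }

    public static void main(String[] args) {
        // Three records: [1, 2], [], [3]
        List<Leaf> leaves = List.of(
            new Leaf(0, 1, 1), new Leaf(1, 1, 2),  // record 1: two values
            new Leaf(0, 0, null),                  // record 2: empty list
            new Leaf(0, 1, 3));                    // record 3: one value
        System.out.println(assemble(leaves));      // [[1, 2], [], [3]]
    }
}
```

The real implementation has to handle deeper nesting, nulls at multiple levels, and batch boundaries, but the same counting over level values drives it.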
  Previously, it simply loaded the required number of values according to the data type of the column vector. Now, after the data is loaded into a column, we need to construct complex records in its parent column, which can be of ArrayType, MapType or StructType. How to restore the data as complex types is encoded in the repetition and definition levels in Parquet. The new method `constructComplexRecords` in `VectorizedColumnReader` implements the logic to restore the complex data. Basically, `constructComplexRecords` counts the consecutive values and adds an array into the parent column whenever a repetition level value indicates that a new record begins. It also needs to handle null values: a null value can mean a null record at the root level, an empty array, or an empty struct. The method distinguishes these cases and sets the result accordingly.

### Benchmark

```scala
val N = 10000
withParquetTable((0 until N).map { i =>
  ((i to i + 1000).toList,
   (i to i + 100).map(_.toString).toList,
   (i to i + 1000).map(_.toDouble / 2).toList,
   ((0 to 10).map(_.toString).toList, (0 to 10).map(_.toString).toList))
}, "t") {
  val benchmark = new Benchmark("Vectorization Parquet for nested types", N)
  benchmark.addCase("Vectorization Parquet reader", 10) { iter =>
    sql("SELECT _1[10], _2[20], _3[30], _4._1[5], _4._2[5] FROM t").collect()
  }
  benchmark.run()
}
```

Disabled vectorization:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------
Vectorization Parquet reader                   1706 / 2207        0.0     170580.8      1.0X
```

Enabled vectorization:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------
Vectorization Parquet reader                    789 /  972        0.0      78919.4      1.0X
```

## How was this patch tested?

Jenkins tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 vectorized-parquet-complex-type

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14388.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14388

----
commit 8cfeb7e74843d8674c5354a67a7fc4f9d45100dd
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date:   2016-07-27T09:32:18Z

    Add ArrayType, StructType support to vectorized Parquet reader.
----
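For reference, the "Disabled vectorization" and "Enabled vectorization" runs above correspond to toggling the `spark.sql.parquet.enableVectorizedReader` SQL conf. A hedged sketch of that toggle from the Java API follows; the app name and master are illustrative, and the benchmark query itself is elided.

```java
import org.apache.spark.sql.SparkSession;

// Illustrative config fragment: switching the vectorized Parquet reader
// on and off between benchmark runs via the runtime SQL conf.
public class ToggleVectorization {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("vectorization-benchmark")  // illustrative name
            .master("local[*]")
            .getOrCreate();

        // "Enabled vectorization" run
        spark.conf().set("spark.sql.parquet.enableVectorizedReader", "true");
        // ... run the benchmark query here ...

        // "Disabled vectorization" run
        spark.conf().set("spark.sql.parquet.enableVectorizedReader", "false");
        // ... run the benchmark query again ...

        spark.stop();
    }
}
```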