Hi Spark users and developers,

I have been trying to understand how Spark SQL works with Parquet for the past couple of days, and I have run into an unexpected performance problem with column pruning. Here is a dummy example:
The Parquet file has three fields:

 |-- customer_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- mapping: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (nullable = true)

Note that mapping is just a field with a lot of key-value pairs. I created a Parquet file with 1 billion entries, each entry having 10 key-value pairs in the mapping. After generating this file, I generated another Parquet file without the mapping field, i.e.:

 |-- customer_id: string (nullable = true)
 |-- type: string (nullable = true)

Let's call the first file data-with-mapping and the second file data-without-mapping (a rough sketch of how they were generated is in the P.S. below). Then I ran a very simple query over the two files:

  val df = sqlContext.read.parquet(path)
  df.select(df("type")).count

The run on data-with-mapping takes 34 seconds with an input size of 11.7 MB. The run on data-without-mapping takes 8 seconds with an input size of 7.6 MB. Both ran on the same cluster with Spark 1.4.1.

What bothers me the most is the input size, because I assumed column pruning would only deserialize the columns that are relevant to the query (in this case the field type), yet it clearly reads more data from data-with-mapping than from data-without-mapping. The run is about 4x faster on data-without-mapping, which suggests that the more columns a Parquet file has, the slower the query is, even if only a single column is needed.

Does anyone have an explanation for this? I was expecting both runs to finish in approximately the same time.

Best Regards,

Jerry
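
P.S. In case it helps to reproduce, here is roughly how the two files were generated, simplified and scaled down. The row count, the output paths, and the sc/sqlContext names below are placeholders from a spark-shell session; the real job wrote 1 billion rows.

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // One row per customer, with a 10-entry map column.
  case class Entry(customer_id: String, `type`: String, mapping: Map[String, String])

  val numRows = 1000000L  // scaled down from 1 billion for this sketch
  val withMapping = sc.parallelize(1L to numRows).map { i =>
    Entry(s"customer_$i", s"type_${i % 10}",
          (1 to 10).map(k => s"key_$k" -> s"value_$k").toMap)
  }.toDF()

  // Write the full file, then the same data minus the mapping column.
  withMapping.write.parquet("/tmp/data-with-mapping")
  withMapping.drop("mapping").write.parquet("/tmp/data-without-mapping")

  // The query that was timed against each file:
  val df = sqlContext.read.parquet("/tmp/data-with-mapping")
  df.select(df("type")).count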