Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
Hi spark users and developers,

I have been trying to understand how Spark SQL works with Parquet for the
past couple of days. I ran into an unexpected performance problem with
column pruning. Here is a dummy example:

The parquet file has 3 fields:

 |-- customer_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- mapping: map (nullable = true)
 ||-- key: string
 ||-- value: string (nullable = true)

Note that mapping is just a field with a lot of key-value pairs.
I created a parquet file with 1 billion entries, each entry
having 10 key-value pairs in the mapping.
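
Roughly speaking, the data was generated along these lines (a simplified
sketch; the path, type value and row count below are just placeholders):

// Simplified sketch of the data generation (placeholder names and paths).
import sqlContext.implicits._

val numRows = 1000000000L // 1 billion entries
val withMapping = sc.range(0L, numRows).map { i =>
  val mapping = (1 to 10).map(k => s"key_$k" -> s"value_$k").toMap
  (s"customer_$i", "some_type", mapping)
}.toDF("customer_id", "type", "mapping")
withMapping.write.parquet("/data/data-with-mapping")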

After generating this parquet file, I generated another parquet file without
the mapping field, that is:
 |-- customer_id: string (nullable = true)
 |-- type: string (nullable = true)

Let's call the first parquet file data-with-mapping and the second parquet
file data-without-mapping.

Then I ran a very simple query over both parquet files:
val df = sqlContext.read.parquet(path)
df.select(df("type")).count

The run on the data-with-mapping takes 34 seconds with the input size
of 11.7 MB.
The run on the data-without-mapping takes 8 seconds with the input size of
7.6 MB.

Both ran on the same cluster with Spark 1.4.1.
What bothers me the most is the input size, because I assumed column
pruning would only deserialize the columns that are relevant to the query (in
this case the field type), yet it clearly reads more data on the
data-with-mapping than on the data-without-mapping. The run is also about 4x
faster on the data-without-mapping, which suggests that the more columns a
parquet file has, the slower it is, even when only a specific column is needed.
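
In case it helps, this is how the plan can be inspected to see which columns
the scan actually requests (just a sketch; the path is a placeholder and the
exact output differs between versions):

val dfWithMapping = sqlContext.read.parquet("/data/data-with-mapping")
// Print the logical and physical plans; the Parquet scan should list the
// requested output columns (only `type` here if pruning is applied).
dfWithMapping.select(dfWithMapping("type")).explain(true)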

Does anyone have an explanation for this? I was expecting both of them to
finish in approximately the same time.

Best Regards,

Jerry


Re: Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Cheng Lian

Hi Jerry,

Thanks for the detailed report! I haven't investigated this issue in
detail yet. But as for the input size issue, I believe this is due to a
limitation of the HDFS API. It seems that Hadoop FileSystem adds the size of
a whole block to the metrics even if you only touch a fraction of that
block. In Parquet, all columns within a single row group are stored in a
single HDFS block. This is probably why you observed weird task input
sizes. You may find more information in one of my earlier posts:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3c54c9899e.2030...@gmail.com%3E
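
As a side note, one way to confirm how columns are laid out per row group is
to read the Parquet footer metadata directly. A rough sketch against the
parquet-mr API (the package prefix and part file name below are illustrative
and differ between Parquet versions):

// Rough sketch: list per-column chunk sizes for each row group.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/data/data-with-mapping/part-r-00001.parquet"))
for (block <- footer.getBlocks.asScala; col <- block.getColumns.asScala) {
  println(s"${col.getPath} compressed=${col.getTotalSize} " +
    s"uncompressed=${col.getTotalUncompressedSize}")
}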


For the performance issue, I don't have a proper explanation yet; it needs
further investigation.


Cheng




