Re: Spark SQL & Parquet - data are reading very very slow

Mick Davies Fri, 16 Jan 2015 03:19:18 -0800

I have seen similar results.

I have looked at the code and I think there are a couple of contributors:

Encoding and decoding Java Strings to and from UTF-8 bytes is quite expensive.
I'm not sure what you can do about that.

But there is room for optimization where the same String values are decoded
repeatedly.

As a Spark query processes each row from Parquet, it makes a call to convert
the Binary representation of each String column into a Java String. However,
in many (probably most) circumstances the underlying Binary objects from
Parquet will have come from a dictionary, for example when column
cardinality is low. Spark is therefore converting the same byte array into a
new copy of the same Java String over and over again. This is bad because it
costs extra CPU and extra memory for these Strings, and it probably makes
grouping comparisons more expensive.
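
To make the cost concrete, here is a minimal Java sketch of the pattern described above. It does not use Parquet's or Spark's real classes: the byte[][] dictionary, the per-row dictionary ids, and the RepeatedDecodeDemo name are all stand-ins made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a dictionary-encoded string column:
// a small dictionary of UTF-8 byte arrays plus one dictionary id per row.
final class RepeatedDecodeDemo {

    // Decodes every row independently, as a naive per-row conversion would.
    static String[] decodeEveryRow(byte[][] dictionary, int[] rowIds) {
        String[] out = new String[rowIds.length];
        for (int i = 0; i < rowIds.length; i++) {
            // A fresh UTF-8 decode and a fresh String object for each row,
            // even when the same dictionary entry was decoded on the previous row.
            out[i] = new String(dictionary[rowIds[i]], StandardCharsets.UTF_8);
        }
        return out;
    }
}
```

For rows that all point at the same dictionary entry, the returned Strings are equal in content but are distinct objects, so both the decoding work and the memory are duplicated once per row.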

I tested a simple hack to cache the last Binary->String conversion per
column, and this led to a 25% performance improvement for the queries I used.
Admittedly this was over a data set with lots of runs of the same Strings in
the queried columns.
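
A rough sketch of what such a last-value cache could look like is below. This is not the actual patch, only an illustration of the idea under some assumptions: it works on plain byte[] rather than Parquet's Binary class, one instance is used per string column, and the caller does not mutate an array after passing it in. The LastValueStringCache name and the decodeCount counter are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical one-entry cache; use one instance per string column.
final class LastValueStringCache {
    private byte[] lastBytes;   // bytes of the most recently decoded value
    private String lastString;  // decoded form of lastBytes
    private long decodeCount;   // number of real UTF-8 decodes performed

    String decode(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        // Arrays.equals short-circuits when both references are the same array,
        // which is the common case if values come from a shared dictionary.
        // Assumption: callers never mutate an array after handing it over.
        if (Arrays.equals(bytes, lastBytes)) {
            return lastString;
        }
        lastBytes = bytes;
        lastString = new String(bytes, StandardCharsets.UTF_8);
        decodeCount++;
        return lastString;
    }

    long decodeCount() {
        return decodeCount;
    }
}
```

With a column whose values arrive as a, a, a, b, b, a this performs three decodes instead of six. The benefit depends entirely on how long the runs of identical values are; on a column with no runs it only adds a byte comparison per row.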

I haven't looked at the code that writes Parquet files in Spark, but I imagine
similar duplicate String->Binary conversions could be happening.

These costs are quite important for the type of data that I expect will be
stored in Parquet, which will often be denormalized tables with probably
lots of fairly low-cardinality string columns.

It's possible that changes could be made to Parquet so that the
encoding/decoding of Objects to Binary is handled on the Parquet side of the
fence. Parquet could deal with Objects (Strings) as the client understands
them and only use encoding/decoding to store to and read from the underlying
storage medium. Doing this, I think Parquet could ensure that the
encoding/decoding of each Object occurs only once.
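
As an illustration of the "decode once" idea only, and not of any existing Parquet API, the sketch below decodes each dictionary entry a single time and then serves rows by dictionary id. The DecodedDictionary class and its byte[][] input are hypothetical; a real change would have to live inside Parquet's own dictionary and converter code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reader-side view of a dictionary-encoded string column.
final class DecodedDictionary {
    private final String[] decoded;

    // Each dictionary entry is decoded exactly once, up front.
    DecodedDictionary(byte[][] rawEntries) {
        decoded = new String[rawEntries.length];
        for (int i = 0; i < rawEntries.length; i++) {
            decoded[i] = new String(rawEntries[i], StandardCharsets.UTF_8);
        }
    }

    // Per-row access is a plain array lookup: no decoding and no new String.
    String get(int dictionaryId) {
        return decoded[dictionaryId];
    }
}
```

Unlike a last-value cache, this does not depend on equal values being adjacent: every row that refers to the same dictionary id gets the same String object, which also keeps memory use proportional to the dictionary size rather than the row count.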

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
