I have seen similar results. I have looked at the code and I think there are a couple of contributors:
Encoding and decoding Java Strings to and from UTF-8 bytes is quite expensive, and I'm not sure what can be done about that cost itself. There is, however, room for optimization, because the same String values are decoded repeatedly.

As a Spark query processes each row from Parquet, it converts the Binary representation of every String column into a Java String. In many (probably most) circumstances the underlying Binary objects from Parquet will have come from a dictionary, for example when column cardinality is low. Spark is therefore converting the same byte array into a fresh copy of the same Java String over and over again. This costs extra CPU and extra memory for the duplicate strings, and it probably makes grouping comparisons more expensive as well, since equal values held in distinct String objects cannot short-circuit on a reference check.

I tested a simple hack that caches the last Binary->String conversion per column, and it led to a 25% performance improvement for the queries I used. Admittedly this was over a data set with lots of runs of the same Strings in the queried columns.

I haven't looked at the code that writes Parquet files in Spark, but I imagine similar duplicate String->Binary conversions could be happening there.

These costs are quite important for the type of data that I expect will be stored in Parquet, which will often be denormalized tables with lots of fairly low-cardinality string columns.

It's possible that Parquet could be changed so that the encoding/decoding of Objects to Binary is handled on the Parquet side of the fence. Parquet could deal with Objects (Strings) as the client understands them and only use encoding/decoding to store to and read from the underlying storage medium. Doing this, I think Parquet could ensure that the encoding/decoding of each Object occurs only once.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
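To make the per-column caching hack described above concrete, here is a minimal sketch in plain Java. It is not the actual Spark or Parquet code: `LastValueStringCache` is a hypothetical name, a raw `byte[]` stands in for Parquet's Binary, and the sketch assumes the caller does not mutate an array after passing it in. One instance would be kept per String column.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: remembers the last byte[] -> String conversion for one column. */
final class LastValueStringCache {
    private byte[] lastBytes;   // bytes of the most recently decoded value
    private String lastString;  // decoded form of lastBytes
    private int decodeCount;    // real UTF-8 decodes performed (for illustration only)

    /** Returns the cached String if the bytes match the previous call, else decodes. */
    String decode(byte[] utf8) {
        // Cheap identity check first (dictionary-backed values may be the same array),
        // then fall back to comparing contents.
        if (lastString != null && (utf8 == lastBytes || Arrays.equals(utf8, lastBytes))) {
            return lastString;
        }
        // Assumption: the caller will not modify utf8 later; otherwise copy it here.
        lastBytes = utf8;
        lastString = new String(utf8, StandardCharsets.UTF_8);
        decodeCount++;
        return lastString;
    }

    int decodeCount() {
        return decodeCount;
    }

    public static void main(String[] args) {
        // A low-cardinality column with runs of repeated values.
        String[] column = {"US", "US", "US", "DE", "DE", "US"};
        LastValueStringCache cache = new LastValueStringCache();
        for (String value : column) {
            cache.decode(value.getBytes(StandardCharsets.UTF_8));
        }
        // prints "6 rows, 3 decodes"
        System.out.println(column.length + " rows, " + cache.decodeCount() + " decodes");
    }
}
```

A single-entry cache like this only helps when equal values arrive in runs, which matches the caveat about the test data set. A cache keyed by dictionary id would cover non-adjacent repeats too, but that depends on Parquet internals not shown here.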