Michael Davies created SPARK-5309:
-------------------------------------

             Summary: Reduce Binary/String conversion overhead when reading/writing Parquet files
                 Key: SPARK-5309
                 URL: https://issues.apache.org/jira/browse/SPARK-5309
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: Michael Davies
            Priority: Minor
Converting between Parquet Binary values and Java Strings can account for a significant proportion of query time. For columns with repeated String values (which is common), the same Binary is converted over and over again. A simple change that caches the last converted String per column was shown to reduce query times by 25% when grouping a 66M-row data set on a column with many repeated Strings.

A further optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet, so that it can ensure each Binary value is converted only once. The next step is to look at the Parquet code and discuss this with that project.

More details are available in this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
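The caching idea above can be sketched as follows. This is a minimal illustration, not Spark's actual Parquet converter code: the class name and API here are hypothetical, and it only remembers the single most recently decoded value per column, which is exactly the "cache the last converted String" change described.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical per-column converter sketch: decoding Parquet Binary bytes to a
// String requires a UTF-8 decode, so when a column contains many repeated
// values we remember the last byte array seen and reuse its decoded String.
final class CachedStringConverter {
  private var lastBytes: Array[Byte] = null
  private var lastString: String = null
  private var decodes: Int = 0 // counts actual UTF-8 decodes, for illustration

  // Convert raw bytes to a String, reusing the previous result when the bytes
  // are identical to the last value seen for this column.
  def convert(bytes: Array[Byte]): String = {
    if (lastBytes != null && java.util.Arrays.equals(lastBytes, bytes)) {
      lastString
    } else {
      decodes += 1
      lastBytes = bytes
      lastString = new String(bytes, StandardCharsets.UTF_8)
      lastString
    }
  }

  def decodeCount: Int = decodes
}
```

With a run of identical values, only the first occurrence pays for the decode; every repeat is a cheap byte-array comparison. Handing this over to Parquet itself, as proposed above, could go further by decoding each distinct Binary at most once regardless of ordering.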