[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
MIchael Davies updated SPARK-5309:
----------------------------------
    Description: 
Converting between Parquet Binary and Java Strings can account for a significant proportion of query time.

For columns that have repeated String values (which is common), the same Binary is converted again and again.

A simple change to cache the last converted String per column was shown to reduce query times by 25% when grouping a 66M-row data set on a column with many repeated Strings.

A possible optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet, so that it could ensure the conversion is done only once per Binary value.

The next step is to look at the Parquet code and discuss this with that project, which I will do. (A sketch of the per-column cache appears at the end of this message.)

More details are available in this discussion:
http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html


> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-5309
>                 URL: https://issues.apache.org/jira/browse/SPARK-5309
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: MIchael Davies
>            Priority: Minor
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
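
Below is a minimal sketch of the per-column caching idea described above. It assumes the pre-Apache parquet-mr API (parquet.io.api.Binary, parquet.io.api.PrimitiveConverter) that Spark 1.2.x builds against; the class name CachingStringConverter and the update callback are hypothetical illustration, not Spark's actual converter code.

{code:scala}
import parquet.io.api.{Binary, PrimitiveConverter}

// Hypothetical converter: caches the last Binary -> String conversion for
// one column, so runs of repeated values are decoded from UTF-8 only once.
class CachingStringConverter(update: String => Unit) extends PrimitiveConverter {
  private[this] var lastBinary: Binary = _
  private[this] var lastString: String = _

  override def addBinary(value: Binary): Unit = {
    if (lastBinary == null || lastBinary != value) {
      // Binary instances may wrap reused buffers, so a production version
      // should copy the bytes before caching the reference.
      lastBinary = value
      lastString = value.toStringUsingUTF8
    }
    // Hand the (possibly cached) String to whatever assembles the row.
    update(lastString)
  }
}
{code}

Caching only the single most recent value keeps the cache O(1) per column; pushing the work down into Parquet itself, as proposed above, could go further for dictionary-encoded columns by decoding each dictionary entry exactly once.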