[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282263#comment-14282263 ]
MIchael Davies edited comment on SPARK-5309 at 1/19/15 8:39 AM:
----------------------------------------------------------------

Additionally, I noticed that predicates pushed down to Parquet are evaluated roughly like this:

{code}
getNextRow:
  while (true) {
    read the entire row, applying any Binary->String conversions
    (some predicate calculations are nested inside reading the values)
    if the predicate fails, loop; otherwise return the row
  }
{code}

For filters applied to column values that change slowly, this is not very efficient. Predicates are often simple and evaluable against a single column, e.g.:

{code}
WHERE currency = 'GBP' AND status = 'OPEN'
{code}

It would be great if Parquet could apply the information it already has about column encoding and compression to predicate evaluation, and skip rows more efficiently when possible.


was (Author: michael davies):
Additionally, I noticed that predicates pushed down to Parquet are evaluated roughly like this:

{code}
getNextRow:
  while (true) {
    read the entire row, applying any Binary->String conversions
    (some predicate calculations are nested inside reading the values)
    if the predicate fails, loop; otherwise return the row
  }
{code}

For filters applied to column values that change slowly, this is not very efficient.

> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-5309
>                 URL: https://issues.apache.org/jira/browse/SPARK-5309
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: MIchael Davies
>            Priority: Minor
>
> Converting between Parquet Binary and Java Strings can form a significant
> proportion of query times.
> For columns which have repeated String values (which is common), the same
> Binary will repeatedly be converted.
> A simple change to cache the last converted String per column was shown to
> reduce query times by 25% when grouping a data set of 66M rows on a column
> with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary
> encoding/decoding over to Parquet, so that it could ensure the conversion is
> done only once per Binary value.
> The next step is to look at the Parquet code and discuss with that project,
> which I will do.
> More details are available in this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
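
A minimal sketch of the per-column caching idea described in the issue above, assuming parquet-mr's Binary API (the class name CachedStringConverter is illustrative, not Spark's actual converter; package names follow recent parquet-mr — in the 1.6.x era the package was parquet.io.api):

{code}
import org.apache.parquet.io.api.Binary

// Illustrative sketch only: cache the last Binary -> String conversion for a
// single column. For slowly changing or heavily repeated values, the same
// Binary arrives many times in a row, so the cache hit rate can be high.
class CachedStringConverter {
  private var lastBinary: Binary = null
  private var lastString: String = null

  def convert(binary: Binary): String = {
    // Only pay the UTF-8 decoding cost when the bytes actually change.
    if (lastBinary == null || !lastBinary.equals(binary)) {
      lastBinary = binary
      lastString = binary.toStringUsingUTF8
    }
    lastString
  }
}
{code}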
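And for the predicate example in the comment above, a sketch of how a simple single-column predicate could be handed to Parquet through its filter2 API, so that Parquet can use row-group statistics (and dictionaries, in later versions) to skip data without materialising rows. This is a hand-written illustration, not what Spark 1.2 actually generates; package names again follow recent parquet-mr:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.binaryColumn
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.parquet.io.api.Binary

// WHERE currency = 'GBP' AND status = 'OPEN', expressed as a Parquet
// FilterPredicate that Parquet itself can evaluate.
val predicate = FilterApi.and(
  FilterApi.eq(binaryColumn("currency"), Binary.fromString("GBP")),
  FilterApi.eq(binaryColumn("status"), Binary.fromString("OPEN")))

// Register the predicate on the Hadoop job configuration so the Parquet
// reader applies it during the scan.
val conf = new Configuration()
ParquetInputFormat.setFilterPredicate(conf, predicate)
{code}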