Michael Davies created SPARK-5309:
-------------------------------------

             Summary: Reduce Binary/String conversion overhead when 
reading/writing Parquet files
                 Key: SPARK-5309
                 URL: https://issues.apache.org/jira/browse/SPARK-5309
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: Michael Davies
            Priority: Minor


Converting between Parquet Binary values and Java Strings can account for a 
significant proportion of query time.

For columns with repeated String values (a common case), the same Binary is 
converted over and over. 

A simple change to cache the last converted String per column was shown to 
reduce query times by 25% when grouping on a data set of 66M rows on a column 
with many repeated Strings.
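The caching idea can be sketched as follows. This is a hypothetical, 
self-contained illustration (the class name and use of raw byte arrays in 
place of Parquet's Binary type are assumptions, not the actual patch): each 
column converter remembers the bytes of the last Binary it saw and the 
String it decoded, and reuses that String when the next value's bytes match.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of a per-column converter that caches the last
// decoded value, so repeated identical Binary payloads (common for
// dictionary-encoded String columns) are UTF-8 decoded only once.
class CachingStringConverter {
    private byte[] lastBytes;   // raw bytes of the most recent value seen
    private String lastString;  // its decoded UTF-8 String

    String convert(byte[] bytes) {
        // Return the cached String when the incoming bytes match the
        // previous value; otherwise decode and remember the new value.
        if (lastBytes != null && Arrays.equals(lastBytes, bytes)) {
            return lastString;
        }
        lastBytes = bytes;
        lastString = new String(bytes, StandardCharsets.UTF_8);
        return lastString;
    }
}
```

One converter instance would be held per String column, so a long run of 
identical values pays the decoding cost only once; comparing byte arrays is 
much cheaper than repeated UTF-8 decoding and String allocation.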

A further optimisation would be to hand responsibility for Binary 
encoding/decoding over to Parquet itself, which could ensure the conversion 
is done only once per Binary value. 

The next step is to look at the Parquet code and discuss this with that project.

More details are available on this discussion:
http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
