[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285899#comment-14285899 ]

Apache Spark commented on SPARK-5309:
-------------------------------------

User 'MickDavies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4139

> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5309
>                 URL: https://issues.apache.org/jira/browse/SPARK-5309
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Michael Davies
>            Priority: Minor
>
> Converting between Parquet Binary and Java Strings can account for a 
> significant proportion of query time.
> For columns with repeated String values (which is common), the same Binary 
> is converted over and over.
> A simple change to cache the last converted String per column, as sketched 
> below, was shown to reduce query times by 25% when grouping on a data set 
> of 66M rows on a column with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary 
> encoding/decoding over to Parquet, so that it could ensure the conversion 
> is done only once per Binary value.
> The next step is to look at the Parquet code and discuss with that project, 
> which I will do.
> More details are available in this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
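
For illustration, a minimal sketch of the per-column caching idea described 
above (Scala; the class name and the reliance on Binary.equals for cache hits 
are assumptions made for this sketch, not necessarily what the pull request 
does):

    import parquet.io.api.Binary

    // One instance per column converter: remembers the last Binary seen and
    // its decoded String, so runs of repeated values skip UTF-8 decoding.
    class CachedStringDecoder {
      private var lastBinary: Binary = _
      private var lastString: String = _

      def decode(binary: Binary): String = {
        // Binary.equals compares byte contents, so a repeated value is a hit.
        if (lastBinary == null || lastBinary != binary) {
          // Copy the bytes, since Parquet may reuse the underlying buffer.
          lastBinary = Binary.fromByteArray(binary.getBytes)
          lastString = binary.toStringUsingUTF8
        }
        lastString
      }
    }

Because this cache holds only the last value, it helps most when equal values 
arrive in runs; a dictionary-backed cache would cover the general 
repeated-value case.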


