[ 
https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Davies updated SPARK-5309:
----------------------------------
    Description: 
Converting between Parquet Binary and Java Strings can form a significant 
proportion of query times.

For columns with repeated String values (which is common), the same Binary 
will be converted repeatedly.

A simple change to cache the last converted String per column was shown to 
reduce query times by 25% when grouping on a data set of 66M rows on a column 
with many repeated Strings.
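A minimal sketch of that last-value cache, for illustration only: `CachingStringConverter` is a hypothetical name, and `byte[]` stands in for Parquet's `Binary` type. Repeated values hit the cache and skip UTF-8 decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CachingStringConverter {
    private byte[] lastBytes;
    private String lastString;

    // Returns the cached String when the incoming bytes match the
    // previously converted value for this column; otherwise decodes
    // the bytes as UTF-8 and remembers the result.
    String convert(byte[] binary) {
        if (lastBytes != null && Arrays.equals(lastBytes, binary)) {
            return lastString;
        }
        lastBytes = binary.clone();
        lastString = new String(binary, StandardCharsets.UTF_8);
        return lastString;
    }
}
```

On a column where consecutive rows repeat the same value, the second and later lookups return the already-decoded String without allocating a new one.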

A possible optimisation would be to hand responsibility for Binary 
encoding/decoding over to Parquet so that it could ensure that this was done 
only once per Binary value. 

The next step is to look at the Parquet code and discuss this with that 
project, which I will do.

More details are available on this discussion:
http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html



> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5309
>                 URL: https://issues.apache.org/jira/browse/SPARK-5309
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Michael Davies
>            Priority: Minor
>
> Converting between Parquet Binary and Java Strings can form a significant 
> proportion of query times.
> For columns with repeated String values (which is common), the same 
> Binary will be converted repeatedly. 
> A simple change to cache the last converted String per column was shown to 
> reduce query times by 25% when grouping on a data set of 66M rows on a column 
> with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary 
> encoding/decoding over to Parquet so that it could ensure that this was done 
> only once per Binary value. 
> The next step is to look at the Parquet code and discuss this with that 
> project, which I will do.
> More details are available on this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
