[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282263#comment-14282263 ]
MIchael Davies edited comment on SPARK-5309 at 1/19/15 8:39 AM:
----------------------------------------------------------------

Additionally, I noticed that predicates pushed down to Parquet are evaluated roughly like this:

{code}
getNextRow:
  while (true) {
    read the entire row, applying any Binary->String conversions
    (some predicate calculations are nested inside reading the values)
    if the predicate fails, loop; otherwise return the row
  }
{code}

For filters applied to column values that change slowly, this is not very efficient. Predicates are often simple and evaluable against a single column, e.g.:

{code}
WHERE currency = 'GBP' AND status = 'OPEN'
{code}

It would be great if Parquet could apply the information it already has about column encoding and compression to predicate evaluation, and skip rows more efficiently when possible.


was (Author: michael davies):
Additionally, I noticed that predicates pushed down to Parquet are evaluated roughly like this:

{code}
getNextRow:
  while (true) {
    read the entire row, applying any Binary->String conversions
    (some predicate calculations are nested inside reading the values)
    if the predicate fails, loop; otherwise return the row
  }
{code}

For filters applied to column values that change slowly, this is not very efficient.

> Reduce Binary/String conversion overhead when reading/writing Parquet files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-5309
>                 URL: https://issues.apache.org/jira/browse/SPARK-5309
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: MIchael Davies
>            Priority: Minor
>
> Converting between Parquet Binary and Java Strings can form a significant
> proportion of query times.
> For columns which have repeated String values (which is common), the same
> Binary will repeatedly be converted.
> A simple change to cache the last converted String per column was shown to
> reduce query times by 25% when grouping a data set of 66M rows on a column
> with many repeated Strings.
> A possible optimisation would be to hand responsibility for Binary
> encoding/decoding over to Parquet, so that it could ensure the conversion is
> done only once per Binary value.
> The next step is to look at the Parquet code and discuss with that project,
> which I will do.
> More details are available in this discussion:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html
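
A minimal sketch of the per-column caching idea described in the issue above, assuming parquet-mr's Binary API (the class name CachedStringConverter is illustrative, not Spark's actual converter; package names follow recent parquet-mr — in the 1.6.x era the package was parquet.io.api):

{code}
import org.apache.parquet.io.api.Binary

// Illustrative sketch only: cache the last Binary -> String conversion for a
// single column. For slowly changing or heavily repeated values, the same
// Binary arrives many times in a row, so the cache hit rate can be high.
class CachedStringConverter {
  private var lastBinary: Binary = null
  private var lastString: String = null

  def convert(binary: Binary): String = {
    // Only pay the UTF-8 decoding cost when the bytes actually change.
    if (lastBinary == null || !lastBinary.equals(binary)) {
      lastBinary = binary
      lastString = binary.toStringUsingUTF8
    }
    lastString
  }
}
{code}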
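And for the predicate example in the comment above, a sketch of how a simple single-column predicate could be handed to Parquet through its filter2 API, so that Parquet can use row-group statistics (and dictionaries, in later versions) to skip data without materialising rows. This is a hand-written illustration, not what Spark 1.2 actually generates; package names again follow recent parquet-mr:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.binaryColumn
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.parquet.io.api.Binary

// WHERE currency = 'GBP' AND status = 'OPEN', expressed as a Parquet
// FilterPredicate that Parquet itself can evaluate.
val predicate = FilterApi.and(
  FilterApi.eq(binaryColumn("currency"), Binary.fromString("GBP")),
  FilterApi.eq(binaryColumn("status"), Binary.fromString("OPEN")))

// Register the predicate on the Hadoop job configuration so the Parquet
// reader applies it during the scan.
val conf = new Configuration()
ParquetInputFormat.setFilterPredicate(conf, predicate)
{code}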