[ https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353562#comment-14353562 ]

Viktor Szathmáry commented on PARQUET-98:
-----------------------------------------

Not knowing much about the implementation, this is just conjecture, but based
on the performance results and the extra garbage seen while profiling, it
seems to also read columns that are not needed to evaluate the filter. For
example, if you have columns A, B, and C and you're looking for rows where
C='x', there's no need to read A and B at all, except for the rows where C
matches.

In any case, the slowness is easy to reproduce with the example I provided
above; I'm sure someone familiar with the internals can figure this out
without my guesses ;)
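
To make the conjecture concrete, here's a rough sketch of the evaluation
order I have in mind. This is illustrative pseudocode over in-memory arrays,
not the actual Parquet reader internals:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustration only: each column is a plain array with one value per row.
class PushdownSketch {

    // Decode only the predicate column (C) up front; touch A and B solely
    // for rows where the predicate matched.
    static List<String[]> findRows(String[] colA, String[] colB,
                                   String[] colC, String value) {
        List<String[]> matches = new ArrayList<>();
        for (int row = 0; row < colC.length; row++) {
            if (!value.equals(colC[row])) {
                continue; // no match: A and B are never read for this row
            }
            matches.add(new String[] {colA[row], colB[row], colC[row]});
        }
        return matches;
    }
}
{code}

If the filter2 path instead assembles every record before evaluating the
predicate, that alone would explain both the extra time and the extra garbage.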


> filter2 API performance regression
> ----------------------------------
>
>                 Key: PARQUET-98
>                 URL: https://issues.apache.org/jira/browse/PARQUET-98
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Viktor Szathmáry
>
> The new filter API seems to be much slower (or perhaps I'm using it wrong :)
> Code using an UnboundRecordFilter:
> {code:java}
> ColumnRecordFilter.column(column,
>     ColumnPredicates.applyFunctionToBinary(
>         input -> Binary.fromString(value).equals(input)));
> {code}
> vs. code using FilterPredicate:
> {code:java}
> eq(binaryColumn(column), Binary.fromString(value));
> {code}
> The latter is about twice as slow on the same Parquet file (built using
> 1.6.0rc2).
> Note: the reader is constructed using
> {code:java}
> ParquetReader.builder(new ProtoReadSupport(), path).withFilter(filter).build();
> {code}
> The approach based on the new filter API also seems to create a lot more
> garbage (perhaps due to reconstructing all the rows?).
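
For completeness, here is a minimal sketch of how the FilterPredicate above
would typically be wired into the reader. It assumes parquet-mr 1.6.x package
names (before the org.apache.parquet rename in 1.7.0), and ExampleProto.Record
is a placeholder for a generated protobuf message class:

{code:java}
import org.apache.hadoop.fs.Path;

import parquet.filter2.compat.FilterCompat;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetReader;
import parquet.io.api.Binary;
import parquet.proto.ProtoReadSupport;

import static parquet.filter2.predicate.FilterApi.binaryColumn;
import static parquet.filter2.predicate.FilterApi.eq;

public class Filter2ReaderExample {
    public static void main(String[] args) throws Exception {
        // The filter2 predicate from the issue description.
        FilterPredicate pred = eq(binaryColumn("c"), Binary.fromString("x"));

        // FilterCompat wraps either an old UnboundRecordFilter or a new
        // FilterPredicate into the single Filter type the builder accepts.
        // ExampleProto.Record is a placeholder for a generated protobuf class.
        ParquetReader<ExampleProto.Record.Builder> reader =
                ParquetReader.builder(
                        new ProtoReadSupport<ExampleProto.Record.Builder>(),
                        new Path(args[0]))
                    .withFilter(FilterCompat.get(pred))
                    .build();
        try {
            ExampleProto.Record.Builder record;
            while ((record = reader.read()) != null) {
                // process matching records
            }
        } finally {
            reader.close();
        }
    }
}
{code}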


