[ 
https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240333#comment-14240333
 ] 

Alex Levenson commented on PARQUET-98:
--------------------------------------

I will try to take a look, but it might take me a while before I get a chance.
The filter2 API still skips assembly for records that are filtered out (e.g. it 
does not instantiate a protobuf), and it only visits the columns you have asked 
for via the column projection API. However, it does not look at the columns 
referenced in the filter and project down to only those columns (I don't think 
the unbound record filter approach does either), and it still visits every 
column you want in the final output record; it does not short circuit. The 
unbound record filter might short circuit here, but how much that matters 
depends on the order in which the columns are visited.
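
For reference, a minimal sketch of how a FilterPredicate gets wired into the 
reader through FilterCompat. This is not from the issue: it uses GroupReadSupport 
and placeholder column/value/path names purely to keep the snippet self-contained 
(the reporter goes through ProtoReadSupport), and the package names assume the 
1.6.x "parquet.*" line:

{code:java}
import static parquet.filter2.predicate.FilterApi.binaryColumn;
import static parquet.filter2.predicate.FilterApi.eq;

import org.apache.hadoop.fs.Path;
import parquet.example.data.Group;
import parquet.filter2.compat.FilterCompat;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetReader;
import parquet.hadoop.example.GroupReadSupport;
import parquet.io.api.Binary;

public class Filter2Sketch {
  public static void main(String[] args) throws Exception {
    String column = "name";   // placeholder column path
    String value = "foo";     // placeholder value to match

    // filter2 predicate: records that fail it are dropped without being assembled,
    // but the filter column is not projected for you -- only the columns requested
    // via the projection API are read.
    FilterPredicate pred = eq(binaryColumn(column), Binary.fromString(value));

    ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path(args[0]))
            .withFilter(FilterCompat.get(pred))
            .build();
    try {
      Group g;
      while ((g = reader.read()) != null) {
        System.out.println(g);
      }
    } finally {
      reader.close();
    }
  }
}
{code}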

> filter2 API performance regression
> ----------------------------------
>
>                 Key: PARQUET-98
>                 URL: https://issues.apache.org/jira/browse/PARQUET-98
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Viktor Szathmary
>
> The new filter API seems to be much slower (or perhaps I'm using it wrong :)
> Code using an UnboundRecordFilter:
> {code:java}
> ColumnRecordFilter.column(column,
>     ColumnPredicates.applyFunctionToBinary(
>         input -> Binary.fromString(value).equals(input)));
> {code}
> vs. code using FilterPredicate:
> {code:java}
> eq(binaryColumn(column), Binary.fromString(value));
> {code}
> The latter runs about twice as slowly on the same Parquet file (built using 
> 1.6.0rc2).
> Note: the reader is constructed using
> {code:java}
> ParquetReader.builder(new ProtoReadSupport(), path).withFilter(filter).build()
> {code}
> The new filter API based approach seems to create a whole lot more garbage 
> (perhaps due to reconstructing all the rows?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
