[ https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232092#comment-14232092 ]
Viktor Szathmary edited comment on PARQUET-98 at 12/2/14 8:41 PM:
------------------------------------------------------------------
My initial impression, based on a quick profiling run, is that the new API was
creating a whole lot more garbage, seemingly instantiating all the protobufs
(or perhaps reading all columns of every row?) rather than just the ones
matching the expression.
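
A quick way to put a number on that impression is to measure per-thread
allocation around a full scan with each filter style. A minimal sketch,
assuming a HotSpot JVM (getThreadAllocatedBytes comes from the non-standard
com.sun.management.ThreadMXBean); the scan itself is elided:

{code:java}
import java.lang.management.ManagementFactory;

public class AllocationCheck {
    public static void main(String[] args) {
        // HotSpot-specific MXBean; getThreadAllocatedBytes is not part of
        // the standard java.lang.management API.
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();

        long before = bean.getThreadAllocatedBytes(tid);
        // ... run the ParquetReader scan with one of the two filters here ...
        long after = bean.getThreadAllocatedBytes(tid);

        System.out.printf("allocated during scan: %,d bytes%n", after - before);
    }
}
{code}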
> filter2 API performance regression
> ----------------------------------
>
> Key: PARQUET-98
> URL: https://issues.apache.org/jira/browse/PARQUET-98
> Project: Parquet
> Issue Type: Bug
> Reporter: Viktor Szathmary
>
> The new filter API seems to be much slower (or perhaps I'm using it wrong :)
> Code using an UnboundRecordFilter:
> {code:java}
> ColumnRecordFilter.column(column,
>     ColumnPredicates.applyFunctionToBinary(
>         input -> Binary.fromString(value).equals(input)));
> {code}
> vs. code using FilterPredicate:
> {code:java}
> eq(binaryColumn(column), Binary.fromString(value));
> {code}
> The latter is twice as slow on the same Parquet file (built using
> 1.6.0rc2).
> Note: the reader is constructed using
> {code:java}
> ParquetReader.builder(new ProtoReadSupport(), path)
>     .withFilter(filter)
>     .build()
> {code}
> The approach based on the new filter API seems to create a whole lot more
> garbage (perhaps due to reconstructing all the rows?).
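
For reference, a minimal end-to-end sketch of how the two filter styles wire
into the reader. This is hedged: it assumes the parquet-mr 1.6.0-era
parquet.* package names, the parquet-protobuf ProtoReadSupport, and
FilterCompat.get(...) as the bridge to the builder's withFilter(...); path,
column, and value are placeholders for the reporter's actual inputs.

{code:java}
import org.apache.hadoop.fs.Path;

import com.google.protobuf.Message;

import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.UnboundRecordFilter;
import parquet.filter2.compat.FilterCompat;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetReader;
import parquet.io.api.Binary;
import parquet.proto.ProtoReadSupport;

import static parquet.filter2.predicate.FilterApi.binaryColumn;
import static parquet.filter2.predicate.FilterApi.eq;

public class FilterStyleComparison {
    public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]);   // placeholder: the Parquet file
        String column = args[1];         // placeholder: column to filter on
        String value = args[2];          // placeholder: value to match

        // Old API: an UnboundRecordFilter on a single binary column.
        UnboundRecordFilter oldStyle = ColumnRecordFilter.column(column,
                ColumnPredicates.applyFunctionToBinary(
                        input -> Binary.fromString(value).equals(input)));

        // New filter2 API: the equivalent FilterPredicate.
        FilterPredicate newStyle = eq(binaryColumn(column), Binary.fromString(value));

        // Both styles go through FilterCompat.get(...); swap oldStyle in
        // for newStyle to A/B the two code paths.
        ParquetReader<Message> reader =
                ParquetReader.builder(new ProtoReadSupport<Message>(), path)
                        .withFilter(FilterCompat.get(newStyle))
                        .build();
        try {
            long matches = 0;
            for (Message record = reader.read(); record != null; record = reader.read()) {
                matches++;  // count matches; enough work for a timing run
            }
            System.out.println("matched records: " + matches);
        } finally {
            reader.close();
        }
    }
}
{code}

Timing this loop once with FilterCompat.get(newStyle) and once with
FilterCompat.get(oldStyle) reproduces the comparison described above.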