[
https://issues.apache.org/jira/browse/PARQUET-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224897#comment-14224897
]
Ryan Blue commented on PARQUET-98:
----------------------------------
Could you try out the problem data with VisualVM or another profiling tool to
see if there's anything funny going on? It doesn't look like your code is the
problem. If anything, I would have guessed the first version runs more slowly
because it converts a String to Binary each time it runs.
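The per-invocation conversion Ryan mentions can be sketched without the Parquet classes (Binary and ColumnPredicates belong to the Parquet API; the byte[]-based stand-in below is an assumption, used only to keep the sketch self-contained). Hoisting the conversion out of the lambda means the target bytes are computed once per filter, not once per record:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Predicate;

public class HoistedPredicate {

    // Mirrors the first filter: the String is converted to bytes
    // on every record the predicate is applied to.
    static Predicate<byte[]> perCall(String value) {
        return input -> Arrays.equals(
                value.getBytes(StandardCharsets.UTF_8), input);
    }

    // Hoisted variant: the conversion happens once, when the
    // predicate is built, and the lambda captures the result.
    static Predicate<byte[]> hoisted(String value) {
        byte[] target = value.getBytes(StandardCharsets.UTF_8);
        return input -> Arrays.equals(target, input);
    }

    public static void main(String[] args) {
        byte[] record = "foo".getBytes(StandardCharsets.UTF_8);
        System.out.println(perCall("foo").test(record));  // true
        System.out.println(hoisted("foo").test(record));  // true
        System.out.println(hoisted("bar").test(record));  // false
    }
}
{code}

Under that reasoning the first version would allocate more per record, which is why the reported result (the FilterPredicate version being the slower one) is surprising and worth profiling.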
> filter2 API performance regression
> ----------------------------------
>
> Key: PARQUET-98
> URL: https://issues.apache.org/jira/browse/PARQUET-98
> Project: Parquet
> Issue Type: Bug
> Reporter: Viktor Szathmary
>
> The new filter API seems to be much slower (or perhaps I'm using it wrong :)
> Code using an UnboundRecordFilter:
> {code:java}
> ColumnRecordFilter.column(column,
>     ColumnPredicates.applyFunctionToBinary(
>         input -> Binary.fromString(value).equals(input)));
> {code}
> vs. code using FilterPredicate:
> {code:java}
> eq(binaryColumn(column), Binary.fromString(value));
> {code}
> The latter runs about twice as slowly on the same Parquet file (built using
> 1.6.0rc2).
> Note: the reader is constructed using
> {code:java}
> ParquetReader.builder(new ProtoReadSupport(), path).withFilter(filter).build()
> {code}
> The approach based on the new filter API seems to create a whole lot more
> garbage (perhaps because it reconstructs all the rows?).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)