[
https://issues.apache.org/jira/browse/ORC-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087210#comment-17087210
]
Owen O'Malley commented on ORC-620:
-----------------------------------
My proposed interface would look like:
{code:java}
/**
* Set a row level filter.
* This is an advanced feature that allows the caller to specify
* a list of columns that are read first and then a filter that
* is called to determine which rows if any should be read.
*
* Users should expect the batches that come from the reader
* to use the selected array set by their filter.
*
* Use cases for this are predicates that SearchArgs can't represent,
* such as relationships between columns (e.g. columnA == columnB).
* @param filterColumns a comma-separated list of the column names that
* are read before the filter is applied. Only top
* level columns in the reader's schema can be used
* here and they must not be duplicated.
* @param filter a function to perform filtering during the call to
* RecordReader.nextBatch. This function should not reference
* any static fields nor modify the passed in ColumnVectors.
* It will be passed:
* <ol>
* <li>An array of the data for the filter columns in the
* same order as they were given in filterColumns</li>
* <li>A MutableFilterContext to set the filter outputs</li>
* </ol>
* The return value should be true if any rows passed the
* filter.
* @return this
*/
public Options setRowFilter(String filterColumns,
BiFunction<ColumnVector[], MutableFilterContext, Boolean> filter) {
{code}
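To illustrate the intended usage, a filter for the columnA == columnB case might look roughly like the sketch below. This is not ORC code: LongColumn and FilterContext are simplified, hypothetical stand-ins for ColumnVector and MutableFilterContext, and the sketch assumes the row count is available from the context, which the proposal above does not specify.

```java
import java.util.Arrays;
import java.util.function.BiFunction;

class RowFilterSketch {
  // Hypothetical stand-in for a ColumnVector that holds long values.
  static class LongColumn {
    final long[] vector;
    LongColumn(long... values) { this.vector = values; }
  }

  // Hypothetical stand-in for MutableFilterContext.
  static class FilterContext {
    final int size;          // rows in the batch (assumed to be known here)
    boolean selectedInUse;   // true once the filter has set a selection
    int[] selected;          // indexes of the rows that passed
    int selectedSize;        // number of valid entries in selected
    FilterContext(int size) { this.size = size; }
  }

  // Keeps the rows where the first filter column equals the second.
  // Reads the columns without modifying them and touches no static fields.
  static Boolean keepEqualRows(LongColumn[] cols, FilterContext ctx) {
    long[] a = cols[0].vector;
    long[] b = cols[1].vector;
    int[] selected = new int[ctx.size];
    int count = 0;
    for (int row = 0; row < ctx.size; row++) {
      if (a[row] == b[row]) {
        selected[count++] = row;
      }
    }
    ctx.selectedInUse = true;
    ctx.selected = selected;
    ctx.selectedSize = count;
    return count > 0;  // true if any rows passed the filter
  }

  public static void main(String[] args) {
    // The shape of the proposed filter argument, using the stand-in types.
    BiFunction<LongColumn[], FilterContext, Boolean> filter =
        RowFilterSketch::keepEqualRows;
    LongColumn[] cols = {
        new LongColumn(1, 2, 3, 4),   // columnA
        new LongColumn(1, 5, 3, 0)    // columnB
    };
    FilterContext ctx = new FilterContext(4);
    boolean any = filter.apply(cols, ctx);
    System.out.println(any + " "
        + Arrays.toString(Arrays.copyOf(ctx.selected, ctx.selectedSize)));
  }
}
```

With the real types, the caller would presumably pass "columnA,columnB" as filterColumns so that the array order matches the order the filter expects.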
> Modify the row filter API to use BiFunction
> -------------------------------------------
>
> Key: ORC-620
> URL: https://issues.apache.org/jira/browse/ORC-620
> Project: ORC
> Issue Type: Bug
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Priority: Major
>
> The current API for row filtering has a couple of issues:
> * The filter function is passed a VectorizedRowBatch instead of a
> MutableFilterContext.
> * The filter needs to know the precise location in the schema of each field
> it reads.
> I'd like to propose changing it from:
> {code:java}Consumer<VectorizedRowBatch>{code}
> to
> {code:java}BiFunction<ColumnVector[], MutableFilterContext, Boolean>{code}
> That has the advantage that the data that the function should read is
> explicitly passed to it and we remove the dependence on VectorizedRowBatch.