[ 
https://issues.apache.org/jira/browse/ORC-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087210#comment-17087210
 ] 

Owen O'Malley commented on ORC-620:
-----------------------------------

My proposed interface would look like:


{code:java}
    /**
     * Set a row level filter.
     * This is an advanced feature that allows the caller to specify
     * a list of columns that are read first and then a filter that
     * is called to determine which rows, if any, should be read.
     *
     * Users should expect the batches that come from the reader
     * to use the selected array set by their filter.
     *
     * Use cases for this are predicates that SearchArgs can't represent,
     * such as relationships between columns (e.g. columnA == columnB).
     * @param filterColumns a comma separated list of the column names that
     *                      are read before the filter is applied. Only top
     *                      level columns in the reader's schema can be used
     *                      here and they must not be duplicated.
     * @param filter a function to perform filtering during the call to
     *               RecordReader.nextBatch. This function should not reference
     *               any static fields nor modify the passed in ColumnVectors.
     *               It will be passed:
     *               <ol>
     *               <li>An array of the data for the filter columns in the
     *                   same order as they were given in filterColumns</li>
     *               <li>A MutableFilterContext to set the filter outputs</li>
     *               </ol>
     *               The return value should be true if any rows passed the
     *               filter.
     * @return this
     */
    public Options setRowFilter(String filterColumns,
        BiFunction<ColumnVector[], MutableFilterContext, Boolean> filter) {

{code}
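
To make the contract above concrete, here is a minimal sketch of a filter for the columnA == columnB case from the Javadoc. It is not the ORC implementation: LongColumn and FilterContext are hypothetical stand-ins for ColumnVector and MutableFilterContext so that the example compiles on its own, and the setFilterContext method, the batchSize field, and the selectedRows helper are assumptions made for illustration. The sketch also ignores nulls and repeated values, which a real filter would need to handle.

```java
import java.util.Arrays;
import java.util.function.BiFunction;

class RowFilterSketch {

  /** Hypothetical stand-in for a long-valued ColumnVector. */
  static class LongColumn {
    final long[] vector;
    LongColumn(long... values) { this.vector = values; }
  }

  /** Hypothetical stand-in for MutableFilterContext (assumed shape). */
  static class FilterContext {
    final int batchSize;
    boolean selectedInUse;
    int[] selected = new int[0];
    int selectedSize;

    FilterContext(int batchSize) { this.batchSize = batchSize; }

    void setFilterContext(boolean inUse, int[] sel, int size) {
      this.selectedInUse = inUse;
      this.selected = sel;
      this.selectedSize = size;
    }
  }

  /**
   * Keeps the rows where the first filter column equals the second.
   * The lambda only reads the columns and writes its result to the context.
   */
  static final BiFunction<LongColumn[], FilterContext, Boolean> COLUMNS_EQUAL =
      (cols, ctx) -> {
        long[] a = cols[0].vector;   // first name in filterColumns
        long[] b = cols[1].vector;   // second name in filterColumns
        int[] selected = new int[ctx.batchSize];
        int count = 0;
        for (int row = 0; row < ctx.batchSize; row++) {
          if (a[row] == b[row]) {
            selected[count++] = row;
          }
        }
        ctx.setFilterContext(true, selected, count);
        return count > 0;            // true if any row passed the filter
      };

  /** Illustrative helper: runs the filter and returns the selected row ids. */
  static int[] selectedRows(long[] a, long[] b) {
    FilterContext ctx = new FilterContext(a.length);
    LongColumn[] cols = { new LongColumn(a), new LongColumn(b) };
    COLUMNS_EQUAL.apply(cols, ctx);
    return Arrays.copyOf(ctx.selected, ctx.selectedSize);
  }

  public static void main(String[] args) {
    int[] rows = selectedRows(new long[] {1, 2, 3, 4}, new long[] {1, 0, 3, 5});
    System.out.println(Arrays.toString(rows));
  }
}
```

Because the filter receives its columns by position, it does not need to know where they sit in the reader's schema; with the proposed method it would be registered along the lines of setRowFilter("columnA,columnB", filter).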


> Modify the row filter API to use BiFunction
> -------------------------------------------
>
>                 Key: ORC-620
>                 URL: https://issues.apache.org/jira/browse/ORC-620
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Major
>
> The current API for row filtering has a couple of issues:
>  * The filter function is passed a VectorizedRowBatch instead of a 
> MutableFilterContext.
>  * The filter needs to know the precise location for the fields it needs out 
> of the schema.
> I'd like to propose changing it from:
> {code:java}Consumer<VectorizedRowBatch>{code}
> to
> {code:java}BiFunction<ColumnVector[], MutableFilterContext, Boolean>{code}
> That has the advantage that the data that the function should read is 
> explicitly passed to it and we remove the dependence on VectorizedRowBatch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
