GitHub user lightjacket closed a discussion: Custom Predicates for ParquetExec 
and Parquet row indexes

Hello,

I've been looking at adding some indexing on parquet files by creating some 
auxiliary files and using those to create predicates for the ParquetExec struct 
to use. The short of it is I have an optimizer in place to replace existing 
ParquetExec instances with new instances using a custom PhysicalExpr for the 
predicate:

```rust
let new_exec = ParquetExec::new(
    exec.base_config().clone(),
    Some(Arc::new(MyCustomPhysicalExpr::new(/* args */))),
    None
).with_pushdown_filters(true);
```

My custom PhysicalExpr then has an `evaluate` function like the following:
```rust
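fn evaluate(&self, batch: &RecordBatch) -> datafusion::common::Result<ColumnarValue> {
    // Row numbers (relative to the parquet file) selected by my auxiliary index.
    let my_index: HashSet<usize> = self.run_my_index(/* args */);
    // `i` is the position within this batch, not necessarily within the file.
    let indexes = (0..batch.num_rows())
        .map(|i| my_index.contains(&i))
        .collect::<Vec<_>>();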
    Ok(ColumnarValue::Array(Arc::new(BooleanArray::from(indexes))))
}
```

`my_index` currently holds a set of row numbers from the parquet file, computed 
from the args passed in. If there is only one RecordBatch, this happens to work 
(though I am not sure I can assume the batch has not already been filtered 
somehow).
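
To make the gap concrete, here is a minimal sketch in plain Rust (no DataFusion 
types) of the mask I would like to build. `batch_mask` and `batch_offset` are 
hypothetical names of my own; the offset is exactly the piece I don't know how 
to obtain inside `evaluate`, and the sketch assumes the batch holds a 
contiguous, unfiltered run of rows from the file.

```rust
use std::collections::HashSet;

/// Build a per-batch boolean mask from file-level row numbers.
///
/// `selected_rows` holds row numbers counted from the start of the parquet file.
/// `batch_offset` is the file-level row number of the batch's first row. It is a
/// hypothetical input: nothing in my `evaluate` currently provides it.
fn batch_mask(
    selected_rows: &HashSet<usize>,
    batch_offset: usize,
    batch_len: usize,
) -> Vec<bool> {
    (0..batch_len)
        // Translate the batch-local position into a file-level row number.
        .map(|i| selected_rows.contains(&(batch_offset + i)))
        .collect()
}
```

With `batch_offset` fixed at 0 this reduces to what my `evaluate` does today, 
which is why it only gives the right answer for the first (or only) batch.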

So my question is whether there is a way, given a RecordBatch, to recover the 
original row indexes of its rows in the parquet file. I haven't seen anything 
yet, but I'd imagine I am looking in the wrong places. Alternatively, if the 
approach is just wrong and there's a better way to use indexes like this in 
DataFusion, I'd be very interested.

Thank you!



GitHub link: https://github.com/apache/datafusion/discussions/9341

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: 
[email protected]

