GitHub user lightjacket closed a discussion: Custom Predicates for ParquetExec
and Parquet row indexes
Hello,
I've been looking at adding some indexing on parquet files by creating some
auxiliary files and using those to create predicates for the ParquetExec struct
to use. The short of it is I have an optimizer in place to replace existing
ParquetExec instances with new instances using a custom PhysicalExpr for the
predicate:
```rust
let new_exec = ParquetExec::new(
    exec.base_config().clone(),
    Some(Arc::new(MyCustomPhysicalExpr::new(/* args */))),
    None,
).with_pushdown_filters(true);
```
My custom PhysicalExpr then has an `evaluate` function like the following:
```rust
fn evaluate(&self, batch: &RecordBatch) -> datafusion::common::Result<ColumnarValue> {
    // Row numbers (relative to the parquet file) selected by the index.
    let my_index: HashSet<usize> = self.run_my_index(/* args */);
    // Note: `i` is a position within this batch, not within the file.
    let indexes = (0..batch.num_rows())
        .map(|i| my_index.contains(&i))
        .collect::<Vec<_>>();
    Ok(ColumnarValue::Array(Arc::new(BooleanArray::from(indexes))))
}
```
`my_index` currently has a set of row numbers from the parquet file based on
the args passed in. If there is only one RecordBatch, this happens to work
(though I am not sure if I can assume the batch has not already been filtered
somehow).
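To make the multi-batch problem concrete, here is a minimal sketch using only the standard library. The `mask_for_batch` helper and its `batch_start` parameter are hypothetical: `batch_start` stands for the file-level row number of the batch's first row, which is exactly the piece of information `evaluate` does not receive. The sketch assumes batches arrive in file order with no rows skipped beforehand, which may not hold once row groups or pages are pruned.

```rust
use std::collections::HashSet;

/// Builds a keep/drop mask for one batch.
/// `keep_rows` holds file-level row numbers selected by the index.
/// `batch_start` (hypothetical) is the file-level row number of the
/// batch's first row; a RecordBatch does not carry this value.
fn mask_for_batch(keep_rows: &HashSet<usize>, batch_start: usize, num_rows: usize) -> Vec<bool> {
    (0..num_rows)
        .map(|i| keep_rows.contains(&(batch_start + i)))
        .collect()
}

fn main() {
    // The index selects file rows 1 and 5.
    let keep_rows: HashSet<usize> = [1, 5].into_iter().collect();

    // A six-row file read as two batches of three rows each.
    let first = mask_for_batch(&keep_rows, 0, 3);

    // Second batch with the correct offset: covers file rows 3, 4, 5.
    let second = mask_for_batch(&keep_rows, 3, 3);

    // Second batch treated as if it started at row 0, which is what
    // iterating over `0..batch.num_rows()` alone amounts to.
    let second_batch_local = mask_for_batch(&keep_rows, 0, 3);

    println!("first batch:                {:?}", first);
    println!("second batch (with offset): {:?}", second);
    println!("second batch (batch-local): {:?}", second_batch_local);
}
```

With the offset, the second batch keeps its last row (file row 5). Without it, the batch-local mask keeps the middle row (file row 4), which the index never selected.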
So my question is whether there is a way, given a RecordBatch, to recover the
original row indexes of its records in the parquet file. I haven't seen anything yet,
but I'd imagine I am looking in the wrong places. Alternatively, if the
approach is just wrong and there's a better way to use indexes like this in
DataFusion, I'd be very interested.
Thank you!
GitHub link: https://github.com/apache/datafusion/discussions/9341
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to:
[email protected]