GitHub user r2evans created a discussion: advice on how to add "filter" to 
parquet-rewrite?

I'm proficient in other languages but not very familiar with Rust, and I'd like 
to learn it. My use case seems like a good starting point, since it adds 
functionality to existing code. I'll use https://doc.rust-lang.org/stable/book/ 
as a reference for coding standards. What I'm asking for is gotchas and advice 
on best practices for where I'm trying to go with this.

I'd like to be able to add simple filters to `src/bin/parquet-rewrite.rs` so 
that I can do either of the following:

```
# implicit 0+
parquet-rewrite --input in.pq --output out.pq 'Field1 != "hello"' 'Field2 < 100'
# explicit
parquet-rewrite --input in.pq --output out.pq --filter 'Field1 != "hello"' --filter 'Field2 < 100'
```

(In)equality is my first "need", since I'm trying to remove specific IDs from 
one field. I'd love to support a set of simple/basic filters and, if it's easy, 
extend into sets, missingness, etc.
- basics: `==`, `!=`, `>`, `>=`, `<`, and `<=` (recognizing that `==` on 
floating-point values can run into IEEE-754 issues)
- boolean: `Field` and `! Field`, though this may be easier as `Field == true` 
or `Field == 1`?
- sets: `Field in ('aa','bb','cc')` and `Field not in (..)`
- missingness: `Field is not null`, not sure if `Field != null` works
- "and" is default between multiple conditionals, would like to support `or`
- column presence, which is really only useful in a generic sense when combined 
with another condition, such as `Field not exist or Field > 5`
- (bigger stretch) paren grouping: `Field1 != 'abc' or (Field2 == 'xyz' and 
Field3 > 100)`

(Perhaps saying "SQL-like filter" would cover most of this, though I'm sure 
I'm missing something in that comparison :-)
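
To make the wish list above concrete, here is a minimal sketch of how such filters could be represented and how the simplest case (`Field <op> literal`) could be parsed using only the standard library. Every name in it (`CmpOp`, `Literal`, `Expr`, `parse_simple`) is hypothetical and not part of arrow-rs or `parquet-rewrite`. The `InSet`, `IsNull`, `And`, and `Or` variants only show where the later bullets would fit; nothing here parses them, and field names containing operator characters are not handled.

```rust
use std::cmp::Reverse;

/// Comparison operators from the "basics" bullet.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CmpOp { Eq, Ne, Gt, Ge, Lt, Le }

/// Literal values a condition can compare a field against.
#[derive(Debug, Clone, PartialEq)]
pub enum Literal { Int(i64), Float(f64), Str(String), Bool(bool) }

/// Hypothetical expression tree covering the bullets above.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Cmp { field: String, op: CmpOp, value: Literal },
    InSet { field: String, values: Vec<Literal>, negated: bool },
    IsNull { field: String, negated: bool },
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

/// Parse a quoted string, `true`/`false`, an integer, or a float.
fn parse_literal(s: &str) -> Result<Literal, String> {
    let is_quoted = |q: char| s.len() >= 2 && s.starts_with(q) && s.ends_with(q);
    if is_quoted('"') || is_quoted('\'') {
        // Quotes are single-byte ASCII, so byte slicing is safe here.
        return Ok(Literal::Str(s[1..s.len() - 1].to_string()));
    }
    match s {
        "true" => return Ok(Literal::Bool(true)),
        "false" => return Ok(Literal::Bool(false)),
        _ => {}
    }
    if let Ok(i) = s.parse::<i64>() {
        return Ok(Literal::Int(i));
    }
    if let Ok(f) = s.parse::<f64>() {
        return Ok(Literal::Float(f));
    }
    Err(format!("unrecognized literal: {s:?}"))
}

/// Parse a single `Field <op> literal` condition.
pub fn parse_simple(input: &str) -> Result<Expr, String> {
    const OPS: [(&str, CmpOp); 6] = [
        ("==", CmpOp::Eq), ("!=", CmpOp::Ne), (">=", CmpOp::Ge),
        ("<=", CmpOp::Le), (">", CmpOp::Gt), ("<", CmpOp::Lt),
    ];
    // Take the leftmost operator; at the same position prefer the longer
    // one, so that `<=` is not read as `<`.
    let (pos, sym, op) = OPS
        .iter()
        .filter_map(|&(sym, op)| input.find(sym).map(|pos| (pos, sym, op)))
        .min_by_key(|&(pos, sym, _)| (pos, Reverse(sym.len())))
        .ok_or_else(|| format!("no comparison operator in {input:?}"))?;
    let field = input[..pos].trim();
    let value = input[pos + sym.len()..].trim();
    if field.is_empty() || value.is_empty() {
        return Err(format!("expected `Field <op> literal`, got {input:?}"));
    }
    Ok(Expr::Cmp { field: field.to_string(), op, value: parse_literal(value)? })
}
```

With this sketch, `parse_simple("Field2 < 100")` yields an `Expr::Cmp` with `CmpOp::Lt` and `Literal::Int(100)`, while a string with no operator returns an `Err`. Supporting `or`, `in (...)`, and parentheses would probably call for a real tokenizer or a parsing crate rather than string searching.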

I am not familiar with the Rust ecosystem. If bringing in another dependency is 
required to support this easily (for example, to parse the SQL-like syntax 
above), I'll learn it.

Ultimately, I'm hoping for advice from experienced arrow/parquet 
users/Rustaceans along these lines:

- I've seen issues about "filter pushdown". Is that a good place to start 
looking for adding this capability to `parquet-rewrite`?
- When comparing different types of data (string vs. number), some languages 
are quite permissive (auto-casting, which can hide logical errors), while 
others complain or dump core. What community best practices should I follow to 
guard against problems here?
- Some languages have fancy-looking "efficiencies" for iterating through data 
(comprehensions, list/vector processing, etc.). Does this toolset (or Rust in 
general) have strong recommendations for iterating over rows? For instance, 
should I iterate over all conditions for each row, or over all rows for each 
condition?
- I think I can use row-wise operations within the existing batch-wise ops. Is 
there a more efficient way to go?
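
To illustrate the "rows for each condition" option from the questions above, here is a toy, standard-library-only sketch: each condition is evaluated over a whole column to produce a boolean keep mask, the masks are ANDed together, and the combined mask is applied once. The `Column`, `Value`, `Op`, `mask`, `and_masks`, and `filter` names are all hypothetical stand-ins, not arrow-rs types. Two assumptions are baked in: a type mismatch between column and literal is reported as an error instead of being cast silently, and nulls never match, as in SQL. As far as I can tell, arrow-rs offers the same shape through its comparison kernels plus `arrow::compute::and` and `arrow::compute::filter_record_batch`, but that should be checked against the current docs.

```rust
/// A toy column: one typed vector per column, with `None` marking nulls.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int(Vec<Option<i64>>),
    Str(Vec<Option<String>>),
}

/// Literal on the right-hand side of a condition.
#[derive(Debug, Clone, PartialEq)]
pub enum Value { Int(i64), Str(String) }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op { Eq, Ne, Gt, Ge, Lt, Le }

/// Apply one comparison operator to two values of the same type.
fn apply<T: PartialOrd + ?Sized>(op: Op, lhs: &T, rhs: &T) -> bool {
    match op {
        Op::Eq => lhs == rhs,
        Op::Ne => lhs != rhs,
        Op::Gt => lhs > rhs,
        Op::Ge => lhs >= rhs,
        Op::Lt => lhs < rhs,
        Op::Le => lhs <= rhs,
    }
}

/// Evaluate one condition over a whole column, yielding a keep mask.
/// Assumption: nulls never match, and mismatched types are an error.
pub fn mask(col: &Column, op: Op, value: &Value) -> Result<Vec<bool>, String> {
    match (col, value) {
        (Column::Int(xs), Value::Int(v)) => Ok(xs
            .iter()
            .map(|x| x.as_ref().map_or(false, |x| apply(op, x, v)))
            .collect()),
        (Column::Str(xs), Value::Str(v)) => Ok(xs
            .iter()
            .map(|x| x.as_deref().map_or(false, |x| apply(op, x, v.as_str())))
            .collect()),
        _ => Err("type mismatch between column and literal".to_string()),
    }
}

/// AND all masks together; with no masks, every row is kept.
pub fn and_masks(n_rows: usize, masks: &[Vec<bool>]) -> Vec<bool> {
    let mut keep = vec![true; n_rows];
    for m in masks {
        for (k, b) in keep.iter_mut().zip(m) {
            *k = *k && *b;
        }
    }
    keep
}

/// Keep only the values whose mask entry is true.
pub fn filter<T: Clone>(values: &[T], keep: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(keep)
        .filter(|(_, k)| **k)
        .map(|(v, _)| v.clone())
        .collect()
}
```

The appeal of this layout is that the type check happens once per condition rather than once per row, and the inner loops run over a single homogeneous vector.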

If I'm even partially successful, I'm happy to submit a PR for inclusion here 
if others find value in this, but it's not a requirement for me (local use 
only). (Due to my lack of experience with Rust, a good review from others would 
certainly be justified.)


GitHub link: https://github.com/apache/arrow-rs/discussions/8467
