GitHub user r2evans created a discussion: advice on how to add "filter" to parquet-rewrite?
I'm proficient in other languages but not very familiar with Rust, and I want to learn it. My use case looks like a good starting point for adding functionality to existing code. I'll use https://doc.rust-lang.org/stable/book/ as a reference for coding standards. What I'm asking for is gotchas or advice on best practices for where I'm trying to go with this.

I'd like to be able to add simple filters to `src/bin/parquet-rewrite.rs` so that I can do either of:

```
# implicit 0+
parquet-rewrite --input in.pq --output out.pq 'Field1 != "hello"' 'Field2 < 100'
# explicit
parquet-rewrite --input in.pq --output out.pq --filter 'Field1 != "hello"' --filter 'Field2 < 100'
```

(In)equality is my first "need", since I'm trying to remove specific IDs from one field. I'd love to be able to support a set of simple/basic filters, and if it's easy I'll extend into sets, missingness, etc.

- basics: `==`, `!=`, `>`, `>=`, `<`, and `<=` (recognizing that `==` on numeric values can run into IEEE-754 issues)
- boolean: `Field` and `! Field`, though this may be easier as `Field == true` or `Field == 1`?
- sets: `Field in ('aa','bb','cc')` and `Field not in (..)`
- missingness: `Field is not null`; I'm not sure whether `Field != null` should work
- "and" is the default between multiple conditionals; I would like to support `or`
- column present: really only useful in a generic sense when combined with another condition, such as `Field not exist or Field > 5`
- (bigger stretch) paren grouping: `Field1 != 'abc' or (Field2 == 'xyz' and Field3 > 100)`

(Perhaps saying "SQL-like filter" might be sufficient for many things; I'm sure I'm missing something in that comparison :-)

I am not familiar with the Rust ecosystem. If bringing in another dependency is required to easily support this (such as parsing my SQL-like syntax above), I'll learn that.
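As a rough illustration of the "basics" item in the list above, a single filter string such as `Field2 < 100` could be split into a column name, an operator, and a literal using only the standard library. This is a minimal sketch, not existing `parquet-rewrite` code: the names `Op`, `Literal`, `Comparison`, `parse_comparison`, and `parse_literal` are hypothetical, and it assumes column names contain none of the characters `=`, `!`, `<`, `>`.

```rust
/// Comparison operators from the "basics" list (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op { Eq, NotEq, Lt, LtEq, Gt, GtEq }

/// Right-hand-side literal of a filter (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Literal { Bool(bool), Int(i64), Float(f64), Str(String) }

/// One parsed filter such as `Field2 < 100` (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Comparison { pub column: String, pub op: Op, pub value: Literal }

// Two-character tokens are listed first so that "<=" is not read as "<".
const OPS: &[(&str, Op)] = &[
    ("==", Op::Eq), ("!=", Op::NotEq), ("<=", Op::LtEq),
    (">=", Op::GtEq), ("<", Op::Lt), (">", Op::Gt),
];

/// Interpret the text to the right of the operator.
fn parse_literal(raw: &str) -> Result<Literal, String> {
    // Quoted text (single or double quotes) is always a string.
    for quote in ['\'', '"'] {
        if let Some(inner) = raw.strip_prefix(quote).and_then(|s| s.strip_suffix(quote)) {
            return Ok(Literal::Str(inner.to_string()));
        }
    }
    // Unquoted text is tried as bool, then integer, then float.
    if let Ok(b) = raw.parse::<bool>() { return Ok(Literal::Bool(b)); }
    if let Ok(i) = raw.parse::<i64>() { return Ok(Literal::Int(i)); }
    if let Ok(f) = raw.parse::<f64>() { return Ok(Literal::Float(f)); }
    Err(format!("unrecognized literal {raw:?}; strings must be quoted"))
}

/// Parse `<column> <op> <literal>`; whitespace around the operator is optional.
pub fn parse_comparison(input: &str) -> Result<Comparison, String> {
    // The first operator token found is taken as the comparison operator.
    // This relies on the assumption that column names contain no operator characters.
    let (pos, token, op) = input
        .char_indices()
        .find_map(|(i, _)| {
            OPS.iter()
                .find(|(tok, _)| input[i..].starts_with(*tok))
                .map(|(tok, op)| (i, *tok, *op))
        })
        .ok_or_else(|| format!("no comparison operator in {input:?}"))?;

    let column = input[..pos].trim();
    let raw = input[pos + token.len()..].trim();
    if column.is_empty() || raw.is_empty() {
        return Err(format!("expected `<column> <op> <value>`, got {input:?}"));
    }
    Ok(Comparison { column: column.to_string(), op, value: parse_literal(raw)? })
}
```

Each command-line filter would go through `parse_comparison` once, and the resulting `Comparison` values could then be combined with the default "and". Sets, `or`, and paren grouping need a real tokenizer or parser, which is where a parsing dependency may start to pay off.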
Ultimately, I'm hoping for advice from experienced arrow/parquet users/rustaceans along these lines:

- I've seen issues about "filter pushdown". Is that a good place to start looking for adding this capability to `parquet-rewrite`?
- When comparing different types of data (string vs. number), some languages are quite permissive (auto-casting, which can introduce logical errors), while others complain or dump core. What community best practices should I follow to guard against problems here?
- Some languages have fancy-looking "efficiencies" for iterating through data (comprehensions, list/vector processing, etc.). Does this toolset (or Rust in general) have strong recommendations for iterating over each row? For instance, iterating over all conditions for each row, or iterating over rows for each condition.
- I think I can use row-wise operations within the existing batch-wise ops. Is there a more efficient way to go?

If I'm even partially successful, I'm happy to submit a PR for inclusion here if others find value in this, but that's not a requirement for me (local use only). (Given my lack of experience with Rust, a good review from others would certainly be justified.)

GitHub link: https://github.com/apache/arrow-rs/discussions/8467
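To make the "conditions per row versus rows per condition" question concrete, here is a dependency-free sketch of the column-wise option: each condition is evaluated over a whole column to produce a boolean mask, the masks are AND-ed, and the combined mask selects rows. arrow-rs provides comparison and filter kernels in `arrow::compute` that work on whole arrays in a similar spirit, but the code below deliberately uses plain `Vec`s, and every name in it (`Column`, `Value`, `Cmp`, `mask`, `and`, `filter`) is hypothetical. It also assumes nulls never match, which is a simplification of SQL's three-valued logic.

```rust
/// A toy nullable column (hypothetical; stands in for a typed Arrow array).
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int(Vec<Option<i64>>),
    Str(Vec<Option<String>>),
}

/// A literal to compare against (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value { Int(i64), Str(String) }

/// A subset of comparison operators, enough for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Cmp { Eq, NotEq, Lt, Gt }

fn compare<T: PartialOrd>(lhs: &T, cmp: Cmp, rhs: &T) -> bool {
    match cmp {
        Cmp::Eq => lhs == rhs,
        Cmp::NotEq => lhs != rhs,
        Cmp::Lt => lhs < rhs,
        Cmp::Gt => lhs > rhs,
    }
}

/// Evaluate one condition over a whole column, yielding one bool per row.
/// Assumption: a null never matches. A column/literal type mismatch is
/// reported as an error instead of being silently cast.
pub fn mask(column: &Column, cmp: Cmp, value: &Value) -> Result<Vec<bool>, String> {
    match (column, value) {
        (Column::Int(vals), Value::Int(rhs)) => Ok(vals
            .iter()
            .map(|v| v.as_ref().map_or(false, |lhs| compare(lhs, cmp, rhs)))
            .collect()),
        (Column::Str(vals), Value::Str(rhs)) => Ok(vals
            .iter()
            .map(|v| v.as_ref().map_or(false, |lhs| compare(lhs, cmp, rhs)))
            .collect()),
        _ => Err(format!("type mismatch between column and literal {value:?}")),
    }
}

/// Combine two masks with logical AND (the default between conditions).
pub fn and(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(x, y)| *x && *y).collect()
}

/// Keep only the values whose mask entry is true.
pub fn filter<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| v.clone())
        .collect()
}
```

The type check happens once per condition per batch, not once per row, and a mismatch surfaces as an `Err` that the caller has to handle. Whether the tool should reject mismatched types or attempt an explicit cast is a design choice this sketch leaves open.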
