Hi dev team,

I have created a PyPI package that allows user-friendly expression of conditions. For example, a condition can be written as:
(f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']

where A, B, and C are partition keys, and f.C == ['c1', 'c2'] means f.C in ['c1', 'c2'].

An arbitrary condition object can be converted to pyarrow's filters by calling its to_pyarrow_filter() method, which normalizes the condition to conform to the pyarrow filter specification. A filter can also be converted back to a condition object. We could therefore accept a condition object directly as the filter parameter in the read_table() and ParquetDataset() APIs, as a user-friendly way to create filters.

Furthermore, a condition object can be used directly to filter partition paths. This could replace the current complex filtering code (both native and Python). For maximum efficiency, filtering with the condition object can be done as follows:

1. Read the paths in chunks to keep the memory footprint small.
2. Parse the paths of each chunk into a pandas DataFrame.
3. Use condition.query(dataframe) to get the filtered DataFrame of paths.
4. Use the numexpr backend for the DataFrame query for efficiency.
5. Concatenate the filtered DataFrames of all chunks.

For usage details of the package, please see its documentation at https://condition.readthedocs.io/en/latest/usage.html, in particular the section on pyarrow partition filtering: https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering

What do you think? Your discussion and suggestions are appreciated. A JIRA ticket has already been created: https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566

Thank you,
Weiyang (Bill)
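
To make the normalization concrete, here is a plain-Python sketch (not using the package itself, so the package's exact output may differ) of the disjunctive-normal-form filter that to_pyarrow_filter() would target for the example condition above, together with a tiny evaluator showing how such a filter applies to a row of partition values:

```python
# pyarrow filters are expressed in disjunctive normal form (DNF):
# a list of OR'ed groups, where each group is a list of AND'ed
# (column, op, value) tuples.
#
# The example condition
#     (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
# distributes the AND over the OR, yielding two groups; the
# comparison against a list becomes an "in" test.
expected_filter = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

def matches(row, dnf):
    """Evaluate a DNF filter against a dict of partition values."""
    ops = {
        "<=": lambda a, b: a <= b,
        "!=": lambda a, b: a != b,
        "in": lambda a, b: a in b,
    }
    return any(
        all(ops[op](row[col], val) for col, op, val in group)
        for group in dnf
    )

row = {"A": 2, "B": "b1", "C": "c1"}
print(matches(row, expected_filter))  # A <= 3 and C in [...] -> True
```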

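The chunked partition-path filtering described in steps 1-5 can be sketched with plain pandas as follows. The query string here is written by hand as a stand-in for the condition object's query, and the hive-style paths are made up for illustration; pandas picks the numexpr engine automatically when numexpr is installed:

```python
import itertools
import pandas as pd

def parse_path(path):
    """Parse a hive-style partition path like 'A=3/B=b1' into a
    dict of partition keys (values kept as strings here for
    simplicity; a real implementation would cast types)."""
    return dict(part.split("=", 1) for part in path.strip("/").split("/"))

def filter_paths(paths, expr, chunk_size=2):
    """Filter an iterable of partition paths in chunks: parse each
    chunk into a DataFrame indexed by path, run DataFrame.query()
    on it, and concatenate the filtered chunks."""
    it = iter(paths)
    filtered = []
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        df = pd.DataFrame([parse_path(p) for p in chunk], index=chunk)
        filtered.append(df.query(expr))
    return pd.concat(filtered) if filtered else pd.DataFrame()

paths = ["A=1/B=b1", "A=2/B=b2", "A=3/B=b1", "A=4/B=b2"]
# With the package, the query string would come from the
# condition object; here it is hand-written.
result = filter_paths(paths, "B == 'b1'")
print(list(result.index))  # paths whose B partition is 'b1'
```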