Hi dev team,

I have created a PyPI package that allows user-friendly expression of conditions. For example, a condition can be written as:
(f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']

where A, B, and C are partition keys, and f.C == ['c1', 'c2'] means f.C in ['c1', 'c2'].

An arbitrary condition object can be converted to pyarrow's filters by calling its to_pyarrow_filter() method, which normalizes the condition to conform to the pyarrow filter specification. A filter can also be converted back to a condition object. We could therefore accept a condition object directly as the filter parameter in the read_table() and ParquetDataset() APIs, as a user-friendly way to create filters.

Furthermore, a condition object can be used directly to filter partition paths. This could replace the current complex filtering code (both native and Python). For maximum efficiency, filtering with the condition object can be done as follows:

1. Read the paths in chunks to keep the memory footprint small.
2. Parse the paths of each chunk into a pandas DataFrame.
3. Use condition.query(dataframe) to get the filtered DataFrame of paths.
4. Use the numexpr backend for the DataFrame query for efficiency.
5. Concatenate the filtered DataFrames of all chunks.

For usage details of the package, please see its documentation at https://condition.readthedocs.io/en/latest/usage.html, in particular the section on pyarrow partition filtering: https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering

What do you think? Your discussion and suggestions are appreciated. A JIRA ticket has already been created: https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566

Thank you,
Weiyang (Bill)
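
To make the normalization concrete, here is a plain-Python sketch (not using the package itself, so the package's exact output may differ) of the disjunctive-normal-form filter that to_pyarrow_filter() would target for the example condition above, together with a tiny evaluator showing how such a filter applies to a row of partition values:

```python
# pyarrow filters are expressed in disjunctive normal form (DNF):
# a list of OR'ed groups, where each group is a list of AND'ed
# (column, op, value) tuples.
#
# The example condition
#     (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
# distributes the AND over the OR, yielding two groups; the
# comparison against a list becomes an "in" test.
expected_filter = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

def matches(row, dnf):
    """Evaluate a DNF filter against a dict of partition values."""
    ops = {
        "<=": lambda a, b: a <= b,
        "!=": lambda a, b: a != b,
        "in": lambda a, b: a in b,
    }
    return any(
        all(ops[op](row[col], val) for col, op, val in group)
        for group in dnf
    )

row = {"A": 2, "B": "b1", "C": "c1"}
print(matches(row, expected_filter))  # A <= 3 and C in [...] -> True
```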

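The chunked partition-path filtering described in steps 1-5 can be sketched with plain pandas as follows. The query string here is written by hand as a stand-in for the condition object's query, and the hive-style paths are made up for illustration; pandas picks the numexpr engine automatically when numexpr is installed:

```python
import itertools
import pandas as pd

def parse_path(path):
    """Parse a hive-style partition path like 'A=3/B=b1' into a
    dict of partition keys (values kept as strings here for
    simplicity; a real implementation would cast types)."""
    return dict(part.split("=", 1) for part in path.strip("/").split("/"))

def filter_paths(paths, expr, chunk_size=2):
    """Filter an iterable of partition paths in chunks: parse each
    chunk into a DataFrame indexed by path, run DataFrame.query()
    on it, and concatenate the filtered chunks."""
    it = iter(paths)
    filtered = []
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        df = pd.DataFrame([parse_path(p) for p in chunk], index=chunk)
        filtered.append(df.query(expr))
    return pd.concat(filtered) if filtered else pd.DataFrame()

paths = ["A=1/B=b1", "A=2/B=b2", "A=3/B=b1", "A=4/B=b2"]
# With the package, the query string would come from the
# condition object; here it is hand-written.
result = filter_paths(paths, "B == 'b1'")
print(list(result.index))  # paths whose B partition is 'b1'
```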