Re: [Python] A user friendly way to filter parquet partitions

Micah Kornfield Wed, 17 Feb 2021 20:48:02 -0800

Hi Weiyang,
The library looks interesting and, for Python, certainly seems like it might
offer a better user experience.


I'm not very active in Python maintenance (others who are can hopefully
chime in), but my impression is that we try to keep dependencies minimal in
general.

Furthermore, the goal of the C++ library and associated bindings is to push
as much work as possible down into C++ (ultimately, filtering capabilities
equivalent to Pandas will be built) so that all languages can take advantage
of the same core code.

-Micah


On Sun, Feb 14, 2021 at 10:09 PM Bill Zhao <[email protected]> wrote:

> Hi Dev team,
>
> I created a pypi package to allow user friendly expression of conditions.
> For example, a condition can be written as:
>
> (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
>
> where A, B, C are partition keys, and f.C == ['c1', 'c2'] means
> f.C in ['c1', 'c2'].
>
> Arbitrary condition objects can be converted to pyarrow's filters by
> calling their to_pyarrow_filter() method, which normalizes the condition to
> conform to the pyarrow filter specification. The filter can also be
> converted back to a condition object.
>
> We can therefore accept a condition object directly as the filter parameter
> in the read_table() and ParquetDataset() APIs, as a user friendly way to
> create the conditions.
>
> Furthermore, the condition object can be used directly to filter partition
> paths. This can replace the current complex filtering code (both native
> and Python).
>
> For maximum efficiency, filtering with the condition object can be done as
> follows:
>
>    1. read the paths in chunks to keep the memory footprint small;
>    2. parse the paths of each chunk into a pandas dataframe;
>    3. use condition.query(dataframe) to get the filtered dataframe of paths;
>    4. use the numexpr backend for the dataframe query for efficiency;
>    5. concat the filtered dataframes of all chunks.
>
> For usage details of the package, please see its documentation at:
>
> https://condition.readthedocs.io/en/latest/usage.html
>
> https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
>
> What do you think? Your discussion and suggestions are appreciated.
>
> A JIRA ticket has already been created:
>
> https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566
>
> Thank you,
>
> Weiyang (Bill)
>
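
For readers unfamiliar with the pyarrow filter specification mentioned in the quoted message: filters are given in disjunctive normal form, as a list of OR-ed groups, each a list of AND-ed (column, op, value) tuples. The sketch below writes the example condition in that form by hand and checks partitions against it in plain Python. The matches() helper and the sample partition values are hypothetical illustrations; they are not part of pyarrow or of the condition package, whose own normalization may differ.

```python
import operator

# DNF form of: (A <= 3 or B != 'b1') and C in ['c1', 'c2']
# Outer list = OR of groups; inner list = AND of (column, op, value) tuples.
dnf_filter = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Comparison operators used in this sketch, keyed by their string spelling.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def matches(partition, dnf):
    """Hypothetical helper: True if a partition (dict of key -> value)
    satisfies at least one AND-group of the DNF filter."""
    return any(
        all(OPS[op](partition[col], val) for col, op, val in group)
        for group in dnf
    )


print(matches({"A": 1, "B": "b1", "C": "c1"}, dnf_filter))
print(matches({"A": 5, "B": "b1", "C": "c1"}, dnf_filter))
```

A list of this shape is what the filters argument of pyarrow.parquet.read_table() accepts, so the proposal amounts to generating it from the friendlier expression.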

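The chunked filtering steps listed in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions: the paths are assumed to be hive-style (key=value segments), the helpers parse_paths() and filter_paths() are hypothetical, and a plain pandas DataFrame.query() string stands in for condition.query(), whose exact behavior is not shown in the thread. The query runs with engine="python" so the sketch has no extra dependency; with numexpr installed, engine="numexpr" would correspond to step 4.

```python
import pandas as pd


def parse_paths(paths):
    """Hypothetical helper: parse hive-style paths such as 'A=1/B=b1/C=c1'
    into a DataFrame with one column per partition key plus the path."""
    rows = []
    for path in paths:
        row = dict(part.split("=", 1) for part in path.split("/") if "=" in part)
        row["path"] = path
        rows.append(row)
    frame = pd.DataFrame(rows)
    frame["A"] = frame["A"].astype(int)  # assumption: key A is an integer
    return frame


def filter_paths(paths, expr, chunk_size=2):
    """Filter a list of paths chunk by chunk, then concat the kept rows."""
    kept = []
    for start in range(0, len(paths), chunk_size):
        chunk = parse_paths(paths[start:start + chunk_size])
        kept.append(chunk.query(expr, engine="python"))
    return pd.concat(kept, ignore_index=True)


paths = [
    "A=1/B=b1/C=c1",
    "A=5/B=b1/C=c1",
    "A=5/B=b2/C=c2",
    "A=2/B=b2/C=c3",
]
expr = "(A <= 3 or B != 'b1') and C in ['c1', 'c2']"
selected = filter_paths(paths, expr)
print(selected["path"].tolist())
```

Only one chunk's dataframe is held in memory at a time, which is the point of step 1; the concatenated result contains just the surviving paths.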