Re: [Python] A user friendly way to filter parquet partitions

Bill Zhao Thu, 18 Feb 2021 20:03:40 -0800

Hi Micah,

Thank you for looking into this matter.


I understand the goals of keeping dependencies minimal and of solving
the problem in C++ so that all language bindings benefit.
Given that, we cannot switch to the condition package as I proposed.

However, I had a difficult time getting partition filtering to work at
first. I ended up fixing a bug and improving the documentation in
https://issues.apache.org/jira/browse/ARROW-10574. I think the condition
package can help ease that pain for Python users.

How about mentioning the condition package in the documentation as a tool
for building pyarrow filters? This would not change the pyarrow code base at
all, and would let users choose whether to use it.
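
For reference, a rough sketch of the filter that has to be written by hand
today for the example condition quoted below,
(f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'], in the list-of-lists (DNF)
form that the filters parameter accepts. The matches() helper is hypothetical
and only illustrates the semantics; it is not part of pyarrow or of the
condition package.

```python
import operator

# DNF form of: (A <= 3 or B != 'b1') and C in ['c1', 'c2']
# Outer list = OR of groups; inner list = AND of (column, op, value) tuples.
filters = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Comparison operators used in the tuples above, mapped to Python callables.
_OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, choices: value in choices,
    "not in": lambda value, choices: value not in choices,
}


def matches(partition, dnf):
    """Hypothetical helper: True if a dict of partition key values
    satisfies at least one AND-group of the DNF filter."""
    return any(
        all(_OPS[op](partition[col], val) for col, op, val in group)
        for group in dnf
    )
```

The same filters list is what would be passed as
pyarrow.parquet.read_table(path, filters=filters). Distributing the "and"
over the "or" by hand, as done above, is the step that is easy to get wrong.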

Thanks,

Weiyang



On Wed, Feb 17, 2021 at 8:48 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Weiyang,
> The library looks interesting, and for python certainly seems like it might
> add a better user experience.
>
> I'm not super active in python maintenance (others who are can hopefully
> chime in).  But my impression is we try to keep dependencies minimal in
> general.
>
> Furthermore, the goal of the C++ library and associated bindings is to push
> as much work down into C++ (ultimately filtering capabilities equivalent to
> Pandas will be built)  so that all languages  can take advantage of the
> same core code.
>
> -Micah
>
>
> On Sun, Feb 14, 2021 at 10:09 PM Bill Zhao <wyz...@gmail.com> wrote:
>
> > Hi Dev team,
> >
> > I created a PyPI package that lets users express conditions in a friendly
> > way. For example, a condition can be written as:
> >
> > (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
> >
> > where A, B, C are partition keys, and f.C == ['c1', 'c2'] means
> > f.C in ['c1', 'c2'].
> >
> > Arbitrary condition objects can be converted to pyarrow filters by calling
> > the to_pyarrow_filter() method, which normalizes the condition to conform
> > to the pyarrow filter specification. The filter can also be converted back
> > to a condition object.
> >
> > We could therefore accept a condition object directly as the filter
> > parameter of the read_table() and ParquetDataset() APIs, as a user-friendly
> > way to create the conditions.
> >
> > Furthermore, the condition object can be used directly to filter partition
> > paths. This could replace the current complex filtering code (both native
> > and Python).
> >
> > For maximum efficiency, filtering with the condition object can be done as
> > follows:
> >
> >    1. read the paths in chunks to keep the memory footprint small;
> >    2. parse the paths into a pandas dataframe;
> >    3. use condition.query(dataframe) to get the filtered dataframe of
> >       paths;
> >    4. use the numexpr backend for the dataframe query for efficiency;
> >    5. concatenate the filtered dataframes of all chunks.
> >
> > For usage details of the package, please see its documentation at:
> >
> > https://condition.readthedocs.io/en/latest/usage.html
> >
> > https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> >
> > What do you think? Your comments and suggestions are appreciated.
> >
> > A JIRA ticket has already been created:
> >
> > https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566
> >
> > Thank you,
> >
> > Weiyang (Bill)
> >
>
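
The chunked filtering steps in the quoted proposal could be sketched roughly
as follows. This uses a plain pandas query string in place of
condition.query(dataframe), whose exact API is not shown in this thread.
parse_paths(), filter_paths(), the hive-style key=value paths and the integer
type of A are all assumptions made for illustration.

```python
import pandas as pd


def parse_paths(paths):
    """Turn hive-style paths such as 'A=1/B=b1/C=c1' into a DataFrame
    with one column per partition key plus the original path."""
    rows = []
    for path in paths:
        row = dict(part.split("=", 1) for part in path.split("/") if "=" in part)
        row["path"] = path
        rows.append(row)
    df = pd.DataFrame(rows)
    df["A"] = df["A"].astype(int)  # assumption: A is an integer partition key
    return df


def filter_paths(paths, query, chunk_size=2):
    """Filter the paths chunk by chunk, then concatenate the surviving rows."""
    kept = []
    for start in range(0, len(paths), chunk_size):
        chunk = parse_paths(paths[start:start + chunk_size])
        # DataFrame.query uses numexpr when it is installed and falls back
        # to the plain Python engine otherwise.
        kept.append(chunk.query(query))
    return pd.concat(kept, ignore_index=True)


paths = [
    "A=1/B=b1/C=c1",
    "A=5/B=b1/C=c1",
    "A=5/B=b2/C=c2",
    "A=1/B=b2/C=c3",
]
query = "(A <= 3 or B != 'b1') and C in ['c1', 'c2']"
result = filter_paths(paths, query)
```

Only one chunk of parsed paths is held in memory at a time; the filtered
results are usually much smaller, so concatenating them at the end is cheap.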
