Re: [Python] A user friendly way to filter parquet partitions

Wes McKinney Fri, 19 Feb 2021 11:28:37 -0800

We could add a section to the docs that points out some ecosystem
add-on projects.


On Thu, Feb 18, 2021 at 10:03 PM Bill Zhao <[email protected]> wrote:
>
> Hi Micah,
>
> Thank you for looking into this matter.
>
> I understand the goals of keeping dependencies minimal and of solving
> the problem in C++ for multi-language support.
> Given that, we cannot switch to the condition package as I proposed.
>
> However, I had a difficult time getting partition filtering to work in the
> beginning. I actually spent time fixing a bug and also improved the
> documentation, as in
> https://issues.apache.org/jira/browse/ARROW-10574. I think the condition
> package can help alleviate the pain for Python users.
>
> How about mentioning the condition package in the documentation as a tool
> for building pyarrow filters? This way it does not change the pyarrow code
> base at all, and lets users choose whether to use it.
>
> Thanks,
>
> Weiyang
>
>
>
> On Wed, Feb 17, 2021 at 8:48 PM Micah Kornfield <[email protected]> wrote:
>
> > Hi Weiyang,
> > The library looks interesting, and for python certainly seems like it might
> > add a better user experience.
> >
> > I'm not super active in python maintenance (others who are can hopefully
> > chime in).  But my impression is we try to keep dependencies minimal in
> > general.
> >
> > Furthermore, the goal of the C++ library and associated bindings is to push
> > as much work down into C++ (ultimately filtering capabilities equivalent to
> > Pandas will be built)  so that all languages  can take advantage of the
> > same core code.
> >
> > -Micah
> >
> >
> > On Sun, Feb 14, 2021 at 10:09 PM Bill Zhao <[email protected]> wrote:
> >
> > > Hi Dev team,
> > >
> > > I created a pypi package to allow user friendly expression of conditions.
> > > For example, a condition can be written as:
> > >
> > > (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
> > >
> > > where A, B, C are partition keys, and f.C == ['c1', 'c2'] means
> > > f.C in ['c1', 'c2'].
> > >
> > > Arbitrary condition objects can be converted to pyarrow filters by
> > > calling their to_pyarrow_filter() method, which normalizes the condition
> > > to conform to the pyarrow filter specification. The filter can also be
> > > converted back to a condition object.
> > >
> > > We can therefore take a condition object as the filter parameter directly
> > > in the read_table() and ParquetDataset() APIs, as a user-friendly way to
> > > create the conditions.
> > >
> > > Furthermore, the condition object can be used directly to filter partition
> > > paths. This can replace the current complex filtering code (both native
> > > and Python).
> > >
> > > For maximum efficiency, filtering with the condition object can be done
> > > as follows:
> > >
> > >    1. read the paths in chunks to keep the memory footprint small;
> > >    2. parse the paths into a pandas dataframe;
> > >    3. use condition.query(dataframe) to get the filtered dataframe of
> > >       paths;
> > >    4. use the numexpr backend for the dataframe query for efficiency;
> > >    5. concat the filtered dataframes of all chunks.
> > >
> > > For usage details of the package, please see its documentation at:
> > >
> > > https://condition.readthedocs.io/en/latest/usage.html
> > >
> > > https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> > >
> > > What do you think? Your discussion and suggestions are appreciated.
> > >
> > > A JIRA ticket has already been created:
> > >
> > > https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566
> > >
> > > Thank you,
> > >
> > > Weiyang (Bill)
> > >
> >
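
A minimal sketch of the chunked partition-path filtering described in the
thread, for readers unfamiliar with the filter format. pyarrow's filters
argument takes disjunctive normal form: an outer list of OR-ed groups, each
an inner list of AND-ed (key, op, value) tuples. The sketch writes the example
condition (A <= 3 or B != 'b1') and C in ['c1', 'c2'] in that form by hand and
applies it to Hive-style partition paths in chunks. It deliberately swaps the
proposed pandas/numexpr query for plain Python so that it runs with the
standard library alone, and it does not call the condition package.
parse_partition_path, matches, filter_paths, and the sample paths are
hypothetical names introduced for illustration.

```python
from itertools import islice
import operator

# DNF form of: (A <= 3 or B != 'b1') and C in ['c1', 'c2']
# Outer list = OR of groups; inner list = AND of (key, op, value) tuples.
FILTERS = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def parse_partition_path(path):
    """Parse 'A=1/B=b1/C=c1' into {'A': 1, 'B': 'b1', 'C': 'c1'}."""
    keys = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # not a key=value directory (e.g. a file name)
        key, _, raw = segment.partition("=")
        try:
            keys[key] = int(raw)  # assumption: integer-looking values are ints
        except ValueError:
            keys[key] = raw
    return keys


def matches(keys, filters):
    """True if any OR-group has all of its AND-ed predicates satisfied."""
    return any(
        all(k in keys and OPS[op](keys[k], v) for k, op, v in group)
        for group in filters
    )


def filter_paths(paths, filters, chunk_size=1000):
    """Filter paths chunk by chunk so only one chunk is parsed at a time."""
    iterator = iter(paths)
    kept = []
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        kept.extend(
            p for p in chunk if matches(parse_partition_path(p), filters)
        )
    return kept


sample_paths = [
    "A=1/B=b1/C=c1",
    "A=5/B=b1/C=c1",
    "A=5/B=b2/C=c2",
    "A=1/B=b2/C=c3",
]
selected = filter_paths(sample_paths, FILTERS, chunk_size=2)
```

With the sample paths, only the first and third entries satisfy the
condition. The same FILTERS list is in the shape that the filters parameter
of pyarrow.parquet.read_table() accepts.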
