Hi Micah,

Thank you for looking into this.

I understand the goals of keeping dependencies minimal and of solving the
problem in C++ so that all language bindings benefit. Given that, we cannot
switch to the condition package as I proposed.

However, I had a difficult time getting partition filtering to work at first.
I spent time fixing a bug and improving the documentation in
https://issues.apache.org/jira/browse/ARROW-10574. I think the condition
package can help alleviate that pain for Python users.

How about mentioning the condition package in the documentation as a tool for
building pyarrow filters? That would not change the pyarrow code base at all,
and users could choose whether or not to use it.

Thanks,
Weiyang

On Wed, Feb 17, 2021 at 8:48 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Weiyang,
> The library looks interesting, and for Python it certainly seems like it
> might add a better user experience.
>
> I'm not super active in Python maintenance (others who are can hopefully
> chime in). But my impression is we try to keep dependencies minimal in
> general.
>
> Furthermore, the goal of the C++ library and associated bindings is to push
> as much work down into C++ (ultimately filtering capabilities equivalent to
> Pandas will be built) so that all languages can take advantage of the
> same core code.
>
> -Micah
>
>
> On Sun, Feb 14, 2021 at 10:09 PM Bill Zhao <wyz...@gmail.com> wrote:
>
> > Hi Dev team,
> >
> > I created a PyPI package that allows conditions to be expressed in a
> > user-friendly way. For example, a condition can be written as:
> >
> > (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
> >
> > where A, B, C are partition keys, and f.C == ['c1', 'c2'] means
> > f.C in ['c1', 'c2'].
> >
> > Arbitrary condition objects can be converted to pyarrow filters by
> > calling the to_pyarrow_filter() method, which normalizes the condition
> > to conform to the pyarrow filter specification. The filter can also be
> > converted back to a condition object.
> >
> > We can therefore accept a condition object directly as the filter
> > parameter in the read_table() and ParquetDataset() APIs, as a
> > user-friendly way to create the conditions.
> >
> > Furthermore, the condition object can be used directly to filter
> > partition paths. This can replace the current complex filtering code
> > (both native and Python).
> >
> > For maximum efficiency, filtering with the condition object can be done
> > as follows:
> >
> > 1. read the paths in chunks to keep the memory footprint small;
> > 2. parse the paths into a pandas dataframe;
> > 3. use condition.query(dataframe) to get the filtered dataframe of
> >    paths;
> > 4. use the numexpr backend for the dataframe query for efficiency;
> > 5. concatenate the filtered dataframes of all chunks.
> >
> > For usage details of the package, please see its documentation at:
> >
> > https://condition.readthedocs.io/en/latest/usage.html
> >
> > https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> >
> > What do you think? Your comments and suggestions are appreciated.
> >
> > A JIRA ticket has already been created:
> >
> > https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566
> >
> > Thank you,
> >
> > Weiyang (Bill)
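
For readers of the thread, here is a minimal sketch of what the example condition quoted above looks like in pyarrow's list-of-lists filter form, where the outer list is OR-ed and each inner list is AND-ed. It is not taken from the condition package or from pyarrow's own filtering code. The helper names (parse_partition_path, matches, filter_paths) and the sample paths are hypothetical, and the matching is done in plain Python only to illustrate the semantics.

```python
import operator

# (A <= 3 or B != 'b1') and C in ['c1', 'c2'], expanded to disjunctive
# normal form: the outer list is OR, each inner list is AND.
FILTERS = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]

# Comparison operators used in the filter tuples.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def parse_partition_path(path):
    """Hypothetical helper: 'A=1/B=b1/C=c1' -> {'A': 1, 'B': 'b1', 'C': 'c1'}."""
    keys = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # skip non-partition segments such as file names
        name, _, raw = segment.partition("=")
        # Assumption: all-digit values are integers, everything else is a string.
        keys[name] = int(raw) if raw.lstrip("-").isdigit() else raw
    return keys


def matches(keys, filters):
    """True if any AND-group in the filter is fully satisfied by the keys."""
    return any(
        all(OPS[op](keys[name], value) for name, op, value in group)
        for group in filters
    )


def filter_paths(paths, filters):
    """Keep only the paths whose partition keys satisfy the filter."""
    return [p for p in paths if matches(parse_partition_path(p), filters)]


sample_paths = [
    "A=1/B=b1/C=c1",
    "A=5/B=b1/C=c1",
    "A=5/B=b2/C=c2",
    "A=1/B=b2/C=c3",
]
selected = filter_paths(sample_paths, FILTERS)
```

With these sample paths, only the first and third satisfy the condition: the second fails both A <= 3 and B != 'b1', and the fourth fails the C membership test.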