We could add a section to the docs that points out some ecosystem add-on projects.

On Thu, Feb 18, 2021 at 10:03 PM Bill Zhao <[email protected]> wrote:
>
> Hi Micah,
>
> Thank you for looking into this matter.
>
> I understand your goal of keeping dependencies minimal and of solving
> the problem in C++ for multi-language support. Given that, we cannot
> switch to the condition package as I proposed.
>
> However, I had a difficult time making partition filtering work in the
> beginning. I actually spent time fixing a bug and also enhanced the
> documentation, as in https://issues.apache.org/jira/browse/ARROW-10574.
> I think the condition package can help alleviate the pain for Python
> users.
>
> How about mentioning the condition package in the documentation as a
> tool to build pyarrow filters? This way it does not change the pyarrow
> code base at all, but lets users choose whether to use it.
>
> Thanks,
>
> Weiyang
>
>
> On Wed, Feb 17, 2021 at 8:48 PM Micah Kornfield <[email protected]> wrote:
>
> > Hi Weiyang,
> > The library looks interesting, and for Python it certainly seems like
> > it might add a better user experience.
> >
> > I'm not super active in Python maintenance (others who are can
> > hopefully chime in). But my impression is we try to keep dependencies
> > minimal in general.
> >
> > Furthermore, the goal of the C++ library and associated bindings is
> > to push as much work down into C++ (ultimately filtering capabilities
> > equivalent to Pandas will be built) so that all languages can take
> > advantage of the same core code.
> >
> > -Micah
> >
> >
> > On Sun, Feb 14, 2021 at 10:09 PM Bill Zhao <[email protected]> wrote:
> >
> > > Hi Dev team,
> > >
> > > I created a PyPI package that allows user-friendly expression of
> > > conditions. For example, a condition can be written as:
> > >
> > >     (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
> > >
> > > where A, B, C are partition keys, and f.C == ['c1', 'c2'] means
> > > f.C in ['c1', 'c2'].
> > >
> > > Arbitrary condition objects can be converted to pyarrow's filters
> > > by calling the to_pyarrow_filter() method, which normalizes the
> > > condition to conform to the pyarrow filter specification. The
> > > filter can also be converted back to a condition object.
> > >
> > > We can therefore accept a condition object directly as the filter
> > > parameter in the read_table() and ParquetDataset() APIs, as a
> > > user-friendly way to create the conditions.
> > >
> > > Furthermore, the condition object can be used directly to filter
> > > partition paths. This can replace the current complex filtering
> > > code (both native and Python).
> > >
> > > For maximum efficiency, filtering with the condition object can be
> > > done as follows:
> > >
> > > 1. read the paths in chunks to keep the memory footprint small;
> > > 2. parse the paths into a pandas dataframe;
> > > 3. use condition.query(dataframe) to get the filtered dataframe of
> > >    paths;
> > > 4. use the numexpr backend for the dataframe query for efficiency;
> > > 5. concat the filtered dataframes of all chunks.
> > >
> > > For usage details of the package, please see its documentation at:
> > >
> > > https://condition.readthedocs.io/en/latest/usage.html
> > >
> > > https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
> > >
> > > What do you think? Your comments and suggestions are appreciated.
> > >
> > > A JIRA ticket has already been created:
> > >
> > > https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566
> > >
> > > Thank you,
> > >
> > > Weiyang (Bill)
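
For readers following the thread, here is a minimal sketch of what the example condition quoted above looks like once expanded into the list-of-lists filter format that pyarrow documents for read_table() (outer list OR-ed, inner lists AND-ed). It is plain Python and uses neither pyarrow nor the condition package; the names OPS, FILTERS, parse_partition_path and matches, and the sample paths, are hypothetical and exist only for illustration. It is not how pyarrow or the condition package implements filtering.

```python
import operator

# Operators that may appear in a (column, op, value) filter tuple.
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}

# (A <= 3 or B != 'b1') and C in ['c1', 'c2'], expanded by hand to
# disjunctive normal form: outer list = OR, inner lists = AND.
FILTERS = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]


def parse_partition_path(path):
    """Turn a hive-style path like 'A=1/B=b1/C=c1' into a dict of keys."""
    keys = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # skip segments that are not key=value pairs
        name, value = segment.split("=", 1)
        try:
            keys[name] = int(value)  # assumption: integer-looking values are ints
        except ValueError:
            keys[name] = value
    return keys


def matches(filters, keys):
    """Return True if any AND-group in `filters` holds for the partition keys."""
    return any(
        all(OPS[op](keys[col], val) for col, op, val in group)
        for group in filters
    )


# Made-up partition paths, for illustration only.
paths = ["A=1/B=b1/C=c1", "A=5/B=b1/C=c1", "A=5/B=b2/C=c2", "A=1/B=b2/C=c3"]
selected = [p for p in paths if matches(FILTERS, parse_partition_path(p))]
print(selected)  # ['A=1/B=b1/C=c1', 'A=5/B=b2/C=c2']
```

The hand expansion into two AND-groups is the part a helper such as the proposed to_pyarrow_filter() would presumably automate; the evaluation loop only shows how such a filter selects partition paths.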
