I'll need to look at that. Right now I am recursively flattening the struct and then using an Expression to filter. Anything that pushes this lower in the stack would be more performant.
On Thu, Apr 21, 2022, 6:02 PM David Li <[email protected]> wrote:

> Coincidentally there was a StackOverflow question about this recently too
> with some answers outlining approaches for 7.0 and 8.0:
>
> https://stackoverflow.com/questions/71945507/how-can-i-filter-or-select-sub-fields-of-structtype-columns-in-pyarrow
>
> On Thu, Apr 21, 2022, at 17:46, Weston Pace wrote:
> > Awesome. I've created ARROW-16275[1] to track this.
> >
> > Also, I discovered that, starting with 8.0.0, we have support for
> > expressing nested references in python so you can write:
> >
> > dataset.to_table(filter=ds.field("values", "one") < 200)
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-16275
> >
> > On Thu, Apr 21, 2022 at 6:44 AM Micah Kornfield <[email protected]> wrote:
> >>>
> >>> If parquet stores statistics for each column of a struct array (don't know offhand if they do) then we should create a JIRA to expose this.
> >>
> >> It does store statistics per-leaf column.
> >>
> >> On Wed, Apr 20, 2022 at 3:34 PM Weston Pace <[email protected]> wrote:
> >>>
> >>> No and no. This filter will not be used for predicate pushdown now or in 8.0.0. It could possibly come after 8.0.0. If parquet stores statistics for each column of a struct array (don't know offhand if they do) then we should create a JIRA to expose this.
> >>>
> >>> On Wed, Apr 20, 2022, 11:01 AM Partha Dutta <[email protected]> wrote:
> >>>>
> >>>> That works! Thanks. Do you know off hand if this filter would be used in a predicate pushdown for a parquet dataset? Or would it be possibly coming in version 8.0.0?
> >>>>
> >>>> On Wed, Apr 20, 2022 at 3:49 PM Weston Pace <[email protected]> wrote:
> >>>>>
> >>>>> The second argument to `call_function` should be a list (the args to
> >>>>> the function). Since `arr3` is iterable it is interpreting it as a
> >>>>> list of args and trying to treat each row as an argument to your call
> >>>>> (this is the reason it thinks you have 3 arguments). This should
> >>>>> work:
> >>>>>
> >>>>> pc.call_function("struct_field", [arr3], pc.StructFieldOptions(indices=[0]))
> >>>>>
> >>>>> Unfortunately, that evaluates the function immediately. If you want
> >>>>> to create an expression then you need some way to create a call and I
> >>>>> don't actually know how to do that. I can do something a little
> >>>>> hackish:
> >>>>>
> >>>>> table = pa.Table.from_pydict({'values': arr3})
> >>>>> dataset = ds.dataset(table)
> >>>>> sf_call = ds.field('')._call('struct_field', [ds.field('values')],
> >>>>> pc.StructFieldOptions(indices=[0]))
> >>>>> dataset.to_table(filter=sf_call < 200)
> >>>>>
> >>>>> However, I suspect there is probably a better way to create a call
> >>>>> object than `ds.field('')._call(...)`
> >>>>>
> >>>>> On Wed, Apr 20, 2022 at 3:09 AM Partha Dutta <[email protected]> wrote:
> >>>>> >
> >>>>> > I'm trying to use the compute function struct_field in order to create an expression for dataset filtering. But running into an error. This is the code snippet:
> >>>>> >
> >>>>> > arr1 = pa.array([100, 200, 300])
> >>>>> > arr2 = pa.array([400, 500, 600])
> >>>>> > arr3 = pa.StructArray.from_arrays([arr1, arr2], ["one", "two"])
> >>>>> > e = pc.call_function("struct_field", arr3, pc.StructFieldOptions(indices=[0])) > 200
> >>>>> > Traceback (most recent call last):
> >>>>> > File "<stdin>", line 1, in <module>
> >>>>> > File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
> >>>>> > File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
> >>>>> > File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
> >>>>> > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> >>>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' accepts 1 arguments but attempted to look up kernel(s) with 3
> >>>>> >
> >>>>> > If I try to exclude the options, I get
> >>>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' cannot be called without options
> >>>>> >
> >>>>> > Any advice? I am using pyarrow 7.0.0
> >>>>> > --
> >>>>> > Partha Dutta
> >>>>> > [email protected]
> >>>>
> >>>> --
> >>>> Partha Dutta
> >>>> [email protected]
