I'll need to look at that. Right now I am recursively flattening the struct
and then using an Expression to filter. Any way to push this lower in the
stack would be more performant.

On Thu, Apr 21, 2022, 6:02 PM David Li <[email protected]> wrote:

> Coincidentally there was a StackOverflow question about this recently too
> with some answers outlining approaches for 7.0 and 8.0:
>
>
> https://stackoverflow.com/questions/71945507/how-can-i-filter-or-select-sub-fields-of-structtype-columns-in-pyarrow
>
> On Thu, Apr 21, 2022, at 17:46, Weston Pace wrote:
> > Awesome.  I've created ARROW-16275[1] to track this.
> >
> > Also, I discovered that, starting with 8.0.0, we have support for
> > expressing nested references in python so you can write:
> >
> >     dataset.to_table(filter=ds.field("values", "one") < 200)
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-16275
> >
> > On Thu, Apr 21, 2022 at 6:44 AM Micah Kornfield <[email protected]>
> wrote:
> >>>
> >>> If parquet stores statistics for each column of a struct array (don't
> know offhand if they do) then we should create a JIRA to expose this.
> >>
> >>
> >> It does store statistics per-leaf column.
> >>
> >> On Wed, Apr 20, 2022 at 3:34 PM Weston Pace <[email protected]>
> wrote:
> >>>
> >>> No and no.  This filter will not be used for predicate pushdown now or
> in 8.0.0.  It could possibly come after 8.0.0.  If parquet stores
> statistics for each column of a struct array (don't know offhand if they
> do) then we should create a JIRA to expose this.
> >>>
> >>> On Wed, Apr 20, 2022, 11:01 AM Partha Dutta <[email protected]>
> wrote:
> >>>>
> >>>> That works! Thanks. Do you know offhand if this filter would be used
> >>>> in a predicate pushdown for a parquet dataset? Or might that be
> >>>> coming in version 8.0.0?
> >>>>
> >>>> On Wed, Apr 20, 2022 at 3:49 PM Weston Pace <[email protected]>
> wrote:
> >>>>>
> >>>>> The second argument to `call_function` should be a list (the args to
> >>>>> the function).  Since `arr3` is iterable it is interpreting it as a
> >>>>> list of args and trying to treat each row as an argument to your call
> >>>>> (this is the reason it thinks you have 3 arguments).  This should
> >>>>> work:
> >>>>>
> >>>>>     pc.call_function("struct_field", [arr3], pc.StructFieldOptions(indices=[0]))
> >>>>>
> >>>>> Unfortunately, that evaluates the function immediately.  If you want
> >>>>> to create an expression then you need some way to create a call and I
> >>>>> don't actually know how to do that.  I can do something a little
> >>>>> hackish:
> >>>>>
> >>>>> table = pa.Table.from_pydict({'values': arr3})
> >>>>> dataset = ds.dataset(table)
> >>>>> sf_call = ds.field('')._call('struct_field', [ds.field('values')],
> >>>>> pc.StructFieldOptions(indices=[0]))
> >>>>> dataset.to_table(filter=sf_call < 200)
> >>>>>
> >>>>> However, I suspect there is probably a better way to create a call
> >>>>> object than `ds.field('')._call(...)`
> >>>>>
> >>>>> On Wed, Apr 20, 2022 at 3:09 AM Partha Dutta <[email protected]>
> wrote:
> >>>>> >
> >>>>> > I'm trying to use the compute function struct_field to create an
> >>>>> > expression for dataset filtering, but I'm running into an error.
> >>>>> > This is the code snippet:
> >>>>> >
> >>>>> > arr1 = pa.array([100, 200, 300])
> >>>>> > arr2 = pa.array([400, 500, 600])
> >>>>> > arr3 = pa.StructArray.from_arrays([arr1, arr2], ["one", "two"])
> >>>>> > e = pc.call_function("struct_field", arr3, pc.StructFieldOptions(indices=[0])) > 200
> >>>>> > Traceback (most recent call last):
> >>>>> >   File "<stdin>", line 1, in <module>
> >>>>> >   File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
> >>>>> >   File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
> >>>>> >   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
> >>>>> >   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> >>>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' accepts 1 arguments but attempted to look up kernel(s) with 3
> >>>>> >
> >>>>> > If I try to exclude the options, I get
> >>>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' cannot be called without options
> >>>>> >
> >>>>> > Any advice? I am using pyarrow 7.0.0
> >>>>> > --
> >>>>> > Partha Dutta
> >>>>> > [email protected]
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Partha Dutta
> >>>> [email protected]
>
