Awesome.  I've created ARROW-16275[1] to track this.

Also, I discovered that, starting with 8.0.0, we have support for
expressing nested references in Python, so you can write:

    dataset.to_table(filter=ds.field("values", "one") < 200)

[1] https://issues.apache.org/jira/browse/ARROW-16275

On Thu, Apr 21, 2022 at 6:44 AM Micah Kornfield <[email protected]> wrote:
>>
>> If Parquet stores statistics for each column of a struct array (I don't know
>> offhand if it does) then we should create a JIRA to expose this.
>
>
> It does store statistics per-leaf column.
>
> On Wed, Apr 20, 2022 at 3:34 PM Weston Pace <[email protected]> wrote:
>>
>> No and no.  This filter will not be used for predicate pushdown now or in
>> 8.0.0.  It could possibly come after 8.0.0.  If Parquet stores statistics
>> for each column of a struct array (I don't know offhand if it does) then we
>> should create a JIRA to expose this.
>>
>> On Wed, Apr 20, 2022, 11:01 AM Partha Dutta <[email protected]> wrote:
>>>
>>> That works! Thanks. Do you know offhand whether this filter would be used
>>> for predicate pushdown on a Parquet dataset? Or might that be coming in
>>> version 8.0.0?
>>>
>>> On Wed, Apr 20, 2022 at 3:49 PM Weston Pace <[email protected]> wrote:
>>>>
>>>> The second argument to `call_function` should be a list (the args to
>>>> the function).  Since `arr3` is iterable, `call_function` interprets it
>>>> as a list of args and tries to treat each row as a separate argument to
>>>> your call (this is the reason it thinks you have 3 arguments).  This
>>>> should work:
>>>>
>>>>     pc.call_function("struct_field", [arr3],
>>>>                      pc.StructFieldOptions(indices=[0]))
>>>>
>>>> Unfortunately, that evaluates the function immediately.  If you want
>>>> to create an expression then you need some way to create a call and I
>>>> don't actually know how to do that.  I can do something a little
>>>> hackish:
>>>>
>>>>     table = pa.Table.from_pydict({'values': arr3})
>>>>     dataset = ds.dataset(table)
>>>>     sf_call = ds.field('')._call('struct_field', [ds.field('values')],
>>>>                                  pc.StructFieldOptions(indices=[0]))
>>>>     dataset.to_table(filter=sf_call < 200)
>>>>
>>>> However, I suspect there is probably a better way to create a call
>>>> object than `ds.field('')._call(...)`
>>>>
>>>> On Wed, Apr 20, 2022 at 3:09 AM Partha Dutta <[email protected]> wrote:
>>>> >
>>>> > I'm trying to use the compute function struct_field to create an
>>>> > expression for dataset filtering, but I'm running into an error. This
>>>> > is the code snippet:
>>>> >
>>>> > arr1 = pa.array([100, 200, 300])
>>>> > arr2 = pa.array([400, 500, 600])
>>>> > arr3 = pa.StructArray.from_arrays([arr1, arr2], ["one", "two"])
>>>> > e = pc.call_function("struct_field", arr3,
>>>> >                      pc.StructFieldOptions(indices=[0])) > 200
>>>> > Traceback (most recent call last):
>>>> >   File "<stdin>", line 1, in <module>
>>>> >   File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
>>>> >   File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
>>>> >   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>>>> >   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' accepts 1 arguments but attempted to look up kernel(s) with 3
>>>> >
>>>> > If I try to exclude the options, I get:
>>>> > pyarrow.lib.ArrowInvalid: Function 'struct_field' cannot be called without options
>>>> >
>>>> > Any advice? I am using pyarrow 7.0.0
>>>> > --
>>>> > Partha Dutta
>>>> > [email protected]
>>>
>>>
>>>
>>> --
>>> Partha Dutta
>>> [email protected]
