> > If parquet stores statistics for each column of a struct array (don't know
> > offhand if they do) then we should create a JIRA to expose this.
It does store statistics per-leaf column.

On Wed, Apr 20, 2022 at 3:34 PM Weston Pace <[email protected]> wrote:

> No and no. This filter will not be used for predicate pushdown now or in
> 8.0.0. It could possibly come after 8.0.0. If parquet stores statistics
> for each column of a struct array (don't know offhand if they do) then we
> should create a JIRA to expose this.
>
> On Wed, Apr 20, 2022, 11:01 AM Partha Dutta <[email protected]>
> wrote:
>
>> That works! Thanks. Do you know off hand if this filter would be used in
>> a predicate pushdown for a parquet dataset? Or would it be possibly coming
>> in version 8.0.0?
>>
>> On Wed, Apr 20, 2022 at 3:49 PM Weston Pace <[email protected]>
>> wrote:
>>
>>> The second argument to `call_function` should be a list (the args to
>>> the function). Since `arr3` is iterable it is interpreting it as a
>>> list of args and trying to treat each row as an argument to your call
>>> (this is the reason it thinks you have 3 arguments). This should
>>> work:
>>>
>>>     pc.call_function("struct_field", [arr3],
>>>                      pc.StructFieldOptions(indices=[0]))
>>>
>>> Unfortunately, that evaluates the function immediately. If you want
>>> to create an expression then you need some way to create a call and I
>>> don't actually know how to do that. I can do something a little
>>> hackish:
>>>
>>>     table = pa.Table.from_pydict({'values': arr3})
>>>     dataset = ds.dataset(table)
>>>     sf_call = ds.field('')._call('struct_field', [ds.field('values')],
>>>                                  pc.StructFieldOptions(indices=[0]))
>>>     dataset.to_table(filter=sf_call < 200)
>>>
>>> However, I suspect there is probably a better way to create a call
>>> object than `ds.field('')._call(...)`
>>>
>>> On Wed, Apr 20, 2022 at 3:09 AM Partha Dutta <[email protected]>
>>> wrote:
>>> >
>>> > I'm trying to use the compute function struct_field in order to create
>>> > an expression for dataset filtering. But running into an error. This is
>>> > the code snippet:
>>> >
>>> >     arr1 = pa.array([100, 200, 300])
>>> >     arr2 = pa.array([400, 500, 600])
>>> >     arr3 = pa.StructArray.from_arrays([arr1, arr2], ["one", "two"])
>>> >     e = pc.call_function("struct_field", arr3, pc.StructFieldOptions(indices=[0])) > 200
>>> >
>>> >     Traceback (most recent call last):
>>> >       File "<stdin>", line 1, in <module>
>>> >       File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
>>> >       File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
>>> >       File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>>> >       File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>>> >     pyarrow.lib.ArrowInvalid: Function 'struct_field' accepts 1 arguments but attempted to look up kernel(s) with 3
>>> >
>>> > If I try to exclude the options, I get
>>> >
>>> >     pyarrow.lib.ArrowInvalid: Function 'struct_field' cannot be called without options
>>> >
>>> > Any advice? I am using pyarrow 7.0.0
>>> > --
>>> > Partha Dutta
>>> > [email protected]
>>>
>>
>>
>> --
>> Partha Dutta
>> [email protected]
>>
>
