I'll need to look at that. Right now I am recursively flattening the struct
and then using an Expression to filter. Any way to push this lower in the
stack would be more performant.
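For reference, a minimal sketch of that flatten-then-filter workaround (the column names and data are made up, and the filter here is a compute mask on the materialized table rather than a dataset Expression):

import pyarrow as pa
import pyarrow.compute as pc

# hypothetical table with a struct column "values" holding fields "one" and "two"
table = pa.table({
    "values": pa.StructArray.from_arrays(
        [pa.array([100, 200, 300]), pa.array([400, 500, 600])],
        ["one", "two"],
    )
})

# Table.flatten() promotes struct children to top-level columns
# named "values.one" and "values.two"
flat = table.flatten()

mask = pc.less(flat["values.one"], 200)
result = flat.filter(mask)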
On Thu, Apr 21, 2022, 6:02 PM David Li wrote:
Coincidentally there was a StackOverflow question about this recently too with
some answers outlining approaches for 7.0 and 8.0:
https://stackoverflow.com/questions/71945507/how-can-i-filter-or-select-sub-fields-of-structtype-columns-in-pyarrow
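For completeness, a rough sketch of the kind of pre-8.0 approach discussed there: materialize the table, extract the child with the struct_field kernel, and post-filter (the dataset path is hypothetical):

import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
table = dataset.to_table()

# struct_field pulls a child array out of the struct column "values";
# index 0 corresponds to the "one" field in this example
child = pc.struct_field(table["values"], indices=[0])
filtered = table.filter(pc.less(child, 200))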
On Thu, Apr 21, 2022, at 17:46, Weston Pace wrote:
Awesome. I've created ARROW-16275[1] to track this.
Also, I discovered that, starting with 8.0.0, we have support for
expressing nested references in Python, so you can write:
dataset.to_table(filter=ds.field("values", "one") < 200)
[1] https://issues.apache.org/jira/browse/ARROW-16275
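A slightly fuller sketch of that 8.0 nested reference, assuming a dataset whose schema has a struct column "values" with child fields "one" and "two" (the path is hypothetical):

import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")  # hypothetical path
# ds.field takes multiple names to reference a nested (struct) field
table = dataset.to_table(filter=ds.field("values", "one") < 200)

The filter is evaluated during the scan; as noted below, it is not used for parquet row-group pruning as of 8.0.0.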
No and no. This filter will not be used for predicate pushdown now or in
8.0.0. It could possibly come after 8.0.0. If parquet stores statistics
for each column of a struct array (I don't know offhand whether it does),
then we should create a JIRA to expose this.
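If someone wants to check whether a given parquet file does store statistics for the struct's leaf columns, the file metadata exposes them; a small sketch (file name hypothetical):

import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata  # hypothetical file
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    # struct leaves show up with paths like "values.one"
    print(col.path_in_schema, col.statistics)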
On Wed, Apr 20, 2022, 11:01 AM Partha wrote:
That works! Thanks. Do you know offhand if this filter would be used in a
predicate pushdown for a parquet dataset? Or would it possibly be coming in
version 8.0.0?
On Wed, Apr 20, 2022 at 3:49 PM Weston Pace wrote:
The second argument to `call_function` should be a list (the args to
the function). Since `arr3` is iterable, call_function interprets it as
the list of args and tries to treat each row as an argument to your call
(this is why it thinks you have 3 arguments). This should
work:
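Presumably the fix is to wrap arr3 in a list and pass the field selection through StructFieldOptions; a hedged sketch of what that call could look like:

import pyarrow.compute as pc

# wrap arr3 in a list so it is read as a single argument,
# and select the child field via options (index 0 is "one")
e = pc.call_function("struct_field", [arr3], pc.StructFieldOptions(indices=[0]))

# or, equivalently, the convenience wrapper
e = pc.struct_field(arr3, indices=[0])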
I'm trying to use the compute function struct_field in order to create an
expression for dataset filtering, but I am running into an error. This is the
code snippet:
import pyarrow as pa

arr1 = pa.array([100, 200, 300])
arr2 = pa.array([400, 500, 600])
# struct array with child fields "one" and "two"
arr3 = pa.StructArray.from_arrays([arr1, arr2], ["one", "two"])
e =