Hey Tim,

We're still wiring up all the work needed for nested field refs in general (see 
ARROW-14658 [1]), and we haven't yet listed which kinds of references we want to 
support. I would say we want to support what Substrait supports [2]; the 
behavior you describe here appears to correspond to the "masked complex 
expression" references there. That said, the way it ultimately gets 
implemented/exposed may be different. 

For now, you will have to read the column and then postprocess it yourself 
(this will require you to manually decompose the ListArray/StructArray and 
reconstruct the ListArray - I can work out an example if that would help).

By the way, thank you for the example here. It reminds me that we should 
likely also support pushing down the projection so that we only load the 
necessary leaf nodes in Parquet.

[1]: https://issues.apache.org/jira/browse/ARROW-14658
[2]: https://substrait.io/expressions/field_references/#masked-complex-expression

Best,
David

On Tue, Nov 9, 2021, at 15:45, Tim Nicolson wrote:
> Hi, 
> 
> I have a parquet dataset containing "order" structs each of which has a list 
> of "item" structs.  I would like to read a subset of the item structs. e.g.
> 
> order_id: int64
> ...other fields...
> items: list<item: struct<item_id: int64, price: int64, ...other fields...>>
> 
> # is this/will this be possible?
> dataset.to_table(columns=["order_id", "items.item_id", "items.price"])
> 
> I guess they'd be lists of scalars rather than a list of structs with fewer 
> fields?
> 
> I couldn't see any reference to *lists* in 
> https://github.com/apache/arrow/pull/11466. 
> 
> Is this possible or planned?  Is there another way to achieve this?
> 
> Thanks in advance, 
> 
> Tim
