[ https://issues.apache.org/jira/browse/ARROW-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526407#comment-17526407 ]
Alenka Frim commented on ARROW-14596: ------------------------------------- I would like to add observations we got today when pairing with [~jorisvandenbossche] on this topic. First was the result of using {{pq.read_table}} with legacy implementation vs using {{ds.dataset}} with column projection. The data can get selected correctly with the dataset implementation but what happens is that the structure of a nested field is not kept (from struct it is flattened to string column). In case of using columns selection with a list in {{{}ds.dataset{}}}, it errors, as reported in the issue. {code:python} >>> import pandas as pd >>> import pyarrow as pa >>> import pyarrow.parquet as pq >>> >>> df = pd.DataFrame({ ... 'user_id': ['abc123', 'qrs456'], ... 'interaction': [{'type': 'click', 'element': 'button'}, {'type':'scroll', 'element': 'window'}] ... }) >>> >>> table = pa.Table.from_pandas(df) >>> pq.write_table(table, 'example.parquet') {code} {code:python} >>> pq.read_table('example.parquet', columns = ['user_id', 'interaction.type'], >>> use_legacy_dataset = True) pyarrow.Table user_id: string interaction: struct<type: string> child 0, type: string ---- user_id: [["abc123","qrs456"]] interaction: [ -- is_valid: all not null -- child 0 type: string ["click","scroll"]] {code} {code:python} >>> import pyarrow.dataset as ds >>> projection = { ... 'user_id': ds.field('user_id'), ... 'new': ds.field(('interaction', 'type')) ... } >>> ds.dataset('example.parquet').to_table(columns=projection) pyarrow.Table user_id: string new: string ---- user_id: [["abc123","qrs456"]] new: [["click","scroll"]] {code} {code:python} >>> ds.dataset('example.parquet').to_table(columns=['user_id', >>> 'interaction.type']) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/_dataset.pyx", line 303, in pyarrow._dataset.Dataset.to_table return self.scanner(**kwargs).to_table() File "pyarrow/_dataset.pyx", line 270, in pyarrow._dataset.Dataset.scanner return Scanner.from_dataset(self, **kwargs) File "pyarrow/_dataset.pyx", line 2322, in pyarrow._dataset.Scanner.from_dataset _populate_builder(builder, columns=columns, filter=filter, File "pyarrow/_dataset.pyx", line 2168, in pyarrow._dataset._populate_builder check_status(builder.ProjectColumns([tobytes(c) for c in columns])) File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status raise ArrowInvalid(message) pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(interaction.type) in user_id: string interaction: struct<element: string, type: string> __fragment_index: int32 __batch_index: int32 __last_in_fragment: bool __filename: string /Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1722 CheckNonEmpty(matches, root) /Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1757 FindOne(root) /Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:714 ref->GetOne(dataset_schema) /Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:784 ProjectionDescr::FromNames(std::move(columns), *scan_options_->dataset_schema) {code} When Scanner object is being created from the dataset class via {{to_table}} and (through _populate_builder) and in the case of a list of columns the {{ProjectColumns}} method ("arrow::dataset::ScannerBuilder") is being called it only accepts string column names and errors when a column is a struct. We were thinking if it would be a good idea to add a new method in {{scanner.cc}} that would mimic {{FromNames}} method but takes {{field_ref}} as an argument? Afterwords there would also be a need to recreate a struct field for which we are not sure how to approach. cc [~westonpace] [~apitrou] do you think that would be a correct way to go? > [Python] parquet.read_table nested fields in columns does not work for > use_legacy_dataset=False > ----------------------------------------------------------------------------------------------- > > Key: ARROW-14596 > URL: https://issues.apache.org/jira/browse/ARROW-14596 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Reporter: Tom Scheffers > Assignee: Alenka Frim > Priority: Critical > Fix For: 9.0.0 > > > Reading nested field does not work with use_legacy_dataset=False. > This works: > > {code:java} > import pyarrow.parquet as pq > t = pq.read_table( > source=*filename*, > columns=['store_key', 'properties.country'], > use_legacy_dataset=True, > ).to_pandas() > {code} > This does not work (for the same parquet file): > > {code:java} > import pyarrow.parquet as pq > t = pq.read_table( > source=*filename*, > columns=['store_key', 'properties.country'], > use_legacy_dataset=False, > ).to_pandas(){code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)