[ https://issues.apache.org/jira/browse/ARROW-6844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Rostykus updated ARROW-6844:
-----------------------------------
    Attachment: dbg_sample2.gz.parquet

> List<scalar type> columns read broken with 0.15.0
> -------------------------------------------------
>
>                 Key: ARROW-6844
>                 URL: https://issues.apache.org/jira/browse/ARROW-6844
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.0
>            Reporter: Benoit Rostykus
>            Priority: Major
>              Labels: parquet
>         Attachments: dbg_sample.gz.parquet, dbg_sample2.gz.parquet
>
>
> Columns of type {{array<primitive type>}} (such as `array<int32>`,
> `array<int64>`...) are not readable anymore using {{pyarrow == 0.15.0}} (but
> were with {{pyarrow == 0.14.1}}) when the original writer of the parquet file
> is {{parquet-mr 1.9.1}}.
> {code}
> import pyarrow.parquet as pq
> pf = pq.ParquetFile('sample.gz.parquet')
> print(pf.read(columns=['profile_ids']))
> {code}
> with 0.14.1:
> {code}
> pyarrow.Table
> profile_ids: list<element: int64>
>   child 0, element: int64
> ...
> {code}
> with 0.15.0:
> {code}
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 253, in read
>     use_threads=use_threads)
>   File "pyarrow/_parquet.pyx", line 1131, in pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Column data for field 0 with type list<item: int64> is inconsistent with schema list<element: int64>
> {code}
> I've tested parquet files coming from multiple tables (with various schemas)
> created with `parquet-mr`, couldn't read any `array<primitive type>` column
> anymore.
>
> I _think_ the bug was introduced with [this
> commit|[https://github.com/apache/arrow/commit/06fd2da5e8e71b660e6eea4b7702ca175e31f3f5]].
> I think the root of the issue is that `parquet-mr` writes the inner struct
> name as `"element"` by default (see
> [here|[https://github.com/apache/parquet-mr/blob/b4198be200e7e2df82bc9a18d54c8cd16aa156ac/parquet-column/src/main/java/org/apache/parquet/schema/ConversionPatterns.java#L33]]),
> whereas `parquet-cpp` (or `pyarrow`?) assumes `"item"` (see for example
> [this
> test|[https://github.com/apache/arrow/blob/c805b5fadb548925c915e0e130d6ed03c95d1398/python/pyarrow/tests/test_schema.py#L74]]).
> The round-trip tests, which write and read with pyarrow only, obviously
> won't catch this.
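
The suspected mismatch between `"element"` and `"item"` can be illustrated with a small standard-library sketch. It is a toy model under stated assumptions, not Arrow's or parquet-cpp's actual code: `ListType`, `strict_equal`, `lenient_equal` and `check_column` are hypothetical names made up for this illustration. It only shows how a name-sensitive type comparison would reject a column whose value type matches but whose inner field name differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListType:
    """Toy stand-in for a list type: child field name plus value type."""
    child_name: str
    value_type: str

    def __str__(self) -> str:
        return f"list<{self.child_name}: {self.value_type}>"


def strict_equal(a: ListType, b: ListType) -> bool:
    """Name-sensitive: the child field name counts as part of the type."""
    return a == b


def lenient_equal(a: ListType, b: ListType) -> bool:
    """Name-insensitive: only the value type has to match."""
    return a.value_type == b.value_type


def check_column(column: ListType, schema: ListType, equal) -> None:
    """Raise if the decoded column type does not match the schema type."""
    if not equal(column, schema):
        raise ValueError(
            f"Column data with type {column} "
            f"is inconsistent with schema {schema}"
        )


schema_type = ListType("element", "int64")  # inner name written by parquet-mr
column_type = ListType("item", "int64")     # inner name assumed by the reader

# A comparison that ignores the inner name accepts the column.
check_column(column_type, schema_type, lenient_equal)

# A comparison that includes the inner name rejects it.
try:
    check_column(column_type, schema_type, strict_equal)
except ValueError as exc:
    message = str(exc)
    print(message)
```

In this model the files written and read by the same library never hit the error, because both sides use the same inner name, which is consistent with the observation that pyarrow-only round-trip tests would not catch the problem.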