[ https://issues.apache.org/jira/browse/ARROW-11269 ]
Max Burke updated ARROW-11269:
------------------------------
    Attachment: main.rs

> [Rust] Unable to read Parquet file because of mismatch in column-derived and embedded schemas
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11269
>                 URL: https://issues.apache.org/jira/browse/ARROW-11269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>    Affects Versions: 3.0.0
>            Reporter: Max Burke
>            Priority: Blocker
>         Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
> The issue seems to stem from the new(-ish) behavior of the Arrow Parquet reader, where the embedded Arrow schema is used instead of deriving the schema from the Parquet columns.
>
> However, some code paths still derive the Arrow type from the Parquet column type, so the Arrow record batch reader errors out because the column types must match the schema types.
>
> In our case, the column is an int96 datetime (ns), and the Arrow type in the embedded schema is DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC")). However, the code that constructs the arrays re-derives the column type as DataType::Timestamp(TimeUnit::Nanosecond, None), because the Parquet schema carries no timezone information. As a result, Parquet files that we could read successfully with our branch of Arrow circa October are now unreadable.
>
> I've attached an example Parquet file that demonstrates the problem. The file was created in Python (as most of our Parquet files are).
>
> I've also attached a sample Rust program (main.rs) that demonstrates the error.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
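The attached main.rs is not reproduced in this message. A minimal sketch of the kind of reader program that hits this mismatch, assuming the parquet 3.0 crate API, a hypothetical batch size, and the attached file's name used as a local path, might look like the following:

{code:rust}
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical local path to the attached Parquet file.
    let file = File::open("0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet")?;
    let file_reader = SerializedFileReader::new(file)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));

    // Per the report, this schema comes from the embedded Arrow metadata and
    // carries Timestamp(Nanosecond, Some("UTC")) for the datetime column.
    println!("{:?}", arrow_reader.get_schema()?);

    // Per the report, building record batches re-derives
    // Timestamp(Nanosecond, None) from the int96 Parquet column, which does
    // not match the schema above, so iteration yields an error.
    let batch_reader = arrow_reader.get_record_reader(1024)?;
    for batch in batch_reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
{code}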