[ 
https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Burke updated ARROW-11269:
------------------------------
    Description: 
The issue seems to stem from the new(-ish) behavior of the Arrow Parquet reader, 
where the embedded Arrow schema is used instead of deriving the schema from the 
Parquet columns.

 

However, some code paths still seem to derive the schema type from the column 
types, leading the Arrow record batch reader to error out because the column 
types must match the schema types.

 

In our case, the column type is an int96 datetime (ns) type, and the Arrow type 
in the embedded schema is DataType::Timestamp(TimeUnit::Nanoseconds, 
Some("UTC")). However, the code that constructs the arrays re-derives this 
column type as DataType::Timestamp(TimeUnit::Nanoseconds, None), because the 
Parquet schema carries no timezone information. As a result, Parquet files that 
we could read successfully with our branch of Arrow circa October are now 
unreadable.
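A minimal, self-contained sketch of the failing comparison. Note these are simplified stand-in enums defined locally for illustration, not the actual `arrow` crate definitions:

```rust
// Simplified stand-ins for the arrow crate's TimeUnit / DataType enums;
// hypothetical local types used only to model the comparison that fails.
#[derive(Debug, PartialEq)]
enum TimeUnit {
    Nanoseconds,
}

#[derive(Debug, PartialEq)]
enum DataType {
    Timestamp(TimeUnit, Option<String>),
}

// Type recorded in the embedded Arrow schema (timezone present).
fn embedded_schema_type() -> DataType {
    DataType::Timestamp(TimeUnit::Nanoseconds, Some("UTC".to_string()))
}

// Type re-derived from the int96 Parquet column; Parquet carries no
// timezone information, so the timezone comes back as None.
fn derived_column_type() -> DataType {
    DataType::Timestamp(TimeUnit::Nanoseconds, None)
}

fn main() {
    // The record batch reader effectively performs an equality check like
    // this one; the two types differ only in timezone, yet compare as
    // unequal, so the read errors out.
    assert_ne!(embedded_schema_type(), derived_column_type());
    println!("schema type != derived column type (timezone mismatch)");
}
```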

 

I've attached an example of a Parquet file that demonstrates the problem. This 
file was created in Python (as most of our Parquet files are).

I've also attached a sample Rust program that demonstrates the error.



> [Rust] Unable to read Parquet file because of mismatch in column-derived and 
> embedded schemas
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11269
>                 URL: https://issues.apache.org/jira/browse/ARROW-11269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>    Affects Versions: 3.0.0
>            Reporter: Max Burke
>            Priority: Blocker
>         Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
