Joshua Pedrick created ARROW-7678:
-------------------------------------

             Summary: [C++][Parquet] setting TZ= in environment on Linux causes broken parquet
                 Key: ARROW-7678
                 URL: https://issues.apache.org/jira/browse/ARROW-7678
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.15.1
         Environment: Linux, Ubuntu 18.04, arrow/parquet 0.15.1 from instructions https://arrow.apache.org/install/
            Reporter: Joshua Pedrick

When I set TZ=CST-8, or another time zone, in the environment on Linux, the time and timestamp columns in my resulting Parquet file are corrupted. Below are the calls I use to define my schema:

{code:java}
PrimitiveNode::Make(
    columnName, Repetition::REQUIRED,
    LogicalType::Timestamp( true, LogicalType::TimeUnit::MICROS, false, false ),
    ::parquet::Type::INT64 ) );

PrimitiveNode::Make(
    columnName, repetition,
    LogicalType::Time( true, LogicalType::TimeUnit::MICROS ),
    ::parquet::Type::INT64 ) );
{code}

I use an Int64Writer for both types. When reading the file, in this case using pandas with pyarrow, but also in C++, I get the following exception:

{code:java}
  File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
{code}

It seems as if the column header is defining a timestamp with a time zone even though I manually set is_adjusted_to_utc.
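To help narrow this down, here is a minimal standard-library sketch of what setting TZ changes inside a process on Linux. It is not a diagnosis of the failure above and does not touch the Parquet API. The helper names (SetProcessTimeZone, ToUtcMicros, LocalHour, UtcHour) are hypothetical and exist only for this illustration. Under POSIX rules, "CST-8" names a zone 8 hours east of UTC, so only local-time conversions should shift, while microseconds since the epoch (the kind of value an INT64 timestamp column with is_adjusted_to_utc set would hold) should not.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <ctime>

// Hypothetical helper: set TZ for this process and make libc re-read it.
// POSIX only (setenv/tzset are not part of ISO C++).
inline void SetProcessTimeZone(const char* tz) {
  setenv("TZ", tz, 1);  // 1 = overwrite an existing value
  tzset();
}

// Microseconds since the Unix epoch for a system_clock time point.
// This conversion never consults TZ.
inline int64_t ToUtcMicros(std::chrono::system_clock::time_point tp) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             tp.time_since_epoch())
      .count();
}

// Hour of day for `t` as reported by localtime_r, which depends on TZ.
inline int LocalHour(std::time_t t) {
  std::tm out{};
  localtime_r(&t, &out);
  return out.tm_hour;
}

// Hour of day for `t` as reported by gmtime_r, which ignores TZ.
inline int UtcHour(std::time_t t) {
  std::tm out{};
  gmtime_r(&t, &out);
  return out.tm_hour;
}
```

Calling SetProcessTimeZone("CST-8") and then comparing LocalHour(0) with UtcHour(0) shows the 8-hour shift in local time only. If the values handed to the Int64Writer are produced the way ToUtcMicros does it, they should be identical with and without TZ set, which would point away from the column values and toward how the file itself is written.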