Joshua Pedrick created ARROW-7678:
-------------------------------------
Summary: [C++][Parquet] setting TZ= in environment on Linux causes
broken parquet
Key: ARROW-7678
URL: https://issues.apache.org/jira/browse/ARROW-7678
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.15.1
Environment: Linux, Ubuntu 18.04, arrow/parquet 0.15.1 from
instructions https://arrow.apache.org/install/
Reporter: Joshua Pedrick
When I set TZ=CST-8 (or another timezone) in the environment on Linux, the time
columns in my resulting Parquet file are corrupted.
Below are the calls I use to define my schema:
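{code:java}
// Timestamp column: first argument is is_adjusted_to_utc, microsecond precision.
PrimitiveNode::Make( columnName, Repetition::REQUIRED,
                     LogicalType::Timestamp( true, LogicalType::TimeUnit::MICROS, false, false ),
                     ::parquet::Type::INT64 );
// Time-of-day column: first argument is is_adjusted_to_utc, microsecond precision.
PrimitiveNode::Make( columnName, repetition,
                     LogicalType::Time( true, LogicalType::TimeUnit::MICROS ),
                     ::parquet::Type::INT64 );
{code}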
I use an Int64Writer for both types. Reading the file back, with pandas via
pyarrow or directly in C++, raises the following exception:
{code:java}
File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.
{code}
It seems as if the column header ends up defining a timestamp with a timezone,
even though I set is_adjusted_to_utc manually.
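
When narrowing this down, it may help to rule out the int64 values handed to the Int64Writer as a source of TZ dependence. The following is only a sketch under assumptions: the helper names TimestampMicrosUtc and TimeOfDayMicrosUtc are hypothetical and not part of the Arrow or Parquet API, and it assumes std::chrono::system_clock counts from the Unix epoch (guaranteed from C++20, and the usual behaviour before that). It derives microsecond values without any local-time conversion, so the TZ variable is never consulted.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: microseconds since 1970-01-01T00:00:00Z for a time point.
// Assumes system_clock's epoch is the Unix epoch. No localtime()/mktime() call
// is made, so the result does not depend on the TZ environment variable.
inline std::int64_t TimestampMicrosUtc(std::chrono::system_clock::time_point tp) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             tp.time_since_epoch())
      .count();
}

// Hypothetical helper: microseconds since midnight UTC for the same instant,
// i.e. the kind of value a TIME(MICROS) column stores.
inline std::int64_t TimeOfDayMicrosUtc(std::int64_t timestamp_micros) {
  constexpr std::int64_t kMicrosPerDay = 86400LL * 1000000LL;
  std::int64_t rem = timestamp_micros % kMicrosPerDay;
  if (rem < 0) rem += kMicrosPerDay;  // keep pre-1970 instants in [0, one day)
  assert(rem >= 0 && rem < kMicrosPerDay);
  return rem;
}
```

If values computed this way still produce an unreadable file when TZ is set, the inputs can be excluded and the problem is more likely in how the file itself is written.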