Joshua Pedrick created ARROW-7678:
-------------------------------------

             Summary: [C++][Parquet] setting TZ= in environment on Linux causes 
broken parquet
                 Key: ARROW-7678
                 URL: https://issues.apache.org/jira/browse/ARROW-7678
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.15.1
         Environment: Linux, Ubuntu 18.04, arrow/parquet 0.15.1 from 
instructions https://arrow.apache.org/install/
            Reporter: Joshua Pedrick


When I set TZ=CST-8, or any other timezone, in the environment on Linux, the 
time and timestamp columns in my resulting parquet file are corrupted.
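
For reference, a POSIX-style TZ value such as CST-8 names a zone "CST" that is eight hours ahead of UTC (the sign is inverted relative to the usual UTC+8 notation). A minimal sketch for checking what the C library makes of a TZ value from inside a process, assuming Linux with glibc; tm_gmtoff is a glibc/BSD extension, and UtcOffsetSecondsFor is a hypothetical helper that is not part of Arrow or Parquet:

```cpp
#include <cstdlib>
#include <ctime>

// Hypothetical helper (not an Arrow/Parquet API): returns the offset from UTC,
// in seconds, that the C library reports once TZ is set to tz_value.
// Assumes Linux/glibc: setenv, tzset, localtime_r and tm_gmtoff are POSIX or
// glibc extensions rather than standard C++.
long UtcOffsetSecondsFor(const char* tz_value) {
  setenv("TZ", tz_value, 1);  // overwrite any existing TZ for this process
  tzset();                    // make the C library re-read TZ

  std::time_t epoch = 0;
  std::tm local{};
  localtime_r(&epoch, &local);  // break the epoch down in the configured zone

  return local.tm_gmtoff;  // seconds east of UTC
}
```

Calling UtcOffsetSecondsFor("CST-8") should report 28800 seconds, i.e. eight hours east of UTC, which may help confirm that a test process really runs under the same zone as the failing one.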

 

Below are the calls I use to define my schema:

 
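{code:java}
// TIMESTAMP column: is_adjusted_to_utc = true, microsecond precision, INT64 storage.
PrimitiveNode::Make( columnName, Repetition::REQUIRED,
 LogicalType::Timestamp( true, LogicalType::TimeUnit::MICROS, false, false ),
 ::parquet::Type::INT64 );
// TIME column: is_adjusted_to_utc = true, microsecond precision, INT64 storage.
PrimitiveNode::Make( columnName,
 repetition,
 LogicalType::Time( true, LogicalType::TimeUnit::MICROS ),
 ::parquet::Type::INT64 );
{code}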
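
For context, per the Parquet logical type definitions a TIMESTAMP(MICROS) column with is_adjusted_to_utc set holds microseconds since the Unix epoch in UTC, and a TIME(MICROS) column holds microseconds since midnight, both as plain INT64 values. A minimal sketch of producing such values, assuming std::chrono::system_clock counts from the Unix epoch (guaranteed only from C++20, though true in practice on Linux); both helper names are hypothetical and not taken from the reporter's code:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: microseconds since the Unix epoch for a
// TIMESTAMP(MICROS) column. Assumes system_clock's epoch is the Unix epoch.
std::int64_t ToTimestampMicros(std::chrono::system_clock::time_point tp) {
  using namespace std::chrono;
  return duration_cast<microseconds>(tp.time_since_epoch()).count();
}

// Hypothetical helper: microseconds since midnight (UTC) for a TIME(MICROS)
// column, derived from an epoch-based microsecond count.
std::int64_t ToTimeOfDayMicros(std::int64_t timestamp_micros) {
  constexpr std::int64_t kMicrosPerDay = 86400LL * 1000 * 1000;
  std::int64_t rem = timestamp_micros % kMicrosPerDay;
  // C++ remainder keeps the sign of the dividend, so wrap pre-epoch values.
  return rem < 0 ? rem + kMicrosPerDay : rem;
}
```

Neither helper consults the TZ environment variable, so the values handed to the writer in this sketch are the same whatever zone the process runs under.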
I use an Int64Writer for both types. Reading the file back, in this case using 
pandas with pyarrow, but also in C++, raises the following exception:
{code:java}
 File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.{code}
It seems as if the column header still defines a timestamp with a timezone, 
even though I explicitly set is_adjusted_to_utc.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
