[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL

nero (Jira) Fri, 11 Feb 2022 00:36:42 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490747#comment-17490747
 ]


nero commented on ARROW-15492:
------------------------------

> There is no guarantee local time aligns with the writer's timezone. I think 
> the C++ library has started vendoring the necessary utilities to do the time 
> zone conversions

Looking forward to it. (y)
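
As a small illustration of the quoted point, the sketch below uses only the standard library to show that one naive wall-clock reading maps to different instants depending on the zone it is assumed to be in. The fixed UTC+8 offset is a stand-in for Asia/Shanghai (an assumption that ignores historical offset changes), and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset used as a stand-in for Asia/Shanghai (assumption).
UTC8 = timezone(timedelta(hours=8))

wall_clock = datetime(2022, 1, 1)  # naive: no time zone attached

# The same wall-clock reading denotes different instants in different zones.
as_utc = wall_clock.replace(tzinfo=timezone.utc)
as_utc8 = wall_clock.replace(tzinfo=UTC8)

epoch_utc = int(as_utc.timestamp())    # 1640995200
epoch_utc8 = int(as_utc8.timestamp())  # 1640966400

print(epoch_utc - epoch_utc8)  # 28800 seconds, i.e. 8 hours
```

A reader that only sees the naive value cannot tell which of the two instants the writer meant, which is why the zone has to travel with the data in some form.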

 

> An alternative could also be to provide additional metadata that consumers 
> could use to determine the source and pad as necessary outside of pyarrow.  

Yes, that should be possible, since Arrow knows the writer's time zone, or the 
time zone stored in the pyarrow.timestamp type, when it saves a table as a 
Parquet file.

 

But for now, PyArrow does not seem to restore the time zone from the metadata 
when a Parquet file was saved with use_deprecated_int96_timestamps=True 
(pyarrow.parquet.write_table).
{code:python}
from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as parquet

date = datetime(2022, 1, 1)
timestamp = int(date.timestamp())

pa_array = pa.array([timestamp])
pa_fields = [pa.field("t", pa.timestamp('s', tz='Asia/Shanghai'))]
pa_table = pa.Table.from_arrays([pa_array], schema=pa.schema(pa_fields))
print(pa_table)
# pyarrow.Table
# t: timestamp[ns, tz=Asia/Shanghai]
# ----
# t: [[1970-01-01 00:00:01.640966400]]
print(pa_table.to_pandas())
#                           t
# 0 2022-01-01 00:00:00+08:00


# A: write the pyarrow.Table to Parquet (INT64 with Timestamp logical type)
parquet.write_table(pa_table, "test_int64_timestamp.parquet")
print(parquet.read_table("test_int64_timestamp.parquet"))
# pyarrow.Table
# t: timestamp[ms, tz=Asia/Shanghai]
# ----
# t: [[2021-12-31 16:00:00.000]]

# Same as pa_table (time zone preserved); works fine here

# ---------------------------------------------------------------------------------------

# B: write pyarrow.Table to parquet (INT96) 
parquet.write_table(
    pa_table, "test_int96.parquet", use_deprecated_int96_timestamps=True
)
print(parquet.read_table("test_int96.parquet"))

# The time zone is lost here

# pyarrow.Table
# t: timestamp[ns]
# ----
# t: [[2021-12-31 16:00:00.000000000]]

# This also affects the resulting pandas.DataFrame
print(parquet.read_table("test_int96.parquet").to_pandas())
#                     t
# 0 2021-12-31 16:00:00
{code}
 

Maybe Arrow should add the time zone to the metadata when writing timestamp 
data as INT96?

> [Python] handle timestamp type in parquet file for compatibility with older 
> HiveQL
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-15492
>                 URL: https://issues.apache.org/jira/browse/ARROW-15492
>             Project: Apache Arrow
>          Issue Type: New Feature
>    Affects Versions: 6.0.1
>            Reporter: nero
>            Priority: Major
>
> Hi there,
> I ran into an issue when writing a Parquet file with PyArrow.
> Older versions of Hive only recognize timestamps stored as INT96, so I use 
> table.write_to_data with the `use_deprecated_int96_timestamps=True` option to 
> save the Parquet file. However, HiveQL skips the timestamp conversion when 
> the created_by metadata of the Parquet file is not "parquet-mr".
> [hive/ParquetRecordReaderBase.java at 
> f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive 
> (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]
>  
> So I have to save the timestamp columns with time zone info (padded to UTC+8).
> But when pyarrow.parquet reads from a directory that contains Parquet files 
> created by both PyArrow and parquet-mr, the resulting Arrow Table ignores the 
> time zone info for the parquet-mr files.
>  
> Maybe PyArrow could expose the created_by option ({*}preferred{*}; 
> parquet::WriterProperties::created_by is available in the C++ library).
> Or could it handle timestamps with a time zone in files created by parquet-mr?
>  
> Maybe related to https://issues.apache.org/jira/browse/ARROW-14422



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
