[ 
https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912811#comment-16912811
 ] 

Deepak Majeti commented on ARROW-4967:
--------------------------------------

The comments above are correct! The INT96 type is deprecated and its statistics 
are disabled by default. The timestamp byte layout in INT96 is big endian and 
does not comply with the standard sort orders in the spec.

> [C++] Parquet: Object type and stats lost when using 96-bit timestamps
> ----------------------------------------------------------------------
>
>                 Key: ARROW-4967
>                 URL: https://issues.apache.org/jira/browse/ARROW-4967
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.12.1
>         Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
>            Reporter: Diego Argueta
>            Priority: Minor
>              Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes 
> an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema 
> --------------------------------------------------------------------------------
> foo:         OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4 
> --------------------------------------------------------------------------------
> foo:          INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, it appears that this metadata is lost. 
> No object type, and no column stats.
> {code}
> file schema: schema 
> --------------------------------------------------------------------------------
> foo:         OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4 
> --------------------------------------------------------------------------------
> foo:          INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing since the metadata for the exact same data can look 
> different depending on whether a seemingly unrelated flag is set or cleared.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
