[jira] [Commented] (ARROW-4967) Object type and stats lost when using 96-bit timestamps

Joris Van den Bossche (JIRA) Wed, 08 May 2019 02:54:25 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835477#comment-16835477
 ]


Joris Van den Bossche commented on ARROW-4967:
----------------------------------------------

[~yiannisliodakis] Regarding the logical type, I think this is expected: INT96 
is only a physical type in the Parquet format, and there is no timestamp-like 
logical type that uses INT96 as its physical type. 

The use of INT96 for timestamps stems only from a convention in some Parquet 
implementations (I think Hive and Impala, but I am not very familiar with 
them), so Arrow has an option to write them for compatibility with those 
systems. Note, however, that this type is actually deprecated in the Parquet 
format.
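
To make the "convention" concrete, here is a minimal sketch of how I understand 
the INT96 timestamp layout used by those systems. The layout is an assumption 
that is not stated in this thread: 12 little-endian bytes, the first 8 holding 
nanoseconds within the day and the last 4 holding the Julian day number. The 
function names are hypothetical and are not part of pyarrow.

```python
import datetime as dt
import struct

# Julian day number of the Unix epoch (1970-01-01).
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_MICRO = 1000


def encode_int96_timestamp(value: dt.datetime) -> bytes:
    """Pack a naive datetime into the assumed 12-byte INT96 layout."""
    midnight = value.replace(hour=0, minute=0, second=0, microsecond=0)
    since_midnight = value - midnight
    nanos_of_day = (
        since_midnight.seconds * 1_000_000 + since_midnight.microseconds
    ) * NANOS_PER_MICRO
    days_since_epoch = (value.date() - dt.date(1970, 1, 1)).days
    julian_day = days_since_epoch + JULIAN_DAY_OF_UNIX_EPOCH
    # "<qi": little-endian int64 (nanoseconds of day), then int32 (Julian day).
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96_timestamp(raw: bytes) -> dt.datetime:
    """Unpack the assumed INT96 layout into a naive datetime.

    Python datetimes only carry microseconds, so sub-microsecond
    precision is truncated here.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value is exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return dt.datetime(1970, 1, 1) + dt.timedelta(
        days=days_since_epoch,
        microseconds=nanos_of_day // NANOS_PER_MICRO,
    )


if __name__ == "__main__":
    ts = dt.datetime(2019, 12, 31, 23, 59, 59, 999000)
    raw = encode_int96_timestamp(ts)
    print(raw.hex(), decode_int96_timestamp(raw))
```

Nothing in those 12 bytes says "timestamp": a reader has to know the convention 
in order to interpret them, which fits with the schema showing only the 
physical type.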

See e.g. [https://stackoverflow.com/a/54665645/653364], 
[https://stackoverflow.com/questions/42628287/sparks-int96-time-type] and the 
discussion in [https://github.com/apache/parquet-format/pull/49].

 

That's the explanation for the missing logical type. For the missing stats, I 
am not sure.

> Object type and stats lost when using 96-bit timestamps
> -------------------------------------------------------
>
>                 Key: ARROW-4967
>                 URL: https://issues.apache.org/jira/browse/ARROW-4967
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.12.1
>         Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
>            Reporter: Diego Argueta
>            Priority: Minor
>              Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes 
> an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema 
> --------------------------------------------------------------------------------
> foo:         OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4 
> --------------------------------------------------------------------------------
> foo:          INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 
> 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, it appears that that metadata is lost. 
> No object type, and no column stats.
> {code}
> file schema: schema 
> --------------------------------------------------------------------------------
> foo:         OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4 
> --------------------------------------------------------------------------------
> foo:          INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing since the metadata for the exact same data can look 
> different depending on an unrelated flag being set or cleared.



