Jorge created ARROW-9502:
----------------------------

             Summary: [Python][C++] Date64 converted to Date32 on parquet
                 Key: ARROW-9502
                 URL: https://issues.apache.org/jira/browse/ARROW-9502
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Jorge


Executing the example below, 

{code:python}
import datetime
import pyarrow as pa
import pyarrow.parquet as pq

data = [
    datetime.datetime(2000, 1, 1, 12, 34, 56, 123456),
    datetime.datetime(2000, 1, 1),
]

# Build the same values once as date32 and once as date64.
data32 = pa.array(data, type='date32')
data64 = pa.array(data, type='date64')
table = pa.Table.from_arrays([data32, data64], names=['a', 'b'])

pq.write_table(table, 'a.parquet')

# Compare the in-memory schema with the schema read back from disk.
print(table)
print()
print(pq.read_table('a.parquet'))
{code}

yields


{noformat}
pyarrow.Table
a: date32[day]
b: date64[ms]

pyarrow.Table
a: date32[day]
b: date32[day]   <------- IMO it should be date64[ms]
{noformat}

indicating that pyarrow converted the date64[ms] column to date32[day]. I used
the Rust parquet crate to print the file's metadata, and the value is indeed
stored as i32, which suggests that the conversion happens in the writer, not the
reader.

IMO this has little practical implication, because both types represent dates
and a 32-bit day count already spans roughly 5.8 million years on either side of
the epoch, but it is still an error because the roundtrip serialization does not
yield the same schema.
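
For reference, here is a minimal sketch of the arithmetic behind the two
encodings, using only the standard library. It assumes date32 holds whole days
since the UNIX epoch and date64 holds milliseconds since the epoch at midnight.
The helper names are made up for illustration and are not part of pyarrow. It
shows that a day-aligned value survives narrowing from milliseconds to days, and
it compares the largest value each width can reach, expressed in days.

```python
import datetime

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86_400_000 milliseconds in one day
EPOCH = datetime.date(1970, 1, 1)


def date_to_days(d):
    """Days since the epoch: the kind of value a date32 slot holds."""
    return (d - EPOCH).days


def date_to_ms(d):
    """Milliseconds since the epoch at midnight: a date64-style value."""
    return date_to_days(d) * MS_PER_DAY


def ms_to_days(ms):
    """Narrow a millisecond value to whole days (floor division)."""
    return ms // MS_PER_DAY


d = datetime.date(2000, 1, 1)
days = date_to_days(d)
ms = date_to_ms(d)

# A day-aligned millisecond value converts back to the same day count.
roundtrip_ok = ms_to_days(ms) == days

# Largest positive value of each representation, expressed in days.
max_days_32 = 2**31 - 1
max_days_64 = (2**63 - 1) // MS_PER_DAY

print(days, ms, roundtrip_ok)  # 10957 946684800000 True
print(max_days_32, max_days_64)
```

Under these assumptions the values are preserved when milliseconds are narrowed
to days, so only the schema differs after the roundtrip.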

A broader question I have is why date64 exists in the first place. I can't see
any reason to store a *date* in milliseconds since the epoch.



