Jorge created ARROW-9502:
----------------------------

             Summary: [Python][C++] Date64 converted to Date32 on parquet
                 Key: ARROW-9502
                 URL: https://issues.apache.org/jira/browse/ARROW-9502
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Jorge

Executing the example below,
{code:python}
import datetime

import pyarrow as pa
import pyarrow.parquet

# Two datetimes on the same calendar day, one with a time-of-day component.
data = [
    datetime.datetime(2000, 1, 1, 12, 34, 56, 123456),
    datetime.datetime(2000, 1, 1)
]
data32 = pa.array(data, type='date32')
data64 = pa.array(data, type='date64')
table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b'])

# Round-trip the table through a parquet file.
pyarrow.parquet.write_table(table, 'a.parquet')

print(table)
print()
print(pyarrow.parquet.read_table('a.parquet'))
{code}
yields
{code:java}
pyarrow.Table
a: date32[day]
b: date64[ms]

pyarrow.Table
a: date32[day]
b: date32[day] <------- IMO it should be date64[ms]
{code}
indicating that pyarrow converted its date64[ms] schema to date32[day].

I used the Rust crate to print the parquet file's metadata, and the value is indeed stored as i32, which suggests that the conversion happens in the writer, not in the reader.

IMO this has no practical implication, because both types are dates and a 32-bit day count already covers about ±5.8 million years around the epoch. It is still an error, though, because the round-trip serialization does not yield the same schema.

A broader question I have is why date64 exists in the first place. I can't see any reason to store a *date* in milliseconds since the epoch.
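
For reference, the relation between the two representations can be sketched in plain Python, without pyarrow. This is only an illustration, assuming that date64 values are whole-day multiples of 86400000 ms as the Arrow format describes; the helper names (`to_date32`, `to_date64`, `date64_to_date32`) are hypothetical and are not pyarrow or parquet APIs. It shows that narrowing a whole-day date64 to a day count keeps the value, so only the schema changes in the round trip, and the last lines compute the largest day offset each type can express.

```python
import datetime

MS_PER_DAY = 86_400_000  # milliseconds in one day, ignoring leap seconds
EPOCH = datetime.date(1970, 1, 1)


def to_date32(value: datetime.date) -> int:
    """Days since the UNIX epoch, i.e. what a date32 slot would hold."""
    return (value - EPOCH).days


def to_date64(value: datetime.date) -> int:
    """Milliseconds since the UNIX epoch at midnight of `value`."""
    return to_date32(value) * MS_PER_DAY


def date64_to_date32(ms: int) -> int:
    """Narrow a whole-day date64 value to a date32 day count."""
    return ms // MS_PER_DAY


day = datetime.date(2000, 1, 1)
ms = to_date64(day)

# The value survives the narrowing and can be widened back unchanged.
assert date64_to_date32(ms) == to_date32(day)
assert date64_to_date32(ms) * MS_PER_DAY == ms

# Largest day offsets that a signed 32-bit day count and a signed
# 64-bit millisecond count can express.
max_days_date32 = 2**31 - 1
max_days_date64 = (2**63 - 1) // MS_PER_DAY
print(max_days_date32, max_days_date64)
```

Whether the parquet writer should record enough information for the reader to restore date64[ms] is a separate design question that this sketch does not address.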