[ https://issues.apache.org/jira/browse/ARROW-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
paul hess updated ARROW-8100:
-----------------------------
Description:

I expect that either timestamp[ms] or date64 will give me a millisecond-precision datetime/timestamp as written to a parquet file; instead this is the behavior I see:

>>> arr = pa.array([datetime(2020, 12, 20)])
>>> arr.cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.TimestampArray object at 0x117f3d4c8>
[
  2020-12-20 00:00:00.000
]
>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: timestamp[us]

# just to make sure
>>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
[
  [
    2020-12-20 00:00:00.000
  ]
]

# just to make extra sure
>>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])
>>> table.cast(schema, safe=False)
>>> parquet.write_table(table,
...                     "sldkfjasldkfj.parquet",
...                     coerce_timestamps="ms",
...                     compression="SNAPPY",
...                     allow_truncated_timestamps=True)

Result for the written file:

Schema (the field definition was lost to an "Unknown macro" rendering error; only the outer wrapper survives):
{quote}
{ "type" : "record", "name" : "schema", "fields" : [ ... ] }
{quote}

Data:
||start_date||
|1608422400000|

That is a microsecond [us] value, despite casting to [ms] and setting the appropriate options on the write_table method. If it were a millisecond timestamp it would translate back to a datetime accurately with fromtimestamp, but:

>>> from datetime import datetime
>>> datetime.fromtimestamp(1608422400000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year 52938 is out of range
>>> datetime.fromtimestamp(1608422400000 / 1000)
datetime.datetime(2020, 12, 19, 16, 0)

OK, so then we should use the date64() type; after all, the docs say *_Create instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)_*:

>>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
>>> arr
<pyarrow.lib.Date64Array object at 0x11da877c8>
[
  2020-12-20
]
>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: date64[ms]
>>> parquet.write_table(table,
...                     "bebedabeep.parquet",
...                     coerce_timestamps="ms",
...                     compression="SNAPPY",
...                     allow_truncated_timestamps=True)

Result for the written file:

Schema (same "Unknown macro" rendering loss as above):
{quote}
{ "type" : "record", "name" : "schema", "fields" : [ ... ] }
{quote}

Data:
||start_date||
|18616|

That is "days since UNIX epoch 1970-01-01", just like the date32() type; the time info is stripped off. We can confirm this:

>>> arr.to_pylist()
[datetime.date(2020, 12, 20)]

How do I write a millisecond precision timestamp with pyarrow.parquet?
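A minimal sketch (not part of the original report) of the write path the transcript above appears to be aiming for. One detail from the session is worth calling out: Table.cast returns a new table rather than modifying the table in place, so its result has to be captured before calling write_table; in the transcript the cast result is discarded and the original timestamp[us] table is what gets written. The output file name below is made up for illustration.

# Sketch only -- assumes a single millisecond-precision timestamp column;
# the file name "start_date_ms.parquet" is invented for illustration.
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([datetime(2020, 12, 20)])
table = pa.Table.from_arrays([arr], names=["start_date"])

# Table.cast returns a new table; capture the result instead of discarding it.
schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])
table_ms = table.cast(schema, safe=False)

pq.write_table(
    table_ms,
    "start_date_ms.parquet",
    coerce_timestamps="ms",            # ask the writer to store millisecond timestamps
    compression="SNAPPY",
    allow_truncated_timestamps=True,
)

# Read the file back and inspect the schema that was actually written.
print(pq.read_table("start_date_ms.parquet").schema)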
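For reference on the fromtimestamp check above, a small sketch (not from the report) that interprets the raw integer read back from the file as a millisecond epoch value. datetime.fromtimestamp converts in the machine's local timezone, which is where the 2020-12-19 16:00 result comes from; utcfromtimestamp avoids that shift.

# Sketch only: treat the raw value from the file as milliseconds since the
# UNIX epoch and convert without the local-timezone shift that
# datetime.fromtimestamp applies.
from datetime import datetime

raw = 1608422400000  # value read back from the written file

print(datetime.utcfromtimestamp(raw / 1000))  # 2020-12-20 00:00:00 if raw is milliseconds
print(raw // 1000 // 86400)                   # 18616 whole days since 1970-01-01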
was: (previous revision of the same description; it differs only in that the second write_table call used the local absolute path "/Users/hessp/ddt/rest-ingress/bebedabeep.parquet")
> timestamp[ms] and date64 data types not working as expected on write
> --------------------------------------------------------------------
>
>                 Key: ARROW-8100
>                 URL: https://issues.apache.org/jira/browse/ARROW-8100
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: paul hess
>            Priority: Major
>
> (issue description as quoted above)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)