[ https://issues.apache.org/jira/browse/ARROW-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

paul hess updated ARROW-8100:
-----------------------------
    Description: 
I expect that either timestamp[ms] or date64 will give me a millisecond-precision 
datetime/timestamp when written to a Parquet file. Instead, this is the behavior 
I see:


>>> arr = pa.array([datetime(2020, 12, 20)])

(I have also used pa.array([datetime(2020, 12, 20)], type=pa.timestamp('ms')) 
with no later casting.)

>>> arr.cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.TimestampArray object at 0x117f3d4c8>
[
  2020-12-20 00:00:00.000
]
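
(Note: cast returns a new array rather than converting in place, so arr itself 
keeps the inferred microsecond unit unless the result is assigned. A minimal 
sketch, with arr_ms as an illustrative name:)

>>> arr_ms = arr.cast(pa.timestamp('ms'), safe=False)
>>> arr_ms.type   # timestamp[ms]
>>> arr.type      # still timestamp[us]; arr was not modified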

 

>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: timestamp[us]

 

# just to make sure

 

>>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
[
  [
    2020-12-20 00:00:00.000
  ]
]

 

# just to make extra sure

 

>>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])
>>> table.cast(schema, safe=False)
>>> parquet.write_table(table,
...                     "sldkfjasldkfj.parquet",
...                     coerce_timestamps="ms",
...                     compression="SNAPPY",
...                     allow_truncated_timestamps=True)
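
(Note: Table.cast also returns a new table, so the call above writes the 
original, un-cast table. A minimal sketch of the assigned version, with 
table_ms as an illustrative name:)

>>> table_ms = table.cast(schema, safe=False)
>>> parquet.write_table(table_ms,
...                     "sldkfjasldkfj.parquet",
...                     coerce_timestamps="ms",
...                     compression="SNAPPY",
...                     allow_truncated_timestamps=True)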

Result for the written file:

Schema:

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "start_date",
    "type" : [ "null", { "type" : "long", "logicalType" : "timestamp-millis" } ],
    "default" : null
  } ]
}

Data:
||start_date||
|1608422400000|

 

That is a microsecond [us] value, despite casting to [ms] and setting the 
appropriate options on the write_table method. If it were a millisecond 
timestamp, it would translate back to a datetime accurately with fromtimestamp, 
but:
>>> from datetime import datetime
>>> datetime.fromtimestamp(1608422400000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year 52938 is out of range
>>> datetime.fromtimestamp(1608422400000 / 1000)
datetime.datetime(2020, 12, 19, 16, 0)
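
(For reference: fromtimestamp expects seconds and converts in the local 
timezone, which is where the 16:00 on the previous day comes from; the session 
above looks like a UTC-8 zone. A timezone-aware sketch:)

>>> from datetime import datetime, timezone
>>> datetime.fromtimestamp(1608422400000 / 1000, tz=timezone.utc)
datetime.datetime(2020, 12, 20, 0, 0, tzinfo=datetime.timezone.utc)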
  

 

OK, so then we should use the date64() type; after all, the docs say "Create 
instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)":

 
>>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
>>> arr
<pyarrow.lib.Date64Array object at 0x11da877c8>
[
  2020-12-20
]

>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: date64[ms]

parquet.write_table(table,
                    "bebedabeep.parquet",
                    coerce_timestamps="ms",
                    compression="SNAPPY",
                    allow_truncated_timestamps=True)

Result for the written file:

Schema:

{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "start_date",
    "type" : [ "null", { "type" : "int", "logicalType" : "date" } ],
    "default" : null
  } ]
}

Data:

 
||start_date||
|18616|

 
That is "days since UNIX epoch 1970-01-01", just like the date32() type; the 
time info is stripped off. We can confirm this:

>>> arr.to_pylist()
[datetime.date(2020, 12, 20)]
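
(As a cross-check, that day count lines up exactly with the millisecond value 
written by the first file; quick arithmetic:)

>>> 18616 * 24 * 60 * 60 * 1000   # days -> ms since the epoch
1608422400000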
  

How do I write a millisecond precision timestamp with pyarrow.parquet?
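
(For anyone reproducing this, one way to see the Arrow type a written file 
maps back to is to read it with pyarrow itself; a sketch using the first file 
from above:)

>>> import pyarrow.parquet as pq
>>> pq.read_table("sldkfjasldkfj.parquet").schema   # reports the unit round-tripped from disk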


> timestamp[ms] and date64 data types not working as expected on write
> --------------------------------------------------------------------
>
>                 Key: ARROW-8100
>                 URL: https://issues.apache.org/jira/browse/ARROW-8100
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.15.1
>            Reporter: paul hess
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
