[ https://issues.apache.org/jira/browse/ARROW-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Théophile Chevalier updated ARROW-7747:
---------------------------------------
Description:
Hi,

I've encountered what seems to me to be a bug using
{noformat}
pyarrow==0.15.1
pandas==0.25.3
numpy==1.18.1{noformat}

I'm trying to write a table containing nanosecond timestamps to a millisecond schema. Here is a minimal example:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

pyarrow_schema = pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))])
timestamp = np.datetime64("2019-06-21T22:13:02.901123")
d = {"datetime_ms": timestamp}
df = pd.DataFrame(d, index=range(1))
table = pa.Table.from_pandas(df, schema=pyarrow_schema)
pq.write_table(
    table,
    "test.parquet",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)
{code}
This raises:
{noformat}
pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1561155182901123000', 'Conversion failed for column datetime_ms with type datetime64[ns]'){noformat}

From my understanding, the expected behaviour should be for Arrow to allow the conversion anyway, even if it loses some data.

Related discussions:
- https://github.com/apache/arrow/issues/1920
- https://issues.apache.org/jira/browse/ARROW-2555

This test https://github.com/apache/arrow/blob/f70dbd1dbdb51a47e6a8a8aac8efd40ccf4d44f2/python/pyarrow/tests/test_parquet.py#L846 does not explicitly check for nanosecond timestamps.

To be honest, I've not looked at the code yet, so let me know whether I missed something. I'd be happy to fix it if it's really a bug.
> [Python] coerce_timestamps + allow_truncated_timestamps does not work as
> expected with nanoseconds
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7747
>                 URL: https://issues.apache.org/jira/browse/ARROW-7747
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Théophile Chevalier
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)