[ https://issues.apache.org/jira/browse/ARROW-7747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Théophile Chevalier updated ARROW-7747:
---------------------------------------
Description:
Hi,

I've encountered what seems to me to be a bug using
{noformat}
pyarrow==0.15.1
pandas==0.25.3
numpy==1.18.1{noformat}

I'm trying to write a table containing nanosecond timestamps to a millisecond schema. Here is a minimal example:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

pyarrow_schema = pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))])
timestamp = np.datetime64("2019-06-21T22:13:02.901123")
d = {"datetime_ms": timestamp}
df = pd.DataFrame(d, index=range(1))
table = pa.Table.from_pandas(df, schema=pyarrow_schema)
pq.write_table(
    table,
    "test.parquet",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)
{code}
This raises:
{noformat}
pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would lose data: 1561155182901123000', 'Conversion failed for column datetime_ms with type datetime64[ns]'){noformat}

From my understanding, the expected behaviour should be for Arrow to allow the conversion anyway, even if it loses some data.

Related discussions:
- https://github.com/apache/arrow/issues/1920
- https://issues.apache.org/jira/browse/ARROW-2555

This test https://github.com/apache/arrow/blob/f70dbd1dbdb51a47e6a8a8aac8efd40ccf4d44f2/python/pyarrow/tests/test_parquet.py#L846 does not explicitly check for nanosecond timestamps.

To be honest, I've not looked at the code yet, so let me know whether I missed something. I'd be happy to fix it if it's really a bug.
> [Python] coerce_timestamps + allow_truncated_timestamps does not work as
> expected with nanoseconds
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7747
>                 URL: https://issues.apache.org/jira/browse/ARROW-7747
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Théophile Chevalier
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)