Hi Rich,
Thanks for the report.
It seems that the issue is in the Parquet writing or reading itself, and
not in the pandas<->pyarrow conversion.
Converting from Python to pyarrow looks OK (after "import decimal",
"import pyarrow as pa" and "import pyarrow.parquet as pq"):
In [15]: arr = pa.array([decimal.Decimal('9223372036854775808'),
decimal.Decimal('1.111')])
In [16]: arr
Out[16]:
<pyarrow.lib.Decimal128Array object at 0x7fd07d79a468>
[
9223372036854775808.000,
1.111
]
But writing the array to Parquet and reading it back reproduces the issue:
In [17]: pq.write_table(pa.table({'a': arr}), "test_decimal.parquet")
In [18]: pq.read_table("test_decimal.parquet")
Out[18]:
pyarrow.Table
a: decimal(19, 3)
In [19]: pq.read_table("test_decimal.parquet").column('a')
Out[19]:
<pyarrow.lib.ChunkedArray object at 0x7fd0711e9f98>
[
[
-221360928884514619.392,
1.111
]
]
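As a rough sanity check on where that number might come from (an
observation from the arithmetic, not a confirmed diagnosis): the corrupted
value is exactly what you get if the unscaled integer 9223372036854775808000
is cut down to 9 bytes and read back as a signed two's-complement number.
The 9-byte width is my assumption about how a precision-19 decimal is
stored, and wrap_to_bytes is a hypothetical helper, not a pyarrow function.
It only needs the standard library:

```python
import decimal


def wrap_to_bytes(value, scale, nbytes):
    """Hypothetical helper: keep only the low `nbytes` bytes of the unscaled
    integer and reinterpret them as a signed two's-complement number."""
    unscaled = int(value * 10 ** scale)  # e.g. 1.111 at scale 3 -> 1111
    low_bytes = (unscaled % (1 << (8 * nbytes))).to_bytes(nbytes, "big")
    wrapped = int.from_bytes(low_bytes, "big", signed=True)
    return decimal.Decimal(wrapped).scaleb(-scale)


# Assumption: a precision-19 decimal is stored in 9 bytes.
for text in ["9223372036854775808", "18446744073709551616", "1.111"]:
    print(text, "->", wrap_to_bytes(decimal.Decimal(text), scale=3, nbytes=9))
# 9223372036854775808 -> -221360928884514619.392
# 18446744073709551616 -> -442721857769029238.784
# 1.111 -> 1.111
```

With scale=2 the same large values still fit in 9 bytes and come back
unchanged, which would be consistent with the 1.11 variant in the original
report reading back correctly.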
This happens here with a "decimal(19, 3)" type. When using 1.11 instead of
1.111, the decimal type is "decimal(19, 2)".
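One thing that may be worth checking (I have not confirmed it is related)
is whether a precision of 19 is actually enough for these values:
9223372036854775808 has 19 integer digits, so holding it at scale 3 takes
22 digits in total. A small standard-library sketch of that count follows;
required_precision is a hypothetical helper for illustration, not part of
pyarrow:

```python
import decimal


def required_precision(values):
    """Hypothetical helper: return the (precision, scale) needed to hold
    every value exactly in a single decimal type."""
    max_int_digits = 0
    scale = 0
    for value in values:
        _, digits, exponent = value.as_tuple()
        # Digits after the decimal point, and digits before it.
        scale = max(scale, max(0, -exponent))
        max_int_digits = max(max_int_digits, max(0, len(digits) + exponent))
    return max_int_digits + scale, scale


big = decimal.Decimal("9223372036854775808")
print(required_precision([big, decimal.Decimal("1.111")]))  # (22, 3)
print(required_precision([big, decimal.Decimal("1.11")]))   # (21, 2)
```

Both results are larger than the precision of 19 shown in the schema.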
I am not too familiar with the decimal type, but I opened a JIRA issue for
this: https://issues.apache.org/jira/browse/PARQUET-1869
Joris
On Mon, 1 Jun 2020 at 23:39, Rich Bramante wrote:
> Python 3.7.6 (default, Jan 30 2020, 10:29:04)
> [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
> print(pyarrow.__version__)
> 0.17.1
>
> Seeing an issue where DECIMAL values written can seem to be corrupted
> based on very subtle changes to the data set. Example:
>
> #!/bin/python3
>
> import pandas as pd
> import decimal
> import pyarrow.parquet as pq
>
> #$ python3
> # Python 3.7.6 (default, Jan 30 2020, 10:29:04)
> # [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
> # >>> print(pyarrow.__version__)
> # 0.17.1
>
> # Results in unexpected output
> df = pd.DataFrame({"values": [decimal.Decimal('9223372036854775808'),
> decimal.Decimal('18446744073709551616'), decimal.Decimal('2147483648'),
> decimal.Decimal('1.111'), decimal.Decimal('-2'), decimal.Decimal('0')]})
>
> df.to_parquet("/tmp/f")
> pq_file = pq.ParquetFile("/tmp/f")
> print (pq_file.read().to_pandas())
>
> #Values Read:
> # -221360928884514619.392,-442721857769029238.784,2147483648.000,1.111,-2.000,0.000
>
> # Results in expected output (only difference is 1.11 vs. 1.111)
> df = pd.DataFrame({"values": [decimal.Decimal('9223372036854775808'),
> decimal.Decimal('18446744073709551616'), decimal.Decimal('2147483648'),
> decimal.Decimal('1.11'), decimal.Decimal('-2'), decimal.Decimal('0')]})
>
> #Values Read:
> # 9223372036854775808.00,18446744073709551616.00,2147483648.00,1.11,-2.00,0.00
>
> df.to_parquet("/tmp/f")
> pq_file = pq.ParquetFile("/tmp/f")
> print (pq_file.read().to_pandas())
>
>