Hi Joris -- thank you for investigating. There is code in the Parquet write path that converts the 128-bit decimals to the Parquet representation, which is usually smaller than 16 bytes per value, so I would guess the bug lies in this Arrow 128-bit decimal to Parquet FIXED_LEN_BYTE_ARRAY conversion.
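For what it's worth, the corrupted value is consistent with that theory. Here is a minimal sketch in plain Python (not the actual Arrow C++ code; `min_bytes_for_precision` and `roundtrip` are illustrative names): Parquet stores DECIMAL(p, s) in a FIXED_LEN_BYTE_ARRAY just wide enough to hold p digits, and the unscaled value of Decimal('9223372036854775808') at scale 3 needs 10 bytes while decimal(19, 3) implies only 9, so truncating the big-endian two's-complement representation reproduces the corrupted number exactly:

```python
def min_bytes_for_precision(precision):
    # Smallest n such that an n-byte signed integer can hold 10**precision - 1,
    # which is how Parquet sizes the FIXED_LEN_BYTE_ARRAY for DECIMAL(p, s).
    n = 1
    while 256 ** n // 2 - 1 < 10 ** precision - 1:
        n += 1
    return n

def roundtrip(unscaled, precision):
    # Simulate a writer that truncates the 128-bit two's-complement value to
    # the byte width implied by the declared precision (big-endian), then a
    # reader that sign-extends those bytes back to an integer.
    width = min_bytes_for_precision(precision)
    raw = unscaled.to_bytes(16, "big", signed=True)[-width:]
    return int.from_bytes(raw, "big", signed=True)

# decimal(19, 3) -> 9-byte FIXED_LEN_BYTE_ARRAY, but the unscaled value
# 9223372036854775808 * 10**3 needs 10 bytes, so the top byte is lost.
unscaled = 9223372036854775808 * 1000
print(min_bytes_for_precision(19))  # 9
print(roundtrip(unscaled, 19))      # -221360928884514619392
```

That truncated integer, re-scaled by 10**-3, is exactly the -221360928884514619.392 seen in the repro, which is why shrinking the scale to 2 (the 1.11 case) hides the problem: the unscaled values then fit in the computed byte width.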
On Tue, Jun 2, 2020 at 2:52 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Rich,
>
> Thanks for the report.
> It seems that the issue is in the parquet writing or reading itself, and not
> the pandas<->pyarrow conversion.
>
> Converting from python to pyarrow looks OK:
>
> In [15]: arr = pa.array([decimal.Decimal('9223372036854775808'),
>                          decimal.Decimal('1.111')])
>
> In [16]: arr
> Out[16]:
> <pyarrow.lib.Decimal128Array object at 0x7fd07d79a468>
> [
>   9223372036854775808.000,
>   1.111
> ]
>
> But then writing and reading again to/from parquet gives the issue:
>
> In [17]: pq.write_table(pa.table({'a': arr}), "test_decimal.parquet")
>
> In [18]: pq.read_table("test_decimal.parquet")
> Out[18]:
> pyarrow.Table
> a: decimal(19, 3)
>
> In [19]: pq.read_table("test_decimal.parquet").column('a')
> Out[19]:
> <pyarrow.lib.ChunkedArray object at 0x7fd0711e9f98>
> [
>   [
>     -221360928884514619.392,
>     1.111
>   ]
> ]
>
> This happens here with a "decimal(19, 3)" type; when using 1.11 instead of
> 1.111, the decimal type is "decimal(19, 2)".
>
> I am not too familiar with the decimal type, but I opened a JIRA issue for
> this: https://issues.apache.org/jira/browse/PARQUET-1869
>
> Joris
>
> On Mon, 1 Jun 2020 at 23:39, Rich Bramante <[email protected]> wrote:
>>
>> Python 3.7.6 (default, Jan 30 2020, 10:29:04)
>> [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
>> print(pyarrow.__version__)
>> 0.17.1
>>
>> Seeing an issue where DECIMAL values written can seem to be corrupted based
>> on very subtle changes to the data set.
>> Example:
>>
>> #!/bin/python3
>>
>> import pandas as pd
>> import decimal
>> import pyarrow.parquet as pq
>>
>> #$ python3
>> # Python 3.7.6 (default, Jan 30 2020, 10:29:04)
>> # [GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
>> # >>> print(pyarrow.__version__)
>> # 0.17.1
>>
>> # Results in unexpected output
>> df = pd.DataFrame({"values": [decimal.Decimal('9223372036854775808'),
>>     decimal.Decimal('18446744073709551616'), decimal.Decimal('2147483648'),
>>     decimal.Decimal('1.111'), decimal.Decimal('-2'), decimal.Decimal('0')]})
>>
>> df.to_parquet("/tmp/f")
>> pq_file = pq.ParquetFile("/tmp/f")
>> print(pq_file.read().to_pandas())
>>
>> # Values Read:
>> # -221360928884514619.392, -442721857769029238.784,
>> # 2147483648.000, 1.111, -2.000, 0.000
>>
>> # Results in expected output (only difference is 1.11 vs. 1.111)
>> df = pd.DataFrame({"values": [decimal.Decimal('9223372036854775808'),
>>     decimal.Decimal('18446744073709551616'), decimal.Decimal('2147483648'),
>>     decimal.Decimal('1.11'), decimal.Decimal('-2'), decimal.Decimal('0')]})
>>
>> df.to_parquet("/tmp/f")
>> pq_file = pq.ParquetFile("/tmp/f")
>> print(pq_file.read().to_pandas())
>>
>> # Values Read:
>> # 9223372036854775808.00, 18446744073709551616.00,
>> # 2147483648.00, 1.11, -2.00, 0.00
