Max Firman created ARROW-7350:
---------------------------------

Summary: [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
Key: ARROW-7350
URL: https://issues.apache.org/jira/browse/ARROW-7350
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Max Firman

Parquet file metadata for Decimal-type columns contains min and max statistics that are returned as raw bytes instead of being decoded into Decimal values. This causes issues in dependent libraries such as Dask (see [https://github.com/dask/dask/issues/5647]).

{code:python|title=Reproducible example|borderStyle=solid}
from decimal import Decimal
import random

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

NUM_DATA_POINTS_PER_PARTITION = 25

random.seed(0)

# Build a single Decimal column of random values.
data1 = [
    {"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")}
    for _ in range(NUM_DATA_POINTS_PER_PARTITION)
]
df = pd.DataFrame(data1)

table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_data.parquet')

# Read the column statistics back from the file metadata.
parquet_file = pq.ParquetFile('my_data.parquet')
statistics = parquet_file.metadata.row_group(0).column(0).statistics
assert isinstance(statistics.min, Decimal)  # <-- AssertionError here because min has type bytes rather than Decimal
assert isinstance(statistics.max, Decimal)
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
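
Possible interim workaround: the raw statistics bytes can be decoded by hand. The following is only a minimal sketch, not part of pyarrow. It assumes the bytes hold the unscaled decimal value as a big-endian two's-complement integer, which is how the Parquet format encodes DECIMAL in (fixed-length) byte arrays, and that the caller supplies the column's scale separately. The helper name decode_decimal_stat is hypothetical.

```python
from decimal import Decimal


def decode_decimal_stat(raw: bytes, scale: int) -> Decimal:
    """Decode a raw Parquet DECIMAL statistic into a Decimal (hypothetical helper).

    Assumption: `raw` is the unscaled value as a big-endian two's-complement
    integer, and `scale` is the number of digits after the decimal point.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the precision of the current decimal context.
    return Decimal((sign, digits, -scale))
```

With such a helper, the min and max values from the example above could be passed through it together with the scale of the column's decimal type. Whether this matches every writer's encoding has not been verified here.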