Karl Dunkle Werner created ARROW-7345:
-----------------------------------------

Summary: [Python] Writing partitions with NaNs silently drops data
Key: ARROW-7345
URL: https://issues.apache.org/jira/browse/ARROW-7345
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Karl Dunkle Werner

When writing a partitioned table, rows whose partitioning column is NA are silently dropped. I think it would be helpful if there were a warning. Even better, from my perspective, would be writing out those partitions with a directory name like {{partition_col=NaN}}.

Here's a small example where only the {{b = 2}} group is written out and the {{b = NaN}} group is dropped.

{code:python}
import os
import tempfile
import pyarrow.json
import pyarrow.parquet
from pathlib import Path

# Create a dataset with NaN:
json_str = """
{"a": 1, "b": 2}
{"a": 2, "b": null}
"""
with tempfile.NamedTemporaryFile() as tf:
    tf = Path(tf.name)
    tf.write_text(json_str)
    table = pyarrow.json.read_json(tf)

# Write out a partitioned dataset, using the NaN-containing column
with tempfile.TemporaryDirectory() as out_dir:
    pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
    print(os.listdir(out_dir))
    read_table = pyarrow.parquet.read_table(out_dir)

print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")

# Output:
#> ['b=2.0']
#> Wrote out 2 rows, read back 1 row
{code}

It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].

--
This message was sent by Atlassian Jira (v8.3.4#803005)
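
The suspected cause can be checked in isolation with pandas alone. The following is a minimal sketch, assuming pandas is installed; the frame mirrors the two-row example from the report. The stringified-key step is only a hypothetical illustration of how the missing group could be kept; it is not something pyarrow does.

```python
import pandas as pd

# Same two rows as the report: one real key and one missing key.
df = pd.DataFrame({"a": [1, 2], "b": [2.0, float("nan")]})

# With default settings, groupby skips rows whose key is NaN,
# so only the b == 2.0 group is ever produced.
default_keys = [key for key, _ in df.groupby("b")]
rows_kept = sum(len(group) for _, group in df.groupby("b"))

# Hypothetical workaround (an assumption, not pyarrow behaviour):
# turn the key into a string so the missing value becomes the
# literal label "nan" and forms its own group.
labelled = df.assign(b=df["b"].astype(str))
labelled_keys = sorted(key for key, _ in labelled.groupby("b"))
labelled_rows = sum(len(group) for _, group in labelled.groupby("b"))
```

Here {{rows_kept}} counts one row fewer than the frame holds, which matches the "wrote 2, read back 1" symptom, while {{labelled_rows}} accounts for both rows.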