[ https://issues.apache.org/jira/browse/ARROW-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-7345:
-----------------------------------------
    Labels: parquet  (was: )

> [Python] Writing partitions with NaNs silently drops data
> ---------------------------------------------------------
>
>                 Key: ARROW-7345
>                 URL: https://issues.apache.org/jira/browse/ARROW-7345
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>              Labels: parquet
>
> When writing a partitioned table, if the partitioning column has NA values,
> they're silently dropped. I think it would be helpful if there was a warning.
> Even better, from my perspective, would be writing out those partitions with
> a directory name like {{partition_col=NaN}}.
> Here's a small example where only the {{b = 2}} group is written out and the
> {{b = NaN}} group is dropped.
> {code:python}
> import os
> import tempfile
> import pyarrow.json
> import pyarrow.parquet
> from pathlib import Path
>
> # Create a dataset with NaN:
> json_str = """
> {"a": 1, "b": 2}
> {"a": 2, "b": null}
> """
> with tempfile.NamedTemporaryFile() as tf:
>     tf = Path(tf.name)
>     tf.write_text(json_str)
>     table = pyarrow.json.read_json(tf)
>
> # Write out a partitioned dataset, using the NaN-containing column
> with tempfile.TemporaryDirectory() as out_dir:
>     pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
>     print(os.listdir(out_dir))
>     read_table = pyarrow.parquet.read_table(out_dir)
>
> print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
> # Output:
> #> ['b=2.0']
> #> Wrote out 2 rows, read back 1 row
> {code}
>
> It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}}
> here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
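
To make the requested behaviour in ARROW-7345 concrete, here is a minimal pure-Python sketch of grouping rows by a partition column. It is not pyarrow's or pandas' implementation. The helper names (is_missing, group_rows_by_partition), the keep_missing flag, and the NULL_LABEL constant are hypothetical and exist only for illustration. With keep_missing=False the sketch mimics the silent drop described in the report; with keep_missing=True it keeps those rows under a directory name like b=NaN, as the reporter suggests.

```python
import math
from collections import defaultdict

# Assumption: "NaN" is the label proposed in the report, not a pyarrow constant.
NULL_LABEL = "NaN"


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def group_rows_by_partition(rows, partition_col, keep_missing=True):
    """Map directory names such as 'b=2.0' to the rows that belong in them.

    Hypothetical helper: rows with a missing key are either kept under
    '<col>=NaN' or skipped, depending on keep_missing.
    """
    groups = defaultdict(list)
    for row in rows:
        key = row.get(partition_col)
        if is_missing(key):
            if not keep_missing:
                continue  # mimics the silent drop described in the report
            label = NULL_LABEL
        else:
            label = str(key)
        # The partition value lives in the directory name, so it is left
        # out of the stored row.
        remaining = {k: v for k, v in row.items() if k != partition_col}
        groups[f"{partition_col}={label}"].append(remaining)
    return dict(groups)


rows = [{"a": 1, "b": 2.0}, {"a": 2, "b": None}]
dropped = group_rows_by_partition(rows, "b", keep_missing=False)
kept = group_rows_by_partition(rows, "b", keep_missing=True)

print(sorted(dropped))  # ['b=2.0']
print(sorted(kept))     # ['b=2.0', 'b=NaN']
```

Comparing the total row count across the groups with the input row count is one cheap way to detect the data loss and emit the warning the report asks for.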