Karl Dunkle Werner created ARROW-7345:
-----------------------------------------

             Summary: [Python] Writing partitions with NaNs silently drops data
                 Key: ARROW-7345
                 URL: https://issues.apache.org/jira/browse/ARROW-7345
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: Karl Dunkle Werner


When writing a partitioned table, if the partitioning column has NA values, 
the rows containing them are silently dropped. I think it would be helpful if 
there were a warning. Even better, from my perspective, would be writing out 
those partitions with a directory name like {{partition_col=NaN}}. 

Here's a small example where only the {{b = 2}} group is written out and the 
{{b = NaN}} group is dropped.

{code:python}
import os
import tempfile
import pyarrow.json
import pyarrow.parquet
from pathlib import Path

# Create a dataset with NaN:
json_str = """
{"a": 1, "b": 2}
{"a": 2, "b": null}
"""
with tempfile.NamedTemporaryFile() as tf:
    # Use a separate name for the path instead of rebinding the file handle
    json_path = Path(tf.name)
    json_path.write_text(json_str)
    table = pyarrow.json.read_json(json_path)

# Write out a partitioned dataset, using the NaN-containing column
with tempfile.TemporaryDirectory() as out_dir:
    pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
    print(os.listdir(out_dir))
    read_table = pyarrow.parquet.read_table(out_dir)
print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")

# Output:
#> ['b=2.0']
#> Wrote out 2 rows, read back 1 row
{code}
 
It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} 
here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
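
For reference, here is a minimal sketch of the pandas groupby behavior in isolation. This is not the pyarrow code itself; the DataFrame and variable names are made up for illustration, and the dropna=False argument is an assumption about the installed pandas, since it is only available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd

# Illustrative data shaped like the example above: one row has a missing key.
df = pd.DataFrame({"a": [1, 2], "b": [2.0, np.nan]})

# Default groupby: rows whose key is NaN are not assigned to any group.
default_keys = [key for key, _ in df.groupby("b")]
default_rows = sum(len(group) for _, group in df.groupby("b"))

# A possible check before writing: count the rows with a null partition key.
n_missing = int(df["b"].isna().sum())

# Assumes pandas >= 1.1: dropna=False keeps the NaN key as its own group.
kept_rows = sum(len(group) for _, group in df.groupby("b", dropna=False))

print(f"default groupby keeps {default_rows} of {len(df)} rows")
print(f"rows with a null partition key: {n_missing}")
print(f"groupby with dropna=False keeps {kept_rows} of {len(df)} rows")
```

A check like n_missing could back the warning suggested above, and a groupby that keeps the NaN group would be one way to get a {{b=NaN}}-style directory, though how that directory should be named and read back is a separate question.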



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
