Joris Van den Bossche created ARROW-8613:
--------------------------------------------

             Summary: [C++][Dataset] Raise error for unparsable partition value
                 Key: ARROW-8613
                 URL: https://issues.apache.org/jira/browse/ARROW-8613
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 1.0.0


Currently, when you specify a partitioning schema and one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.

Python example:

{code:python}
import pathlib
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = pathlib.Path(".") / "dataset_partition_schema_errors"
path.mkdir(exist_ok=True)

# "part" holds strings such as "1_2" that cannot be parsed as int64
table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}

{code:java}
In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas()
Out[17]:
   values part
0       0  1_2
1       1  1_2
2       2  3_4
3       3  3_4

In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")

In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()
Out[19]:
   values  part
0       0   NaN
1       1   NaN
2       2   NaN
3       3   NaN
{code}

Silently ignoring such a parse error doesn't seem like the best default to me (since partition keys are quite essential). I think raising an error might be better?
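
To make the proposed behavior concrete, here is a minimal pure-Python sketch of strict parsing for a single hive-style `key=value` path segment. It is not Arrow's implementation; `PartitionParseError`, `parse_int64` and `parse_hive_segment` are hypothetical names invented for illustration. It only assumes that a value which does not match the declared type should raise instead of becoming null.

{code:python}
import re

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


class PartitionParseError(ValueError):
    """Hypothetical error for a partition value that does not match its declared type."""


def parse_int64(raw):
    # Match plain decimal digits explicitly: Python's int("1_2") accepts
    # underscores and returns 12, which would hide the problem shown above.
    if not re.fullmatch(r"[+-]?[0-9]+", raw):
        raise PartitionParseError(f"cannot parse {raw!r} as int64")
    value = int(raw)
    if not INT64_MIN <= value <= INT64_MAX:
        raise PartitionParseError(f"{raw!r} is out of range for int64")
    return value


def parse_hive_segment(segment, field, parser):
    """Parse one 'field=value' directory name, raising on any mismatch."""
    key, sep, raw = segment.partition("=")
    if sep != "=" or key != field:
        raise PartitionParseError(f"unexpected partition segment {segment!r}")
    return key, parser(raw)
{code}

With this sketch, `parse_hive_segment("part=3", "part", parse_int64)` returns `("part", 3)`, while the directory name `part=1_2` from the example raises `PartitionParseError` rather than yielding a null.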