[ https://issues.apache.org/jira/browse/ARROW-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman resolved ARROW-8613.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 7440
[https://github.com/apache/arrow/pull/7440]

> [C++][Dataset] Raise error for unparsable partition value
> ---------------------------------------------------------
>
>                 Key: ARROW-8613
>                 URL: https://issues.apache.org/jira/browse/ARROW-8613
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, when you specify a partitioning schema but one of the partition
> field values cannot be parsed according to the specified type, you silently
> get null values for that partition field.
> Python example:
> {code:python}
> import pathlib
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
>
> path = pathlib.Path(".") / "dataset_partition_schema_errors"
> path.mkdir(exist_ok=True)
>
> table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
>
> pq.write_to_dataset(table, str(path), partition_cols=["part"])
> {code}
> {code:java}
> In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas()
> Out[17]:
>    values part
> 0       0  1_2
> 1       1  1_2
> 2       2  3_4
> 3       3  3_4
>
> In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]),
> flavor="hive")
>
> In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()
> Out[19]:
>    values  part
> 0       0   NaN
> 1       1   NaN
> 2       2   NaN
> 3       3   NaN
> {code}
> Silently ignoring such a parse error doesn't seem like the best default to me
> (since partition keys are quite essential). I think raising an error might be
> better?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
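
The behaviour requested in the issue (fail loudly instead of producing nulls) can be illustrated with a minimal plain-Python sketch. This is not the code from pull request 7440 and it does not use the Arrow API; the names `PartitionParseError`, `strict_int`, and `parse_partition_segment` are hypothetical and exist only for this illustration. It assumes hive-style `key=value` path segments and a mapping from field name to a converter function.

```python
import re
from typing import Callable, Dict, Tuple


class PartitionParseError(ValueError):
    """Hypothetical error: a partition value does not match its declared type."""


def strict_int(raw: str) -> int:
    # Python's built-in int() accepts underscores ("1_2" parses as 12),
    # so check explicitly for an optional sign followed by ASCII digits.
    if not re.fullmatch(r"[+-]?[0-9]+", raw):
        raise ValueError(f"{raw!r} is not a plain integer")
    return int(raw)


def parse_partition_segment(
    segment: str, schema: Dict[str, Callable[[str], object]]
) -> Tuple[str, object]:
    """Parse one 'key=value' segment with the converter declared for key."""
    key, sep, raw = segment.partition("=")
    if not sep or key not in schema:
        # Choice made for this sketch: unknown or malformed segments are errors too.
        raise PartitionParseError(f"unexpected partition segment {segment!r}")
    try:
        return key, schema[key](raw)
    except ValueError as exc:
        # Raise instead of silently substituting a null value.
        raise PartitionParseError(
            f"could not parse {raw!r} for partition field {key!r}"
        ) from exc


schema = {"part": strict_int}
print(parse_partition_segment("part=12", schema))
try:
    parse_partition_segment("part=1_2", schema)
except PartitionParseError as err:
    print("error:", err)
```

With the directory names from the example above (`part=1_2`, `part=3_4`), every segment would raise under this sketch rather than yield a null `part` column.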