[ 
https://issues.apache.org/jira/browse/ARROW-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-8613.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 7440
[https://github.com/apache/arrow/pull/7440]

> [C++][Dataset] Raise error for unparsable partition value
> ---------------------------------------------------------
>
>                 Key: ARROW-8613
>                 URL: https://issues.apache.org/jira/browse/ARROW-8613
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, when a partitioning schema is specified but one of the partition
> field values cannot be parsed according to the specified type, you silently
> get null values for that partition field.
> Python example:
> {code:python}
> import pathlib
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
>
> path = pathlib.Path(".") / "dataset_partition_schema_errors"
> path.mkdir(exist_ok=True)
>
> # Partition values such as "1_2" cannot be parsed as int64.
> table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
> pq.write_to_dataset(table, str(path), partition_cols=["part"])
> {code}
> {code:java}
> In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas() 
> Out[17]: 
>    values part
> 0       0  1_2
> 1       1  1_2
> 2       2  3_4
> 3       3  3_4
> In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")
> In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()   
> Out[19]: 
>    values  part
> 0       0   NaN
> 1       1   NaN
> 2       2   NaN
> 3       3   NaN
> {code}
> Silently ignoring such a parse error doesn't seem like the best default to me
> (since partition keys are quite essential). I think raising an error might be
> better?
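
The requested behavior can be illustrated with a small plain-Python sketch. This is not Arrow's implementation and makes no claim about what pull request 7440 changed; `strict_int` and `parse_hive_segment` are hypothetical helpers, and the `parsers` dict merely stands in for the partitioning schema. The point is only that an unparsable value in a hive-style `key=value` path segment raises an error instead of becoming null.

```python
import re

_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def strict_int(raw):
    # Python's int() accepts "1_2" as 12, so validate the text first.
    if not _INT_PATTERN.fullmatch(raw):
        raise ValueError(f"not an integer: {raw!r}")
    return int(raw)


def parse_hive_segment(segment, parsers):
    """Parse one 'key=value' path segment with the parser declared for key."""
    key, sep, raw = segment.partition("=")
    if not sep or key not in parsers:
        return None  # not a partition field covered by the schema
    try:
        return key, parsers[key](raw)
    except ValueError as exc:
        # Surface the failure instead of silently yielding a null.
        raise ValueError(
            f"could not parse partition value {raw!r} for field {key!r}"
        ) from exc


# Hypothetical stand-in for pa.schema([("part", pa.int64())]).
schema = {"part": strict_int}

ok = parse_hive_segment("part=12", schema)
try:
    parse_hive_segment("part=1_2", schema)
    error = None
except ValueError as exc:
    error = str(exc)
```

With the directory names from the example in the issue (`part=1_2`, `part=3_4`), this sketch raises a `ValueError` naming the field and the offending value, whereas a well-formed segment such as `part=12` is parsed to an integer.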



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
