[jira] [Updated] (ARROW-2444) [Python][C++] Better handle reading empty parquet files

Joris Van den Bossche (Jira) Thu, 12 Mar 2020 09:48:33 -0700


     [ https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche updated ARROW-2444:
-----------------------------------------
    Labels: dataset dataset-parquet-read parquet  (was: dataset parquet)

> [Python][C++] Better handle reading empty parquet files
> -------------------------------------------------------
>
>                 Key: ARROW-2444
>                 URL: https://issues.apache.org/jira/browse/ARROW-2444
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>             Fix For: 1.0.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>  
> Currently pyarrow reads empty parts as float64, even if the underlying 
> columns have other dtypes. This can cause problems for pandas downstream, as 
> certain operations are only valid on certain dtypes, even if the columns are 
> empty.
>  
> Copying the comment from Uwe over:
>  
> {quote}This is the expected behaviour as an empty string column in Pandas 
> is simply an empty column of type object. Sadly object does not tell us much 
> about the type of the column at all. We return numpy.float64 in this case as 
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas 
> is treated differently in many operations than an empty float64 column (str 
> accessor is available, excluded from numeric operations, etc.). Having an 
> empty file read in as a different dtype than was written could lead to errors 
> in processing code downstream. Would arrow be willing to change this 
> behavior?{quote}
> We should probably use a method other than `field.type.to_pandas_dtype()` in 
> this case. The column is presumably saved in Parquet with `NA` as its type, 
> which sadly does not provide enough information. 
> We also store the original dtype in the pandas metadata that is used for the 
> actual DataFrame reconstruction later on. If we also picked up that metadata 
> whenever it was written, we should be able to reconstruct the dtype 
> correctly.
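
As a rough illustration of the last point, the sketch below recovers the original dtypes from a pandas metadata blob using only the standard library. It assumes the blob is the JSON document stored under the `pandas` key of the schema metadata, with a `columns` list whose entries carry `name` and `numpy_type`. The helper name `dtypes_from_pandas_metadata` and the sample blob are hypothetical and hand-written for this example; this is not the fix that was implemented for this issue.

```python
import json


def dtypes_from_pandas_metadata(raw):
    """Map column name -> originally written numpy dtype string.

    `raw` is assumed to be the JSON bytes stored under the b"pandas"
    key of the schema metadata (hypothetical helper, illustration only).
    """
    meta = json.loads(raw.decode("utf-8"))
    return {
        col["name"]: col["numpy_type"]
        for col in meta.get("columns", [])
        if col.get("name") is not None  # skip unnamed (index) entries
    }


# Hand-written sample blob; the field values are assumptions for this sketch.
SAMPLE = json.dumps({
    "index_columns": [],
    "columns": [
        # An empty string column: the written numpy dtype is still "object".
        {"name": "name", "field_name": "name", "pandas_type": "empty",
         "numpy_type": "object", "metadata": None},
        {"name": "count", "field_name": "count", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
}).encode("utf-8")

original_dtypes = dtypes_from_pandas_metadata(SAMPLE)
print(original_dtypes)
```

With pyarrow, the raw bytes would typically come from `schema.metadata[b"pandas"]` when that key is present. A reader could then cast an empty column to the recorded dtype instead of falling back to float64.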



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
