[ https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-2444:
-----------------------------------------
    Labels: dataset dataset-parquet-read parquet  (was: dataset parquet)

> [Python][C++] Better handle reading empty parquet files
> -------------------------------------------------------
>
>                 Key: ARROW-2444
>                 URL: https://issues.apache.org/jira/browse/ARROW-2444
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>             Fix For: 1.0.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>
> Currently pyarrow reads empty parts as float64, even if the underlying
> columns have other dtypes. This can cause problems for pandas downstream, as
> certain operations are only valid on certain dtypes, even if the columns are
> empty.
>
> Copying the comment Uwe over:
>
> bq. {quote}This is the expected behaviour as an empty string column in Pandas
> is simply an empty column of type object. Sadly object does not tell us much
> about the type of the column at all. We return numpy.float64 in this case as
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas
> is treated differently in many operations than an empty float64 column (str
> accessor is available, excluded from numeric operations, etc..). Having an
> empty file read in as a different dtype than was written could lead to errors
> in processing code downstream. Would arrow be willing to change this
> behavior?{quote}
> We should probably use another method than `field.type.to_pandas_dtype()` in
> this case. The column saved in Parquet should be saved with `NA` as type
> which sadly does not provide enough information.
> We also store the original dtype in the Pandas metadata that is used for the
> actual DataFrame reconstruction later on. If we would also pick up the
> metadata when it was written, we should be able to correctly reconstruct the
> dtype.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
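
The last suggestion in the issue, taking the dtype from the stored pandas metadata rather than from the Arrow field type, might look roughly like the sketch below. It is not pyarrow's implementation. The metadata blob is a hand-written stand-in for the JSON that is stored under the `pandas` key of the schema metadata, the column names are made up, and `dtypes_from_pandas_metadata` and `resolve_dtype` are hypothetical helpers introduced only for illustration.

```python
import json

# Hand-written stand-in for the JSON stored under the b"pandas" schema
# metadata key. Column names and values are illustrative assumptions.
pandas_metadata = json.dumps({
    "columns": [
        {"name": "label", "pandas_type": "unicode", "numpy_type": "object"},
        {"name": "count", "pandas_type": "int64", "numpy_type": "int64"},
    ]
}).encode("utf-8")


def dtypes_from_pandas_metadata(raw):
    """Return {column name: recorded numpy dtype string} (hypothetical helper)."""
    if raw is None:
        # File was written without pandas metadata: nothing to recover.
        return {}
    columns = json.loads(raw.decode("utf-8")).get("columns", [])
    return {col["name"]: col["numpy_type"] for col in columns if col.get("numpy_type")}


def resolve_dtype(column, arrow_default, recorded):
    """Prefer the dtype recorded at write time, else fall back to the Arrow-derived one."""
    return recorded.get(column, arrow_default)


recorded = dtypes_from_pandas_metadata(pandas_metadata)

# "float64" stands for what an empty, all-null column currently maps to.
for name in ("label", "count", "unrecorded"):
    print(name, resolve_dtype(name, "float64", recorded))
```

With this lookup an empty `label` column would come back as `object`, so the `str` accessor stays usable downstream, while a column absent from the metadata keeps the current float64 fallback.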