[ https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083010#comment-17083010 ]
Joris Van den Bossche commented on ARROW-2444:
----------------------------------------------

I didn't read the full discussion on the original dask PR, but from what I understand, I am not sure there is any issue left to solve.

At least nowadays, an empty table with string or null types converts to object dtype in pandas in both cases (and not float64):

{code}
In [1]: schema = pa.schema([("ints", pa.int64()), ("floats", pa.float64()), ("strings", pa.string()), ("nulls", pa.null())])

In [2]: schema.empty_table().to_pandas().dtypes
Out[2]:
ints         int64
floats     float64
strings     object
nulls       object
dtype: object
{code}

An empty dataframe also roundtrips through Parquet while preserving the object dtype:

{code}
In [3]: df = pd.DataFrame({"ints": pd.Series([], dtype="int64"), "strings": pd.Series([], dtype=object)})

In [5]: df.dtypes
Out[5]:
ints        int64
strings    object
dtype: object

In [6]: df.to_parquet("test_empty_df.parquet")

In [7]: pd.read_parquet("test_empty_df.parquet").dtypes
Out[7]:
ints        int64
strings    object
dtype: object
{code}

On the pyarrow side, the empty object column in such a parquet file has a null type, and even after removing the pandas metadata, it still converts to object dtype:

{code}
In [9]: pq.read_table("test_empty_df.parquet")
Out[9]:
pyarrow.Table
ints: int64
strings: null

In [13]: pq.read_table("test_empty_df.parquet").replace_schema_metadata().to_pandas().dtypes
Out[13]:
ints        int64
strings    object
dtype: object
{code}

Can anybody who remembers the original issue confirm that this is solved? Or is there still something remaining?
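For readers unfamiliar with the pandas metadata referred to in this thread: pyarrow stores it as a JSON document under the `pandas` key of the Arrow schema metadata, with one entry per column recording its original dtype. The following is only a minimal sketch of how such recorded dtypes could be read back. The helper `recorded_numpy_dtypes` and the sample metadata are hypothetical and hand-written for illustration (real files carry more fields), and this is not how pyarrow itself reconstructs the DataFrame.

```python
import json

# Hypothetical, hand-written sample of schema metadata as a bytes -> bytes
# mapping. Only the fields used below are included; this is an assumption
# made for illustration, not output copied from a real file.
SAMPLE_SCHEMA_METADATA = {
    b"pandas": json.dumps(
        {
            "columns": [
                {"name": "ints", "field_name": "ints",
                 "pandas_type": "int64", "numpy_type": "int64",
                 "metadata": None},
                {"name": "strings", "field_name": "strings",
                 "pandas_type": "empty", "numpy_type": "object",
                 "metadata": None},
            ]
        }
    ).encode("utf8")
}


def recorded_numpy_dtypes(schema_metadata):
    """Return {field name: numpy dtype string} from the pandas metadata.

    Returns an empty dict when no pandas metadata is present, for example
    after the schema metadata has been removed.
    """
    raw = (schema_metadata or {}).get(b"pandas")
    if raw is None:
        return {}
    meta = json.loads(raw.decode("utf8"))
    return {col["field_name"]: col["numpy_type"] for col in meta.get("columns", [])}


dtypes = recorded_numpy_dtypes(SAMPLE_SCHEMA_METADATA)
print(dtypes)
```

With a real file, the same mapping would come from the `metadata` attribute of the table's schema; whether a given file carries the `pandas` key depends on how it was written.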

> [Python][C++] Better handle reading empty parquet files
> -------------------------------------------------------
>
>                 Key: ARROW-2444
>                 URL: https://issues.apache.org/jira/browse/ARROW-2444
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>             Fix For: 1.0.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>
> Currently pyarrow reads empty parts as float64, even if the underlying
> columns have other dtypes. This can cause problems for pandas downstream, as
> certain operations are only valid on certain dtypes, even if the columns are
> empty.
>
> Copying the comment from Uwe over:
>
> bq. {quote}This is the expected behaviour as an empty string column in Pandas
> is simply an empty column of type object. Sadly object does not tell us much
> about the type of the column at all. We return numpy.float64 in this case as
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas
> is treated differently in many operations than an empty float64 column (str
> accessor is available, excluded from numeric operations, etc.). Having an
> empty file read in as a different dtype than was written could lead to errors
> in processing code downstream. Would arrow be willing to change this
> behavior?{quote}
> We should probably use another method than `field.type.to_pandas_dtype()` in
> this case. The column saved in Parquet should be saved with `NA` as type
> which sadly does not provide enough information.
> We also store the original dtype in the Pandas metadata that is used for the
> actual DataFrame reconstruction later on. If we would also pick up the
> metadata when it was written, we should be able to correctly reconstruct the
> dtype.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)