[jira] [Commented] (ARROW-2444) [Python][C++] Better handle reading empty parquet files

Joris Van den Bossche (Jira) Tue, 14 Apr 2020 02:06:52 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083010#comment-17083010
 ]


Joris Van den Bossche commented on ARROW-2444:
----------------------------------------------

I didn't read the full discussion on the original dask PR, but from what I 
understand, I am not sure there is any issue left to solve.

At least nowadays, string and null columns of an empty table both convert to 
object dtype in pandas (and not float64):

{code}
In [1]: schema = pa.schema([("ints", pa.int64()), ("floats", pa.float64()), ("strings", pa.string()), ("nulls", pa.null())])

In [2]: schema.empty_table().to_pandas().dtypes
Out[2]: 
ints         int64
floats     float64
strings     object
nulls       object
dtype: object
{code}

Also an empty dataframe roundtrips preserving the object dtype:

{code}
In [3]: df = pd.DataFrame({"ints": pd.Series([], dtype="int64"), "strings": pd.Series([], dtype=object)})

In [5]: df.dtypes
Out[5]: 
ints        int64
strings    object
dtype: object

In [6]: df.to_parquet("test_empty_df.parquet")

In [7]: pd.read_parquet("test_empty_df.parquet").dtypes
Out[7]: 
ints        int64
strings    object
dtype: object
{code}

On the pyarrow side, the "strings" column of such a parquet file has the null 
type, and even after removing the pandas metadata it still converts to object 
dtype:

{code}
In [9]: pq.read_table("test_empty_df.parquet")
Out[9]: 
pyarrow.Table
ints: int64
strings: null

In [13]: pq.read_table("test_empty_df.parquet").replace_schema_metadata().to_pandas().dtypes
Out[13]: 
ints        int64
strings    object
dtype: object
{code}

Can anybody who remembers the original issue confirm that this is solved? Or 
is there still something remaining?

> [Python][C++] Better handle reading empty parquet files
> -------------------------------------------------------
>
>                 Key: ARROW-2444
>                 URL: https://issues.apache.org/jira/browse/ARROW-2444
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>             Fix For: 1.0.0
>
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>  
> Currently pyarrow reads empty parts as float64, even if the underlying 
> columns have other dtypes. This can cause problems for pandas downstream, as 
> certain operations are only valid on certain dtypes, even if the columns are 
> empty.
>  
> Copying Uwe's comment over:
>  
> bq. {quote}This is the expected behaviour as an empty string column in Pandas 
> is simply an empty column of type object. Sadly object does not tell us much 
> about the type of the column at all. We return numpy.float64 in this case as 
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas 
> is treated differently in many operations than an empty float64 column (str 
> accessor is available, excluded from numeric operations, etc..). Having an 
> empty file read in as a different dtype than was written could lead to errors 
> in processing code downstream. Would arrow be willing to change this 
> behavior?{quote}
> We should probably use another method than `field.type.to_pandas_dtype()` in 
> this case. The column saved in Parquet should be saved with `NA` as type 
> which sadly does not provide enough information. 
> We also store the original dtype in the Pandas metadata that is used for the 
> actual DataFrame reconstruction later on. If we would also pick up the 
> metadata when it was written, we should be able to correctly reconstruct the 
> dtype.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
