[jira] [Comment Edited] (ARROW-7002) Support pandas nullable integer type Int64

Joris Van den Bossche (Jira) Mon, 28 Oct 2019 03:19:37 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960902#comment-16960902
 ]


Joris Van den Bossche edited comment on ARROW-7002 at 10/28/19 10:18 AM:
-------------------------------------------------------------------------

Writing is already supported with pandas master and the latest Arrow release 
(v0.15), so this is only waiting on the next pandas release to be available in 
a stable version.

{code}
In [1]: from pyarrow import feather
   ...: import pandas as pd
   ...:
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
   ...:
   ...: df = pd.DataFrame({'a': col1, 'b': col2})
   ...:
   ...: feather.write_feather(df, '/tmp/foo')

In [2]: pd.read_feather('/tmp/foo')
Out[2]: 
      a  b
0   0.0  1
1   NaN  3
2   1.0  2
3  23.0  1
{code}

So converting to R should work properly. Reading the file back in with Python 
will still give you a float column if it contained nulls (column a above; 
column b has no nulls and stays integer), as that is the default conversion of 
Arrow integers with nulls to pandas. There is work going on to also preserve 
those specific pandas types in that case (see ARROW-2428).
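
Until those types are preserved on read, one possible workaround is to cast 
the affected columns back after loading. The sketch below is only an 
illustration: restore_nullable_ints is a hypothetical helper (not part of 
pandas or pyarrow), the frame is a hand-built stand-in for the result of 
pd.read_feather shown above, and it assumes the float column holds only whole 
numbers or NaN.

```python
import numpy as np
import pandas as pd


def restore_nullable_ints(df, columns):
    """Cast the given float columns back to the nullable Int64 dtype.

    Hypothetical helper for illustration; assumes each listed column
    contains only whole numbers or NaN.
    """
    out = df.copy()
    for name in columns:
        out[name] = out[name].astype("Int64")
    return out


# Stand-in for the frame returned by pd.read_feather above: column 'a'
# came back as float64 because it contained a null.
read_back = pd.DataFrame({"a": [0.0, np.nan, 1.0, 23.0], "b": [1, 3, 2, 1]})

restored = restore_nullable_ints(read_back, ["a"])
print(restored.dtypes)
```

The caller has to know which columns were Int64 originally, since that 
information is not available from the float column itself.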


was (Author: jorisvandenbossche):
Writing is already supported with pandas master and latest arrow (0.15), so it 
is waiting on the next pandas release to have it in a stable version.

{code}
In [1]: from pyarrow import feather
   ...: import pandas as pd
   ...:
   ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
   ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
   ...:
   ...: df = pd.DataFrame({'a': col1, 'b': col2})
   ...:
   ...: feather.write_feather(df, '/tmp/foo')

In [2]: pd.read_feather('/tmp/foo')
Out[2]: 
      a  b
0   0.0  1
1   NaN  3
2   1.0  2
3  23.0  1
{code}

Reading it back in will still give you a float array (if there were NaNs), as 
that is the default conversion of arrow integer to pandas. There is work going 
on to also preserve those specific pandas types in that case (see ARROW-2428).

> Support pandas nullable integer type Int64
> ------------------------------------------
>
>                 Key: ARROW-7002
>                 URL: https://issues.apache.org/jira/browse/ARROW-7002
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Christian Roth
>            Priority: Major
>
> Pandas has a nullable integer type Int64 which does not seem to be supported 
> by feather yet.
> {code:python}
> from pyarrow import feather
> import pandas as pd
> col1 = pd.Series([0, None, 1, 23]).astype('Int64')
> col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
> df = pd.DataFrame({'a': col1, 'b': col2})
> feather.write_feather(df, '/tmp/foo')
> {code}
> This gives the following error message:
> {code}
> ---------------------------------------------------------------------------
> ArrowTypeError                            Traceback (most recent call last)
> <ipython-input-107-8cc611a30355> in <module>
> ----> 1 feather.write_feather(df, '/tmp/foo')
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write_feather(df, dest)
>     181     writer = FeatherWriter(dest)
>     182     try:
> --> 183         writer.write(df)
>     184     except Exception:
>     185         # Try to make sure the resource is closed
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in 
> write(self, df)
>      92         # TODO(wesm): Remove this length check, see ARROW-1732
>      93         if len(df.columns) > 0:
> ---> 94             table = Table.from_pandas(df, preserve_index=False)
>      95             for i, name in enumerate(table.schema.names):
>      96                 col = table[i]
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
>     551     if nthreads == 1:
>     552         arrays = [convert_column(c, f)
> --> 553                   for c, f in zip(columns_to_convert, convert_fields)]
>     554     else:
>     555         from concurrent import futures
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in <listcomp>(.0)
>     551     if nthreads == 1:
>     552         arrays = [convert_column(c, f)
> --> 553                   for c, f in zip(columns_to_convert, convert_fields)]
>     554     else:
>     555         from concurrent import futures
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in convert_column(col, field)
>     542             e.args += ("Conversion failed for column {0!s} with type 
> {1!s}"
>     543                        .format(col.name, col.dtype),)
> --> 544             raise e
>     545         if not field_nullable and result.null_count > 0:
>     546             raise ValueError("Field {} was non-nullable but pandas 
> column "
> ~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py 
> in convert_column(col, field)
>     536 
>     537         try:
> --> 538             result = pa.array(col, type=type_, from_pandas=True, 
> safe=safe)
>     539         except (pa.ArrowInvalid,
>     540                 pa.ArrowNotImplementedError,
> ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for 
> column a with type Int64')
> {code}
> xref: 
> [https://stackoverflow.com/questions/58571419/exporting-dataframe-with-null-able-int64-from-pandas-to-r]
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)