[ 
https://issues.apache.org/jira/browse/ARROW-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536421#comment-16536421
 ] 

Wes McKinney commented on ARROW-2806:
-------------------------------------

In the context of basically any database-type system (PostgreSQL, Spark SQL, 
etc.), NaN is just another floating point value. It just so happens that 
pandas uses NaN as a null sentinel. When {{from_pandas=True}} is passed, we 
should treat NaN as null, because those are the pandas semantics. 
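
To make the distinction concrete, here is a minimal sketch of the two semantics in plain Python. It is only a model of the behaviour described above, not the pyarrow implementation: the helper {{convert_floats}} and its (data, validity) return shape are hypothetical names introduced for illustration, and only the {{from_pandas}} flag comes from this discussion.

```python
import math
from typing import List, Optional, Sequence, Tuple


def convert_floats(
    values: Sequence[Optional[float]], from_pandas: bool = False
) -> Tuple[List[float], List[bool]]:
    """Hypothetical model of float conversion with a validity mask.

    None is always null. NaN is null only when from_pandas=True;
    otherwise it is kept as an ordinary floating point value.
    """
    data: List[float] = []
    validity: List[bool] = []
    for v in values:
        is_null = v is None or (from_pandas and math.isnan(v))
        # Null slots get a placeholder value; the mask carries the nullness.
        data.append(0.0 if is_null else float(v))
        validity.append(not is_null)
    return data, validity


nan = float("nan")

# Database-style semantics: NaN is a valid value, not a null.
_, db_validity = convert_floats([1.0, nan])

# pandas semantics: NaN is the null sentinel.
_, pandas_validity = convert_floats([1.0, nan], from_pandas=True)

print(db_validity)      # [True, True]
print(pandas_validity)  # [True, False]
```

Under this model the input type (list or numpy array) plays no role; only the flag decides whether NaN becomes null.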

> [Python] Inconsistent handling of np.nan
> ----------------------------------------
>
>                 Key: ARROW-2806
>                 URL: https://issues.apache.org/jira/browse/ARROW-2806
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Uwe L. Korn
>            Priority: Major
>             Fix For: 0.10.0
>
>
> Currently we handle {{np.nan}} differently between having a list or a numpy 
> array as an input to {{pa.array()}}:
> {code}
> >>> pa.array(np.array([1, np.nan]))
> <pyarrow.lib.DoubleArray object at 0x11680bea8>
> [
>   1.0,
>   nan
> ]
> >>> pa.array([1., np.nan])
> Out[9]:
> <pyarrow.lib.DoubleArray object at 0x10bdacbd8>
> [
>   1.0,
>   NA
> ]
> {code}
> I would actually think the last one is the correct one, especially once one 
> casts this to an integer column: there the first one produces a column with 
> INT_MIN and the second one produces a real null.
> But, in {{test_array_conversions_no_sentinel_values}} we check that 
> {{np.nan}} does not produce a Null.
> Even weirder: 
> {code}
> >>> df = pd.DataFrame({'a': [1., None]})
> >>> df
>      a
> 0  1.0
> 1  NaN
> >>> pa.Table.from_pandas(df).column(0)
> <Column name='a' type=DataType(double)>
> chunk 0: <pyarrow.lib.DoubleArray object at 0x104bbf958>
> [
>   1.0,
>   NA
> ]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
