Hi Li,

You are right that pyarrow uses None for null values when converting a
string array to numpy / pandas.
As far as I am aware, there is currently no option to control that
(and to make it use np.nan instead), and I am not sure there would be
much interest in adding such an option.
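
If you do need np.nan in the result, one possible workaround is to
post-process the Series that to_pandas() returns. This is only a rough
sketch: nulls_to_nan is a hypothetical helper name (it is not part of
pyarrow or pandas), and the input Series below is a hand-built stand-in
for the to_pandas() output.

```python
import numpy as np
import pandas as pd


def nulls_to_nan(series: pd.Series) -> pd.Series:
    """Return a copy of `series` with every missing value set to np.nan.

    Hypothetical helper for illustration; not a pyarrow or pandas API.
    """
    # Series.where keeps the values where the condition is True and
    # substitutes np.nan everywhere else (i.e. for None and NaN).
    return series.where(series.notna(), np.nan)


# Stand-in for the object-dtype Series returned by to_pandas().
from_arrow = pd.Series(["some_string", None], dtype=object)
restored = nulls_to_nan(from_arrow)
```

In practice you would call it as
nulls_to_nan(pa.Array.from_pandas(s).to_pandas()).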

Now, I know this doesn't give an exact roundtrip in this case, but
pandas does treat both np.nan and None as missing values in object
dtype columns, so behaviour-wise this shouldn't make any difference
and the roundtrip is still faithful in that respect.
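
A small check of that claim, as a sketch using only pandas and numpy
(the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

# The same data with the two different missing-value sentinels.
with_nan = pd.Series(["some_string", np.nan], dtype=object)
with_none = pd.Series(["some_string", None], dtype=object)

# Both sentinels are reported as missing.
nan_mask = with_nan.isna()
none_mask = with_none.isna()

# Missing-value-aware operations give the same result for both.
nan_count = with_nan.count()
none_count = with_none.count()
dropped_equal = with_nan.dropna().equals(with_none.dropna())
```

The difference only shows up if you inspect the Python objects
themselves, for example with an identity check such as "x is None".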

Best,
Joris

On Tue, 8 Jun 2021 at 21:59, Li Jin <ice.xell...@gmail.com> wrote:
>
> Hello!
>
> Apologies if this has been brought up before. I'd like to get devs' thoughts
> on this potential inconsistency of "what are the python objects for null
> values" between pandas and pyarrow.
>
> Demonstrated with the following example:
>
> (1) pandas seems to use "np.NaN" to represent a missing value (with pandas
> 1.2.4):
>
> In [32]: df
> Out[32]:
>            value
> key
> 1    some_strign
>
> In [33]: df2
> Out[33]:
>                 value2
> key
> 2    some_other_string
>
> In [34]: df.join(df2)
> Out[34]:
>            value value2
> key
> 1    some_strign    NaN
>
> (2) pyarrow seems to use "None" to represent a missing value (with pyarrow
> 4.0.1):
>
> >>> s = pd.Series(["some_string", np.NaN])
> >>> s
> 0    some_string
> 1            NaN
> dtype: object
> >>> pa.Array.from_pandas(s).to_pandas()
> 0    some_string
> 1           None
> dtype: object
>
> I have looked around the pyarrow docs and didn't find an option to use
> np.NaN for null values with to_pandas, so it's a bit hard to get round-trip
> consistency.
>
>
> I appreciate any thoughts on this as to how to achieve consistency here.
>
>
> Thanks!
>
> Li
