Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Benjamin Kietzman Tue, 08 Jun 2021 18:37:27 -0700

As a workaround, the "fill_null" compute function can be used to replace
nulls with nans:


>>> nan = pa.scalar(np.NaN, type=pa.float64())
>>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()

On Tue, Jun 8, 2021, 16:15 Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Li,
>
> It's correct that arrow uses "None" for null values when converting a
> string array to numpy / pandas.
> As far as I am aware, there is currently no option to control that
> (and to make it use np.nan instead), and I am not sure there would be
> much interest in adding such an option.
>
> Now, I know this doesn't give an exact roundtrip in this case, but
> pandas does treat both np.nan and None as missing values in object
> dtype columns, so behaviour-wise this shouldn't give any difference
> and the roundtrip is still faithful on that aspect.
>
> Best,
> Joris
>
> On Tue, 8 Jun 2021 at 21:59, Li Jin <ice.xell...@gmail.com> wrote:
> >
> > Hello!
> >
> > Apologies if this has been brought before. I'd like to get devs' thoughts
> > on this potential inconsistency of "what are the python objects for null
> > values" between pandas and pyarrow.
> >
> > Demonstrated with the following example:
> >
> > (1)  pandas seems to use "np.NaN" to represent a missing value (with
> pandas
> > 1.2.4):
> >
> > In [*32*]: df
> >
> > Out[*32*]:
> >
> >            value
> >
> > key
> >
> > 1    some_strign
> >
> >
> > In [*33*]: df2
> >
> > Out[*33*]:
> >
> >                 value2
> >
> > key
> >
> > 2    some_other_string
> >
> >
> > In [*34*]: df.join(df2)
> >
> > Out[*34*]:
> >
> >            value value2
> >
> > key
> >
> > 1    some_strign    *NaN*
> >
> >
> >
> > (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
> >
> > >>> s = pd.Series(["some_string", np.NaN])
> >
> > >>> s
> >
> > 0    some_string
> >
> > 1            NaN
> >
> > dtype: object
> >
> > >>> pa.Array.from_pandas(s).to_pandas()
> >
> > 0    some_string
> >
> > 1           None
> >
> > dtype: object
> >
> >
> > I have looked around the pyarrow doc and didn't find an option to use
> > np.NaN for null values with to_pandas so it's a bit hard to get around
> trip
> > consistency.
> >
> >
> > I appreciate any thoughts on this as to how to achieve consistency here.
> >
> >
> > Thanks!
> >
> > Li
>

Re: Representation of "null" values for non-numeric types in Arrow/Pandas interop

Reply via email to