To my knowledge, "None" has always been the preferred null sentinel
value for object-dtype arrays in pandas. But because these arrays
sometimes originate from transposes, or from join/append operations
that merge numeric arrays (which use NaN sentinels) into non-numeric
arrays to produce object arrays, we were forced to deal with multiple
possible sentinel values.
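
(As a quick aside, a minimal REPL sketch of how both sentinels can end
up in object-dtype columns; illustrative only, against pandas 1.2-era
behaviour:)

>>> import numpy as np
>>> import pandas as pd
>>> # Built directly, an object column typically carries None as its null
>>> pd.Series(["a", None], dtype=object).tolist()
['a', None]
>>> # A join with a non-matching key fills the hole with NaN instead
>>> left = pd.DataFrame({"value": ["x", "y"]}, index=[1, 2])
>>> right = pd.DataFrame({"value2": ["z"]}, index=[1])
>>> left.join(right)["value2"].tolist()
['z', nan]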

All of this is a bit of an unfortunate artifact of pandas's use of
sentinel values and permissiveness around mixed-type arrays, and one
of the motivations I had for helping build the Arrow project in the
first place: to be the data structure and computing platform that I
wish had existed more than a decade ago.

On Wed, Jun 9, 2021 at 2:29 AM Joris Van den Bossche
<jorisvandenboss...@gmail.com> wrote:
>
> That won't help in this specific case, since it is for an array of
> strings (which you can't fill with NaN), and for floating point
> arrays, we already use np.nan as "null" representation when converting
> to numpy/pandas.
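
(Illustrating that distinction with a minimal pyarrow 4.x sketch:)

>>> import pyarrow as pa
>>> # Float arrays: Arrow nulls already come back to pandas as np.nan
>>> pa.array([1.0, None], type=pa.float64()).to_pandas().tolist()
[1.0, nan]
>>> # String arrays: there is no NaN of string type, so nulls come back as None
>>> pa.array(["a", None], type=pa.string()).to_pandas().tolist()
['a', None]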
>
> On Wed, 9 Jun 2021 at 03:37, Benjamin Kietzman <bengil...@gmail.com> wrote:
> >
> > As a workaround, the "fill_null" compute function can be used to replace
> > nulls with nans:
> >
> > >>> nan = pa.scalar(np.NaN, type=pa.float64())
> > >>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()
> >
> > On Tue, Jun 8, 2021, 16:15 Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > Hi Li,
> > >
> > > It's correct that arrow uses "None" for null values when converting a
> > > string array to numpy / pandas.
> > > As far as I am aware, there is currently no option to control that
> > > (and to make it use np.nan instead), and I am not sure there would be
> > > much interest in adding such an option.
> > >
> > > Now, I know this doesn't give an exact roundtrip in this case, but
> > > pandas does treat both np.nan and None as missing values in object
> > > dtype columns, so behaviour-wise this shouldn't make any difference
> > > and the roundtrip is still faithful in that respect.
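
(Illustrating that point with a minimal sketch:)

>>> import numpy as np
>>> import pandas as pd
>>> # Both sentinels are detected as missing in object-dtype columns
>>> pd.Series(["some_string", None], dtype=object).isna().tolist()
[False, True]
>>> pd.Series(["some_string", np.nan], dtype=object).isna().tolist()
[False, True]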
> > >
> > > Best,
> > > Joris
> > >
> > > On Tue, 8 Jun 2021 at 21:59, Li Jin <ice.xell...@gmail.com> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Apologies if this has been brought up before. I'd like to get the devs'
> > > > thoughts on this potential inconsistency of "what are the Python objects
> > > > for null values" between pandas and pyarrow.
> > > >
> > > > Demonstrated with the following example:
> > > >
> > > > (1) pandas seems to use "np.NaN" to represent a missing value (with
> > > > pandas 1.2.4):
> > > >
> > > > In [32]: df
> > > > Out[32]:
> > > >            value
> > > > key
> > > > 1    some_strign
> > > >
> > > > In [33]: df2
> > > > Out[33]:
> > > >                 value2
> > > > key
> > > > 2    some_other_string
> > > >
> > > > In [34]: df.join(df2)
> > > > Out[34]:
> > > >            value value2
> > > > key
> > > > 1    some_strign    NaN
> > > >
> > > > (2) pyarrow seems to use "None" to represent a missing value (4.0.1)
> > > >
> > > > >>> s = pd.Series(["some_string", np.NaN])
> > > > >>> s
> > > > 0    some_string
> > > > 1            NaN
> > > > dtype: object
> > > > >>> pa.Array.from_pandas(s).to_pandas()
> > > > 0    some_string
> > > > 1           None
> > > > dtype: object
> > > >
> > > >
> > > > I have looked around the pyarrow docs and didn't find an option to use
> > > > np.NaN for null values with to_pandas, so it's a bit hard to get
> > > > round-trip consistency.
> > > >
> > > >
> > > > I'd appreciate any thoughts on how to achieve consistency here.
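
(One possible workaround on the pandas side, continuing the session
above; a sketch only, not something suggested in the thread:)

>>> roundtripped = pa.Array.from_pandas(s).to_pandas()
>>> # Normalize the sentinel back to NaN after conversion
>>> roundtripped.where(roundtripped.notna(), np.NaN).tolist()
['some_string', nan]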
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > Li
> > >
