To my knowledge, "None" has always been the preferred null sentinel value for object-dtype arrays in pandas, but since sometimes these arrays originate from transposes or other join/append operations that merge numeric arrays (which have NaN sentinels) into non-numeric arrays to create object arrays, we were forced to deal with multiple possible sentinel values.
All of this is a bit of an unfortunate artifact of pandas's use of sentinel values and its permissiveness around mixed-type arrays, and one of the motivations I had for helping build the Arrow project in the first place: to be the data structure and computing platform that I wish had existed more than a decade ago.

On Wed, Jun 9, 2021 at 2:29 AM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> That won't help in this specific case, since it is for an array of
> strings (which you can't fill with NaN), and for floating point
> arrays we already use np.nan as the "null" representation when
> converting to numpy/pandas.
>
> On Wed, 9 Jun 2021 at 03:37, Benjamin Kietzman <bengil...@gmail.com> wrote:
> >
> > As a workaround, the "fill_null" compute function can be used to
> > replace nulls with NaNs:
> >
> > >>> nan = pa.scalar(np.NaN, type=pa.float64())
> > >>> pa.Array.from_pandas(s).fill_null(nan).to_pandas()
> >
> > On Tue, Jun 8, 2021, 16:15 Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> > >
> > > Hi Li,
> > >
> > > It's correct that Arrow uses "None" for null values when converting a
> > > string array to numpy / pandas.
> > > As far as I am aware, there is currently no option to control that
> > > (and to make it use np.nan instead), and I am not sure there would be
> > > much interest in adding such an option.
> > >
> > > Now, I know this doesn't give an exact roundtrip in this case, but
> > > pandas does treat both np.nan and None as missing values in object
> > > dtype columns, so behaviour-wise this shouldn't give any difference
> > > and the roundtrip is still faithful on that aspect.
> > >
> > > Best,
> > > Joris
> > >
> > > On Tue, 8 Jun 2021 at 21:59, Li Jin <ice.xell...@gmail.com> wrote:
> > > >
> > > > Hello!
> > > >
> > > > Apologies if this has been brought up before. I'd like to get devs'
> > > > thoughts on this potential inconsistency of "what are the Python
> > > > objects for null values" between pandas and pyarrow.
> > > >
> > > > Demonstrated with the following example:
> > > >
> > > > (1) pandas seems to use np.NaN to represent a missing value (with
> > > > pandas 1.2.4):
> > > >
> > > > In [32]: df
> > > > Out[32]:
> > > >            value
> > > > key
> > > > 1    some_strign
> > > >
> > > > In [33]: df2
> > > > Out[33]:
> > > >                 value2
> > > > key
> > > > 2    some_other_string
> > > >
> > > > In [34]: df.join(df2)
> > > > Out[34]:
> > > >            value value2
> > > > key
> > > > 1    some_strign    NaN
> > > >
> > > > (2) pyarrow seems to use None to represent a missing value (4.0.1):
> > > >
> > > > >>> s = pd.Series(["some_string", np.NaN])
> > > > >>> s
> > > > 0    some_string
> > > > 1            NaN
> > > > dtype: object
> > > > >>> pa.Array.from_pandas(s).to_pandas()
> > > > 0    some_string
> > > > 1           None
> > > > dtype: object
> > > >
> > > > I have looked around the pyarrow doc and didn't find an option to use
> > > > np.NaN for null values with to_pandas, so it's a bit hard to get
> > > > round-trip consistency.
> > > >
> > > > I appreciate any thoughts on how to achieve consistency here.
> > > >
> > > > Thanks!
> > > > Li
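As a quick illustration of Joris's point above (a minimal sketch, assuming pandas 1.x): pandas treats both np.nan and None as missing values in object-dtype columns, so the round trip stays faithful with respect to missingness even though the sentinel object changes:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["some_string", np.nan]).isna().tolist()   # NaN sentinel is seen as missing
[False, True]
>>> pd.Series(["some_string", None]).isna().tolist()     # None sentinel is seen as missing
[False, True]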