Hi Li,

It's correct that Arrow uses None for null values when converting a string array to numpy/pandas. As far as I am aware, there is currently no option to control that (i.e. to make it use np.nan instead), and I am not sure there would be much interest in adding one.
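As a pandas-side workaround, the None objects can be converted back to np.nan after the fact. A minimal sketch (pandas and numpy only; `back` stands in for what `pa.Array.from_pandas(s).to_pandas()` returns, where nulls come back as None):

```python
import numpy as np
import pandas as pd

s = pd.Series(["some_string", np.nan], dtype=object)
# Simulated to_pandas() result: the null slot holds None instead of np.nan.
back = pd.Series(["some_string", None], dtype=object)

# pandas treats None and np.nan identically as missing in object columns,
# so the missing-value masks agree even though the objects differ:
assert back.isna().equals(s.isna())

# If np.nan objects are specifically wanted, re-fill via the mask:
restored = back.where(back.notna(), np.nan)
assert np.isnan(restored[1])
```

So while the objects differ, anything that goes through the missing-value machinery (isna, fillna, dropna, joins) behaves the same either way.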
Now, I know this doesn't give an exact roundtrip in this case, but pandas does treat both np.nan and None as missing values in object dtype columns, so behaviour-wise this shouldn't make any difference, and the roundtrip is still faithful in that respect.

Best,
Joris

On Tue, 8 Jun 2021 at 21:59, Li Jin <ice.xell...@gmail.com> wrote:
>
> Hello!
>
> Apologies if this has been brought up before. I'd like to get the devs'
> thoughts on this potential inconsistency in "what are the Python objects
> for null values" between pandas and pyarrow, demonstrated with the
> following example:
>
> (1) pandas seems to use np.NaN to represent a missing value (with pandas
> 1.2.4):
>
> In [32]: df
> Out[32]:
>            value
> key
> 1    some_string
>
> In [33]: df2
> Out[33]:
>                  value2
> key
> 2     some_other_string
>
> In [34]: df.join(df2)
> Out[34]:
>            value value2
> key
> 1    some_string    NaN
>
> (2) pyarrow seems to use None to represent a missing value (4.0.1):
>
> >>> s = pd.Series(["some_string", np.NaN])
> >>> s
> 0    some_string
> 1            NaN
> dtype: object
> >>> pa.Array.from_pandas(s).to_pandas()
> 0    some_string
> 1           None
> dtype: object
>
> I have looked around the pyarrow docs and didn't find an option to use
> np.NaN for null values with to_pandas, so it's a bit hard to get
> round-trip consistency.
>
> I'd appreciate any thoughts on how to achieve consistency here.
>
> Thanks!
> Li