Semantically, a NaN is defined by the IEEE 754 standard for floating-point numbers, while a null represents a value that is missing, undefined, or unknown.
An important problem that Arrow solves is having a native representation for null values, independent of NaNs: Arrow's in-memory model is designed from the ground up to support nulls. Other in-memory representations sometimes use NaN or other sentinel values to represent nulls, which can break memory alignments useful in compute. In Arrow, each value of a floating-point array is either "null" or "non-null". When non-null, it can be any valid value for the corresponding type; for floats, that means any valid floating-point number, including NaN, inf, -0.0, 0.0, etc.

Best,
Jorge

On Tue, Jun 8, 2021 at 9:59 PM Li Jin <ice.xell...@gmail.com> wrote:
> Hello!
>
> Apologies if this has been brought up before. I'd like to get devs' thoughts
> on this potential inconsistency of "what are the Python objects for null
> values" between pandas and pyarrow.
>
> Demonstrated with the following example:
>
> (1) pandas seems to use np.NaN to represent a missing value (with pandas
> 1.2.4):
>
> In [32]: df
> Out[32]:
>            value
> key
> 1    some_strign
>
> In [33]: df2
> Out[33]:
>                value2
> key
> 2    some_other_string
>
> In [34]: df.join(df2)
> Out[34]:
>            value value2
> key
> 1    some_strign    NaN
>
> (2) pyarrow seems to use None to represent a missing value (4.0.1):
>
> >>> s = pd.Series(["some_string", np.NaN])
> >>> s
> 0    some_string
> 1            NaN
> dtype: object
> >>> pa.Array.from_pandas(s).to_pandas()
> 0    some_string
> 1           None
> dtype: object
>
> I have looked around the pyarrow docs and didn't find an option to use
> np.NaN for null values with to_pandas, so it's a bit hard to get round-trip
> consistency.
>
> I appreciate any thoughts on how to achieve consistency here.
>
> Thanks!
> Li