Semantically, a NaN is defined by the IEEE 754 standard for floating-point numbers, while a null represents a value that is missing, undefined, or unknown.
An important problem that Arrow solves is having a native representation for null values, independent of NaNs: Arrow's in-memory model is designed from the ground up to support nulls. Other in-memory representations sometimes use NaN or other sentinel values to represent nulls, which can break memory alignments useful in compute. In Arrow, each value of a floating-point array is either "null" or "non-null". When non-null, it can be any valid value for the corresponding type; for floats, that means any valid floating-point number, including NaN, inf, -0.0, 0.0, etc.

Best,
Jorge

On Tue, Jun 8, 2021 at 9:59 PM Li Jin <ice.xell...@gmail.com> wrote:
> Hello!
>
> Apologies if this has been brought up before. I'd like to get devs' thoughts
> on this potential inconsistency of "what are the Python objects for null
> values" between pandas and pyarrow.
>
> Demonstrated with the following example:
>
> (1) pandas seems to use np.NaN to represent a missing value (with pandas
> 1.2.4):
>
> In [32]: df
> Out[32]:
>            value
> key
> 1    some_strign
>
> In [33]: df2
> Out[33]:
>                value2
> key
> 2    some_other_string
>
> In [34]: df.join(df2)
> Out[34]:
>            value value2
> key
> 1    some_strign    NaN
>
> (2) pyarrow seems to use None to represent a missing value (4.0.1):
>
> >>> s = pd.Series(["some_string", np.NaN])
> >>> s
> 0    some_string
> 1            NaN
> dtype: object
> >>> pa.Array.from_pandas(s).to_pandas()
> 0    some_string
> 1           None
> dtype: object
>
> I have looked around the pyarrow docs and didn't find an option to use
> np.NaN for null values with to_pandas, so it's a bit hard to get round-trip
> consistency.
>
> I appreciate any thoughts on how to achieve consistency here.
>
> Thanks!
> Li