What's the right way to convert Arrow arrays to numpy arrays in the
presence of nulls?

The first thing I reach for is array.to_numpy(zero_copy_only=False). But
this has some behaviors that I find a little undesirable.

For numeric data (or at least int64 and float64), nulls are converted to
floating-point NaNs, and integer arrays are recast to floating point in
the process. For example:

>>> a = pa.array([1, 2, 3, None, 5])
>>> a
<pyarrow.lib.Int64Array object at 0x111b970a0>
[
  1,
  2,
  3,
  null,
  5
]
>>> a.to_numpy(False)
array([ 1.,  2.,  3., nan,  5.])

This can be problematic: *actual* floating-point NaNs become
indistinguishable from nulls, which is lossy:

>>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
array([ 1.,  2., nan, nan])

Boolean arrays get converted into object-dtyped numpy arrays containing
the Python objects True, False, and None, which is a little undesirable as
well.
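For example, I see something along these lines:

>>> pa.array([True, False, None]).to_numpy(False)
array([True, False, None], dtype=object)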

One tool in numpy for dealing with nullable data is masked arrays (
https://numpy.org/doc/stable/reference/maskedarray.html), whose mask works
somewhat like Arrow arrays' validity bitmaps. I was thinking of writing
some code that generates a numpy masked array from an Arrow array, roughly
as sketched below, but I'd need to get at the validity bitmap itself, and
it doesn't seem to be accessible through any pyarrow APIs. Am I missing
it?
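Here's a rough sketch of the shape I have in mind, with the mask written
out by hand for now (and using fill_null to keep the integer dtype, on the
assumption that's a reasonable way to dodge the NaN recast above); the
real mask would come from the validity bitmap:

import numpy as np
import pyarrow as pa

a = pa.array([1, 2, 3, None, 5])
# Hand-written stand-in mask; ideally this would be derived from a's
# validity bitmap instead.
mask = np.array([False, False, False, True, False])
# Fill nulls with a placeholder so to_numpy keeps int64 instead of
# recasting to float64 with NaNs.
data = a.fill_null(0).to_numpy(zero_copy_only=False)
masked = np.ma.masked_array(data, mask=mask)
# -> masked data: [1, 2, 3, --, 5]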

Or, am I thinking about this wrong, and there's some other way to pull
nullable data out of Arrow and into numpy?

Thanks,
Spencer
