Adding to Aldrin's very informative answer: the pyarrow.compute.is_null
function (
https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
returns a boolean array that is true where the input is null. Converted to
numpy, it can be used as the mask for numpy.ma.MaskedArray.

On Tue, May 2, 2023, 18:26 Aldrin <octalene....@pm.me> wrote:

> I think per [1] and [2], because your data has null values, there is no
> good and supported approach to a zero-copy conversion to pandas or numpy.
> So, I think [3] to drop nulls, then use to_numpy() is the path of least
> resistance.
>
> If you want to try and do the masked array approach, you need to go from:
> (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as
> appropriate.
>
> For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
> fixed-width primitive data type, the first buffer is the validity bitmap. I
> am not sure that floats are fixed width, but I think they are. I know that
> Decimal types are a binary format.
>
> I think [7] will be helpful to see how the validity bitmap is used in C++.
> I don't know how familiar you are with it, and I'm not sure how far down the
> rabbit hole you'd have to go to use the validity bitmap from Python.
>
>
> [1]:
> https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> [3]:
> https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> [4]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> [5]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> [6]:
> https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> [7]:
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
>
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
>
> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <swnel...@uw.edu>
> wrote:
>
> What's the right way to convert Arrow arrays to numpy arrays in the
> presence of nulls?
>
> The first thing I reach for is array.to_numpy(zero_copy_only=False). But
> this has some behaviors that I found a little undesirable.
>
> For numeric data (or at least int64 and float64), nulls are converted to
> floating point NaNs and the resulting numpy array is recast from integer to
> floating point. For example:
>
> >>> a = pa.array([1, 2, 3, None, 5])
> >>> a
> <pyarrow.lib.Int64Array object at 0x111b970a0>
> [
> 1,
> 2,
> 3,
> null,
> 5
> ]
> >>> a.to_numpy(False)
> array([ 1., 2., 3., nan, 5.])
>
> This can be problematic: *actual* floating point NaNs are mixed with
> nulls, which is lossy:
>
> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> array([ 1., 2., nan, nan])
>
> Boolean arrays get converted into 'object'-dtyped numpy arrays, with
> 'True', 'False', and 'None', which is a little undesirable as well.
>
> One tool in numpy for dealing with nullable data is masked arrays (
> https://numpy.org/doc/stable/reference/maskedarray.html) which work
> somewhat like Arrow arrays' validity bitmap. I was thinking of writing some
> code that generates a numpy masked array from an arrow array, but I'd need
> to get the validity bitmap itself, and it doesn't seem to be accessible in
> any pyarrow APIs. Am I missing it?
>
> Or, am I thinking about this wrong, and there's some other way to pull
> nullable data out of arrow and into numpy?
>
> Thanks,
> Spencer
>
>
>
>
