Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I
was looking for.
Masked arrays for fixed-width primitive types turn out to be reasonably
simple. I can call array.buffers() to get the underlying data buffer, and
use numpy.frombuffer on it. For the fixed-width primitives, it appears that
the memory layout is identical, so this works.
Then I can build the masked array with something like
`np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works
fine.
The whole thing:
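```
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc


def to_masked_array(array):
    # For fixed-width primitives, buffers() is [validity bitmap, data].
    _, data_buf = array.buffers()
    dtype = np.dtype(array.type.to_pandas_dtype())
    # Respect the array's offset and length so that slices work too.
    data = np.frombuffer(data_buf, dtype, count=len(array),
                         offset=array.offset * dtype.itemsize)
    # True marks a null, which matches numpy's "True means masked".
    mask = pc.is_null(array).to_numpy(zero_copy_only=False)
    return np.ma.masked_array(data, mask)
```
"array.type.to_pandas_dtype()" is a bit odd there. There's a
pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other
way. to_pandas_dtype seems to work despite the name, though.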
I don't think this could be made very simple for variable-length primitives
or complex arrow types, but I can live with that.
I believe this whole thing works with zero copy. Is this something I should
contribute back to pyarrow as the default behavior of to_numpy() when
presented with a fixed-width primitive array that has nulls?
On Tue, May 2, 2023 at 5:09 PM Steve Kim <[email protected]> wrote:
> Adding to Aldrin's very informative answer: the pyarrow.compute.is_null
> function (
> https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> returns a boolean array that can be converted to a mask for
> numpy.ma.MaskedArray
>
> On Tue, May 2, 2023, 18:26 Aldrin <[email protected]> wrote:
>
>> I think per [1] and [2], because your data has null values, there is no
>> good and supported approach to a zero-copy conversion to pandas or numpy.
>> So, I think [3] to drop nulls, then use to_numpy() is the path of least
>> resistance.
>>
>> If you want to try and do the masked array approach, you need to go from:
>> (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as
>> appropriate.
>>
>> For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
>> fixed-width primitive data type, the first buffer is the validity bitmap. I
>> am not sure that floats are fixed width, but I think they are. I know that
>> Decimal types are a binary format.
>>
>> I think [7] will be helpful to see how the validity bitmap is used in
>> C++. I'm not sure how familiar you are with C++, or how far down the
>> rabbit hole you'd have to go to use the validity bitmap from Python.
>>
>>
>> [1]:
>> https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
>> [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
>> [3]:
>> https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
>> [4]:
>> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
>> [5]:
>> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
>> [6]:
>> https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
>> [7]:
>> https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
>>
>>
>> # ------------------------------
>> # Aldrin
>>
>> https://github.com/drin/
>> https://gitlab.com/octalene
>>
>> Sent with Proton Mail secure email.
>>
>> ------- Original Message -------
>> On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]>
>> wrote:
>>
>> What's the right way to convert Arrow arrays to numpy arrays in the
>> presence of nulls?
>>
>> The first thing I reach for is array.to_numpy(zero_copy_only=False). But
>> this has some behaviors that I found a little undesirable.
>>
>> For numeric data (or at least int64 and float64), nulls are converted to
>> floating point NaNs and the resulting numpy array is recast from integer to
>> floating point. For example:
>>
>> >>> a = pa.array([1, 2, 3, None, 5])
>> >>> a
>> <pyarrow.lib.Int64Array object at 0x111b970a0>
>> [
>> 1,
>> 2,
>> 3,
>> null,
>> 5
>> ]
>> >>> a.to_numpy(False)
>> array([ 1., 2., 3., nan, 5.])
>>
>> This can be problematic: *actual* floating point NaNs are mixed with
>> nulls, which is lossy:
>>
>> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
>> array([ 1., 2., nan, nan])
>>
>> Boolean arrays get converted into 'object'-dtyped numpy arrays, with
>> 'True', 'False', and 'None', which is a little undesirable as well.
>>
>> One tool in numpy for dealing with nullable data is masked arrays (
>> https://numpy.org/doc/stable/reference/maskedarray.html)
>> which work somewhat like Arrow arrays' validity bitmap. I was thinking of
>> writing some code that generates a numpy masked array from an arrow array,
>> but I'd need to get the validity bitmap itself, and it doesn't seem to be
>> accessible in any pyarrow APIs. Am I missing it?
>>
>> Or, am I thinking about this wrong, and there's some other way to pull
>> nullable data out of arrow and into numpy?
>>
>> Thanks,
>> Spencer
>>
>>
>>
>>