Re: Python: Array.to_numpy(), nullable data, and masked arrays

Aldrin Tue, 02 May 2023 18:27:57 -0700

orrr maybe you can add both `nan_is_null` and `null_is_nan`?

The compute function `is_null` [1] takes `nan_is_null` as an option: when 
true, NaN values are reported as null (true); when false, they are reported 
as not null (false).

The opposite flag, `null_is_nan`, could be used by the `to_numpy` function to 
return nulls as masked (true) or as unmasked (false).

This would require documentation to specify the resolution order (compute fn 
resolves `nan_is_null` first, then conversion function resolves `null_is_nan` 
second). I think it'd probably be more usable to define a single flag that 
controls both options, but just throwing the possibility out there.

Either way, if you open an issue and submit a PR then the various approaches 
can be discussed there also.

The implementation of the `is_null` compute function in C++ can be found at 
[2], just for future reference (I wanted to check that there isn't any 
repetitive work if it's called from the `to_numpy` function).


[1]: 
https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null

[2]: 
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_validity.cc#LL105C1-L105C1





# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene



------- Original Message -------
On Tuesday, May 2nd, 2023 at 17:52, Aldrin <[email protected]> wrote:


> cool!
> 

> > Is this something I should contribute back to pyarrow...
> 

> probably!
> 

> > ...as the default behavior... when presented with a fixed-width primitive 
> > list that has nulls
> 

> I am not sure about this. I would assume the use of maskedarray can be mostly 
> hidden, so it's probably a good idea, but I would sometimes prefer something 
> like that to be explicit, especially since it has different behavior as you 
> mentioned before (e.g. mixes nulls with NaNs).
> 

> So, my preference would be to contribute it, but somehow using a flag (e.g. 
> 'drop_nulls' or 'use_validity') or something.
> 

> Based on the way `to_numpy` is written ([1]), I think adding a flag and 
> adding a condition after `ConvertArrayToPandas` is called seems like a 
> reasonable approach.
> 

> 

> [1]: https://github.com/apache/arrow/blob/main/python/pyarrow/array.pxi#L1527
> 

> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 17:27, Spencer Nelson <[email protected]> wrote:
> 

> 

> > Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I 
> > was looking for.
> > 

> > Masked arrays for fixed-width primitive types turn out to be reasonably 
> > simple. I can call array.buffers() to get the underlying data buffer, and 
> > use numpy.frombuffer on it. For the fixed-width primitives, it appears that 
> > the memory layout is identical, so this works.
> > Then I can build the masked array with something like 
> > `np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works fine.
> > The whole thing:
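> > ```
> > import numpy as np
> > import pyarrow as pa
> > import pyarrow.compute as pc
> > 
> > def to_masked_array(array):
> >     # Fixed-width primitives have two buffers: validity bitmap, then data.
> >     # This ignores array.offset (assumes an unsliced array) and does not
> >     # apply to booleans, whose data buffer is bit-packed.
> >     _, data_buf = array.buffers()
> >     data = np.frombuffer(data_buf, array.type.to_pandas_dtype(),
> >                          count=len(array))
> >     # is_null returns a pyarrow BooleanArray; convert it to a numpy mask.
> >     mask = pc.is_null(array).to_numpy(zero_copy_only=False)
> >     return np.ma.masked_array(data, mask)
> > ```
> > 

> > "array.type.to_pandas_dtype()" is a bit odd, there. There's a 
> > pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other 
> > way. to_pandas_dtype seems to work despite the name, though.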
> > 

> > I don't think this could be made very simple for variable-length primitives 
> > or complex arrow types, but I can live with that.
> > 

> > I believe this whole thing works with zero copy. Is this something I should 
> > contribute back to pyarrow as the default behavior of to_numpy() when 
> > presented with a fixed-width primitive list that has nulls?
> > 

> > On Tue, May 2, 2023 at 5:09 PM Steve Kim <[email protected]> wrote:
> > 

> > > Adding to Aldrin's very informative answer: the pyarrow.compute.is_null 
> > > function 
> > > (https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> > >  returns a boolean array that can be converted to a mask for 
> > > numpy.ma.MaskedArray
> > > 

> > > On Tue, May 2, 2023, 18:26 Aldrin <[email protected]> wrote:
> > > 

> > > > I think per [1] and [2], because your data has null values, there is no 
> > > > good and supported approach to a zero-copy conversion to pandas or 
> > > > numpy. So, I think [3] to drop nulls, then use to_numpy() is the path 
> > > > of least resistance.
> > > > 

> > > > 

> > > > If you want to try and do the masked array approach, you need to go 
> > > > from: (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the 
> > > > Buffer as appropriate.
> > > > 

> > > > 

> > > > For (1), see [4]. For (2), see [5]. Then, [6] explains that for a 
> > > > fixed-width primitive data type, the first buffer is the validity 
> > > > bitmap. I am not sure that floats are fixed width, but I think they 
> > > > are. I know that Decimal types are a binary format.
> > > > 

> > > > 

> > > > I think [7] will be helpful to see how the validity bitmap is used in 
> > > > C++, not sure how familiar you are, but I'm not sure how far down the 
> > > > rabbit hole you'd have to go to use the validity bitmap from python.
> > > > 

> > > > 

> > > > 

> > > > 

> > > > 

> > > > [1]: 
> > > > https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> > > > 

> > > > [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> > > > 

> > > > [3]: 
> > > > https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> > > > 

> > > > [4]: 
> > > > https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> > > > 

> > > > [5]: 
> > > > https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> > > > 

> > > > [6]: 
> > > > https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> > > > 

> > > > [7]: 
> > > > https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
> > > > 

> > > > ------- Original Message -------
> > > > On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]> 
> > > > wrote:
> > > > 

> > > > 

> > > > > What's the right way to convert Arrow arrays to numpy arrays in the 
> > > > > presence of nulls?
> > > > > > The first thing I reach for is array.to_numpy(zero_copy_only=False). 
> > > > > But this has some behaviors that I found a little undesirable.
> > > > > 

> > > > > For numeric data (or at least int64 and float64), nulls are converted 
> > > > > to floating point NaNs and the resulting numpy array is recast from 
> > > > > integer to floating point. For example:
> > > > > 

> > > > > > >>> a = pa.array([1, 2, 3, None, 5])
> > > > > > >>> a
> > > > > > <pyarrow.lib.Int64Array object at 0x111b970a0>
> > > > > > [
> > > > > >   1,
> > > > > >   2,
> > > > > >   3,
> > > > > >   null,
> > > > > >   5
> > > > > > ]
> > > > > > >>> a.to_numpy(False)
> > > > > > array([ 1., 2., 3., nan, 5.])
> > > > > > 
> > > > > > This can be problematic: actual floating point NaNs are mixed with 
> > > > > > nulls, which is lossy:
> > > > > 

> > > > > >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> > > > > array([ 1., 2., nan, nan])
> > > > > 

> > > > > Boolean arrays get converted into 'object'-dtyped numpy arrays, with 
> > > > > 'True', 'False', and 'None', which is a little undesirable as well.
> > > > > 

> > > > > One tool in numpy for dealing with nullable data is masked arrays 
> > > > > (https://numpy.org/doc/stable/reference/maskedarray.html) which work 
> > > > > somewhat like Arrow arrays' validity bitmap. I was thinking of 
> > > > > writing some code that generates a numpy masked array from an arrow 
> > > > > array, but I'd need to get the validity bitmap itself, and it doesn't 
> > > > > seem to be accessible in any pyarrow APIs. Am I missing it?
> > > > > 

> > > > > Or, am I thinking about this wrong, and there's some other way to 
> > > > > pull nullable data out of arrow and into numpy?
> > > > > 

> > > > > Thanks,
> > > > > Spencer
> > > > > 

> > > > >
