Re: Python: Array.to_numpy(), nullable data, and masked arrays

Aldrin Tue, 02 May 2023 17:52:50 -0700

cool!

> Is this something I should contribute back to pyarrow...


probably!

> ...as the default behavior... when presented with a fixed-width primitive 
> list that has nulls

I am not sure about this. The use of a masked array can probably be mostly 
hidden, so it's likely a good idea, but I would generally prefer something 
like that to be explicit, especially since it behaves differently from the 
current conversion you described (which mixes nulls with NaNs).

So, my preference would be to contribute it, but behind a flag (e.g. 
'drop_nulls' or 'use_validity').

Based on the way `to_numpy` is written ([1]), I think adding a flag and adding 
a condition after `ConvertArrayToPandas` is called seems like a reasonable 
approach.


[1]: https://github.com/apache/arrow/blob/main/python/pyarrow/array.pxi#L1527




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene


Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, May 2nd, 2023 at 17:27, Spencer Nelson <[email protected]> wrote:


> Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I was 
> looking for.
> 

> Masked arrays for fixed-width primitive types turn out to be reasonably 
> simple. I can call array.buffers() to get the underlying data buffer, and use 
> numpy.frombuffer on it. For the fixed-width primitives, it appears that the 
> memory layout is identical, so this works.
> Then I can build the masked array with something like 
> `np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works fine.
> The whole thing:
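> ```
> import numpy as np
> import pyarrow as pa
> import pyarrow.compute as pc
> 
> def to_masked_array(array):
>     # Fixed-width primitives have two buffers: validity bitmap, then data.
>     _, data_buf = array.buffers()
>     # pyarrow arrays expose `.type`, not `.dtype`.
>     dtype = np.dtype(array.type.to_pandas_dtype())
>     # Honor offset and length so sliced arrays convert correctly.
>     data = np.frombuffer(data_buf, dtype, count=len(array),
>                          offset=array.offset * dtype.itemsize)
>     mask = pc.is_null(array).to_numpy(zero_copy_only=False)
>     return np.ma.masked_array(data, mask)
> ```
> 

> "array.type.to_pandas_dtype()" is a bit odd there. There's a 
> pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other way. 
> to_pandas_dtype seems to work despite the name, though.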
> 

> I don't think this could be made very simple for variable-length primitives 
> or complex arrow types, but I can live with that.
> 

> I believe this whole thing works with zero copy. Is this something I should 
> contribute back to pyarrow as the default behavior of to_numpy() when 
> presented with a fixed-width primitive list that has nulls?
> 

> On Tue, May 2, 2023 at 5:09 PM Steve Kim <[email protected]> wrote:
> 


> > Adding to Aldrin's very informative answer: the pyarrow.compute.is_null 
> > function 
> > (https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> >  returns a boolean array that can be converted to a mask for 
> > numpy.ma.MaskedArray
> > 

> > On Tue, May 2, 2023, 18:26 Aldrin <[email protected]> wrote:
> > 

> > > I think per [1] and [2], because your data has null values, there is no 
> > > good and supported approach to a zero-copy conversion to pandas or numpy. 
> > > So, I think [3] to drop nulls, then use to_numpy() is the path of least 
> > > resistance.
> > > 

> > > 

> > > If you want to try and do the masked array approach, you need to go from: 
> > > (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as 
> > > appropriate.
> > > 

> > > 

> > > For (1), see [4]. For (2), see [5]. Then, [6] explains that for a 
> > > fixed-width primitive data type, the first buffer is the validity bitmap. 
> > > I am not sure that floats are fixed width, but I think they are. I know 
> > > that Decimal types are a binary format.
> > > 

> > > 

> > > I think [7] will be helpful for seeing how the validity bitmap is used in 
> > > C++. I don't know how familiar you are with it, and I'm not sure how far 
> > > down the rabbit hole you'd have to go to use the validity bitmap from Python.
> > > 

> > > 

> > > 

> > > 

> > > 

> > > [1]: 
> > > https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> > > 

> > > [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> > > 

> > > [3]: 
> > > https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> > > 

> > > [4]: 
> > > https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> > > 

> > > [5]: 
> > > https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> > > 

> > > [6]: 
> > > https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> > > 

> > > [7]: 
> > > https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
> > > 

> > > 

> > > 

> > > 


> > > ------- Original Message -------
> > > On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]> 
> > > wrote:
> > > 

> > > 

> > > > What's the right way to convert Arrow arrays to numpy arrays in the 
> > > > presence of nulls?
> > > > The first thing I reach for is array.to_numpy(zero_copy_only=False). 
> > > > But this has some behaviors that I found a little undesirable.
> > > > 

> > > > For numeric data (or at least int64 and float64), nulls are converted 
> > > > to floating point NaNs and the resulting numpy array is recast from 
> > > > integer to floating point. For example:
> > > > 

> > > > >>> a = pa.array([1, 2, 3, None, 5])
> > > > >>> a
> > > > <pyarrow.lib.Int64Array object at 0x111b970a0>
> > > > [
> > > >   1,
> > > >   2,
> > > >   3,
> > > >   null,
> > > >   5
> > > > ]
> > > > >>> a.to_numpy(False)
> > > > array([ 1., 2., 3., nan, 5.])
> > > > 
> > > > This can be problematic: actual floating point NaNs are mixed with 
> > > > nulls, which is lossy:
> > > > 

> > > > >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> > > > array([ 1., 2., nan, nan])
> > > > 

> > > > Boolean arrays get converted into 'object'-dtyped numpy arrays, with 
> > > > 'True', 'False', and 'None', which is a little undesirable as well.
> > > > 

> > > > One tool in numpy for dealing with nullable data is masked arrays 
> > > > (https://numpy.org/doc/stable/reference/maskedarray.html) which work 
> > > > somewhat like Arrow arrays' validity bitmap. I was thinking of writing 
> > > > some code that generates a numpy masked array from an arrow array, but 
> > > > I'd need to get the validity bitmap itself, and it doesn't seem to be 
> > > > accessible in any pyarrow APIs. Am I missing it?
> > > > 

> > > > Or, am I thinking about this wrong, and there's some other way to pull 
> > > > nullable data out of arrow and into numpy?
> > > > 

> > > > Thanks,
> > > > Spencer
> > > > 

> > > >
