Re: Python: Array.to_numpy(), nullable data, and masked arrays

Aldrin Tue, 02 May 2023 18:36:39 -0700

mmm, just to clarify, based on the initial message, `null_is_nan=True` would 
represent the current default behavior of `to_numpy`. By adding that as a flag, 
the current behavior of the `to_numpy` function can be preserved (if desired; 
if not, then my whole recommendation is moot).


On the other hand, the `is_null` compute function defaults to 
`nan_is_null=False`, and if we can set the option for that function, then it's 
possible to drop all NaN values when calling `to_numpy`.

So, controlling both seems desirable, even if we want to capture that behavior 
in a single flag for usability (or in differently named flags).



# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene


Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, May 2nd, 2023 at 18:27, Aldrin <[email protected]> wrote:


> orrr maybe you can add both `nan_is_null` and `null_is_nan`?
> 

> The compute fn takes `nan_is_null` as an option to either return true (null) 
> for NaN values or return false (not null) for NaN values.
> 

> The opposite can be used by the `to_numpy` function to return nulls as masked 
> (true) or as unmasked (false).
> 

> This would require documentation to specify the resolution order (compute fn 
> resolves `nan_is_null` first, then conversion function resolves `null_is_nan` 
> second). I think it'd probably be more usable to define a single flag that 
> controls both options, but just throwing the possibility out there.
> 

> Either way, if you open an issue and submit a PR then the various approaches 
> can be discussed there also.
> 

> The implementation of the `is_null` compute function in C++ can be found at 
> [2], just for future reference (I wanted to check that there isn't any 
> repetitive work if it's called from the `to_numpy` function).
> 

> 

> [1]: 
> https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null
> 

> [2]: 
> https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_validity.cc#LL105C1-L105C1
> 

> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 17:52, Aldrin <[email protected]> wrote:
> 

> 

> > cool!
> > 

> > > Is this something I should contribute back to pyarrow...
> > 

> > probably!
> > 

> > > ...as the default behavior... when presented with a fixed-width primitive 
> > > list that has nulls
> > 

> > I am not sure about this. I would assume the use of maskedarray can be 
> > mostly hidden, so it's probably a good idea, but I would sometimes prefer 
> > something like that to be explicit, especially since it has different 
> > behavior as you mentioned before (e.g. mixes nulls with NaNs).
> > 

> > So, my preference would be to contribute it, but somehow using a flag (e.g. 
> > 'drop_nulls' or 'use_validity') or something.
> > 

> > Based on the way `to_numpy` is written ([1]), I think adding a flag and 
> > adding a condition after `ConvertArrayToPandas` is called seems like a 
> > reasonable approach.
> > 

> > 

> > [1]: 
> > https://github.com/apache/arrow/blob/main/python/pyarrow/array.pxi#L1527
> > 

> > ------- Original Message -------
> > On Tuesday, May 2nd, 2023 at 17:27, Spencer Nelson <[email protected]> wrote:
> > 

> > 

> > > Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I 
> > > was looking for.
> > > 

> > > Masked arrays for fixed-width primitive types turn out to be reasonably 
> > > simple. I can call array.buffers() to get the underlying data buffer, and 
> > > use numpy.frombuffer on it. For the fixed-width primitives, it appears 
> > > that the memory layout is identical, so this works.
> > > Then I can build the masked array with something like 
> > > `np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works 
> > > fine.
> > > The whole thing:
> > > ```
> > > import numpy as np
> > > import pyarrow as pa
> > > import pyarrow.compute as pc
> > > 

> > > > def to_masked_array(array):
> > > >     # For fixed-width primitives, buffers() is [validity bitmap, data].
> > > >     # Assumes an unsliced array (offset 0).
> > > >     _, data_buf = array.buffers()
> > > >     data = np.frombuffer(data_buf, array.type.to_pandas_dtype())
> > > >     mask = pc.is_null(array).to_numpy(zero_copy_only=False)
> > > >     return np.ma.masked_array(data, mask)
> > > > ```
> > > > 

> > > > "array.type.to_pandas_dtype()" is a bit odd, there. There's a 
> > > > pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other 
> > > > way. to_pandas_dtype seems to work despite the name, though.
> > > 

> > > I don't think this could be made very simple for variable-length 
> > > primitives or complex arrow types, but I can live with that.
> > > 

> > > I believe this whole thing works with zero copy. Is this something I 
> > > should contribute back to pyarrow as the default behavior of to_numpy() 
> > > when presented with a fixed-width primitive list that has nulls?
> > > 

> > > On Tue, May 2, 2023 at 5:09 PM Steve Kim <[email protected]> wrote:
> > > 

> > > > Adding to Aldrin's very informative answer: the pyarrow.compute.is_null 
> > > > function 
> > > > (https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> > > >  returns a boolean array that can be converted to a mask for 
> > > > numpy.ma.MaskedArray
> > > > 

> > > > On Tue, May 2, 2023, 18:26 Aldrin <[email protected]> wrote:
> > > > 

> > > > > I think per [1] and [2], because your data has null values, there is 
> > > > > no good and supported approach to a zero-copy conversion to pandas or 
> > > > > numpy. So, I think [3] to drop nulls, then use to_numpy() is the path 
> > > > > of least resistance.
> > > > > 

> > > > > 

> > > > > If you want to try and do the masked array approach, you need to go 
> > > > > from: (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the 
> > > > > Buffer as appropriate.
> > > > > 

> > > > > 

> > > > > For (1), see [4]. For (2), see [5]. Then, [6] explains that for a 
> > > > > fixed-width primitive data type, the first buffer is the validity 
> > > > > bitmap. I am not sure that floats are fixed width, but I think they 
> > > > > are. I know that Decimal types are a binary format.
> > > > > 

> > > > > 

> > > > > I think [7] will be helpful to see how the validity bitmap is used in 
> > > > > C++, not sure how familiar you are, but I'm not sure how far down the 
> > > > > rabbit hole you'd have to go to use the validity bitmap from python.
> > > > > 

> > > > > 

> > > > > 

> > > > > 

> > > > > 

> > > > > [1]: 
> > > > > https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> > > > > 

> > > > > [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> > > > > 

> > > > > [3]: 
> > > > > https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> > > > > 

> > > > > [4]: 
> > > > > https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> > > > > 

> > > > > [5]: 
> > > > > https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> > > > > 

> > > > > [6]: 
> > > > > https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> > > > > 

> > > > > [7]: 
> > > > > https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
> > > > > 

> > > > > 

> > > > > 

> > > > > 

> > > > > # ------------------------------
> > > > > 

> > > > > # Aldrin
> > > > > 

> > > > > 

> > > > > https://github.com/drin/
> > > > > 

> > > > > https://gitlab.com/octalene
> > > > > 

> > > > > 

> > > > > Sent with Proton Mail secure email.
> > > > > 

> > > > > ------- Original Message -------
> > > > > On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]> 
> > > > > wrote:
> > > > > 

> > > > > 

> > > > > > What's the right way to convert Arrow arrays to numpy arrays in the 
> > > > > > presence of nulls?
> > > > > > > The first thing I reach for is 
> > > > > > > array.to_numpy(zero_copy_only=False). But this has some behaviors 
> > > > > > > that I found a little undesirable.
> > > > > > 

> > > > > > For numeric data (or at least int64 and float64), nulls are 
> > > > > > converted to floating point NaNs and the resulting numpy array is 
> > > > > > recast from integer to floating point. For example:
> > > > > > 

> > > > > > > >>> a = pa.array([1, 2, 3, None, 5])
> > > > > > > >>> a
> > > > > > > <pyarrow.lib.Int64Array object at 0x111b970a0>
> > > > > > > [
> > > > > > >   1,
> > > > > > >   2,
> > > > > > >   3,
> > > > > > >   null,
> > > > > > >   5
> > > > > > > ]
> > > > > > > >>> a.to_numpy(False)
> > > > > > > array([ 1., 2., 3., nan, 5.])
> > > > > > > 
> > > > > > > This can be problematic: actual floating point NaNs are mixed with 
> > > > > > > nulls, which is lossy:
> > > > > > 

> > > > > > >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> > > > > > array([ 1., 2., nan, nan])
> > > > > > 

> > > > > > Boolean arrays get converted into 'object'-dtyped numpy arrays, 
> > > > > > with 'True', 'False', and 'None', which is a little undesirable as 
> > > > > > well.
> > > > > > 

> > > > > > One tool in numpy for dealing with nullable data is masked arrays 
> > > > > > (https://numpy.org/doc/stable/reference/maskedarray.html) which 
> > > > > > work somewhat like Arrow arrays' validity bitmap. I was thinking of 
> > > > > > writing some code that generates a numpy masked array from an arrow 
> > > > > > array, but I'd need to get the validity bitmap itself, and it 
> > > > > > doesn't seem to be accessible in any pyarrow APIs. Am I missing it?
> > > > > > 

> > > > > > Or, am I thinking about this wrong, and there's some other way to 
> > > > > > pull nullable data out of arrow and into numpy?
> > > > > > 

> > > > > > Thanks,
> > > > > > Spencer
> > > > > > 

> > > > > >
