Per [1] and [2], because your data has null values there is no supported 
zero-copy conversion to pandas or numpy. So I think dropping the nulls with 
[3] and then calling to_numpy() is the path of least resistance.


If you want to try the masked-array approach, you need to go: (1) 
Array -> ArrayData, (2) ArrayData -> Buffer, then (3) use the Buffer as appropriate.


For (1), see [4]. For (2), see [5]. Then, [6] explains that for a fixed-width 
primitive data type, the first buffer is the validity bitmap. Floating-point 
types are fixed width, and Decimal types are stored as a fixed-width binary 
layout as well, so both should follow that buffer layout.


I think [7] will be helpful for seeing how the validity bitmap is used in C++. 
I'm not sure how familiar you are with the C++ internals, or how far down the 
rabbit hole you'd have to go to use the validity bitmap from Python.





[1]: 
https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions

[2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy

[3]: 
https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null

[4]: 
https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219

[5]: 
https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173

[6]: 
https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout

[7]: 
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene



------- Original Message -------
On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]> wrote:


> What's the right way to convert Arrow arrays to numpy arrays in the presence 
> of nulls?
> The first thing I reach for is array.to_numpy(zero_copy_only=False). But this 
> has some behaviors that I found a little undesirable.
> 

> For numeric data (or at least int64 and float64), nulls are converted to 
> floating point NaNs and the resulting numpy array is recast from integer to 
> floating point. For example:
> 

> >>> a = pa.array([1, 2, 3, None, 5])
> >>> a
> <pyarrow.lib.Int64Array object at 0x111b970a0>
> [
> 1,
> 2,
> 3,
> null,
> 5
> ]
> >>> a.to_numpy(False)
> array([ 1., 2., 3., nan, 5.])
> This can be problematic: actual floating point NaNs are mixed with nulls, 
> which is lossy:
> 

> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> array([ 1., 2., nan, nan])
> 

> Boolean arrays get converted into 'object'-dtyped numpy arrays, with 'True', 
> 'False', and 'None', which is a little undesirable as well.
> 

> One tool in numpy for dealing with nullable data is masked arrays 
> (https://numpy.org/doc/stable/reference/maskedarray.html) which work somewhat 
> like Arrow arrays' validity bitmap. I was thinking of writing some code that 
> generates a numpy masked array from an arrow array, but I'd need to get the 
> validity bitmap itself, and it doesn't seem to be accessible in any pyarrow 
> APIs. Am I missing it?
> 

> Or, am I thinking about this wrong, and there's some other way to pull 
> nullable data out of arrow and into numpy?
> 

> Thanks,
> Spencer
> 

> 
