I think per [1] and [2], because your data has null values, there is no supported zero-copy conversion to pandas or numpy. So I think using [3] to drop nulls and then calling to_numpy() is the path of least resistance.
If you want to try the masked-array approach, you need to go from: (1) Array -> ArrayData, (2) ArrayData -> Buffer, and then (3) use the Buffer as appropriate. For (1), see [4]. For (2), see [5]. Then, [6] explains that for a fixed-width primitive data type, the first buffer is the validity bitmap. I am not sure off-hand that floats are fixed width, but I think they are; I know that Decimal types are a binary format. I think [7] will be helpful to see how the validity bitmap is used in C++. I'm not sure how familiar you are with that code, or how far down the rabbit hole you'd have to go to use the validity bitmap from Python.

[1]: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
[2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
[3]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
[4]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
[5]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
[6]: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
[7]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene

Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]> wrote:

> What's the right way to convert Arrow arrays to numpy arrays in the presence
> of nulls?
>
> The first thing I reach for is array.to_numpy(zero_copy_only=False). But this
> has some behaviors that I found a little undesirable.
>
> For numeric data (or at least int64 and float64), nulls are converted to
> floating point NaNs and the resulting numpy array is recast from integer to
> floating point.
> For example:
>
> >>> a = pa.array([1, 2, 3, None, 5])
> >>> a
> <pyarrow.lib.Int64Array object at 0x111b970a0>
> [
>   1,
>   2,
>   3,
>   null,
>   5
> ]
> >>> a.to_numpy(False)
> array([ 1.,  2.,  3., nan,  5.])
>
> This can be problematic: actual floating point NaNs are mixed with nulls,
> which is lossy:
>
> >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> array([ 1.,  2., nan, nan])
>
> Boolean arrays get converted into 'object'-dtyped numpy arrays, with True,
> False, and None, which is a little undesirable as well.
>
> One tool in numpy for dealing with nullable data is masked arrays
> (https://numpy.org/doc/stable/reference/maskedarray.html), which work somewhat
> like Arrow arrays' validity bitmap. I was thinking of writing some code that
> generates a numpy masked array from an arrow array, but I'd need to get the
> validity bitmap itself, and it doesn't seem to be accessible in any pyarrow
> APIs. Am I missing it?
>
> Or, am I thinking about this wrong, and there's some other way to pull
> nullable data out of arrow and into numpy?
>
> Thanks,
> Spencer
