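Adding to Aldrin's very informative answer: the pyarrow.compute.is_null function (https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html) returns a boolean array that is True wherever the input is null. numpy.ma.MaskedArray uses the same convention (True means "masked"), so once that boolean array is converted with to_numpy(zero_copy_only=False) it can be passed as the mask as-is.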

On Tue, May 2, 2023, 18:26 Aldrin <octalene....@pm.me> wrote:

> I think per [1] and [2], because your data has null values, there is no
> good and supported approach to a zero-copy conversion to pandas or numpy.
> So, I think [3] to drop nulls, then use to_numpy() is the path of least
> resistance.
>
> If you want to try and do the masked array approach, you need to go from:
> (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the Buffer as
> appropriate.
>
> For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
> fixed-width primitive data type, the first buffer is the validity bitmap. I
> am not sure that floats are fixed width, but I think they are. I know that
> Decimal types are a binary format.
>
> I think [7] will be helpful to see how the validity bitmap is used in C++,
> not sure how familiar you are, but I'm not sure how far down the rabbit
> hole you'd have to go to use the validity bitmap from python.
>
> [1]: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> [3]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> [4]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> [5]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> [6]: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> [7]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
>
> Sent with Proton Mail <https://proton.me/> secure email.
>
> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <swnel...@uw.edu> wrote:
>
> What's the right way to convert Arrow arrays to numpy arrays in the
> presence of nulls?
>
> The first thing I reach for is array.to_numpy(zero_copy_only=False). But
> this has some behaviors that I found a little undesirable.
>
> For numeric data (or at least int64 and float64), nulls are converted to
> floating point NaNs and the resulting numpy array is recast from integer to
> floating point. For example:
>
> >>> a = pa.array([1, 2, 3, None, 5])
> >>> a
> <pyarrow.lib.Int64Array object at 0x111b970a0>
> [
>   1,
>   2,
>   3,
>   null,
>   5
> ]
> >>> a.to_numpy(zero_copy_only=False)
> array([ 1.,  2.,  3., nan,  5.])
>
> This can be problematic: *actual* floating point NaNs are mixed with
> nulls, which is lossy:
>
> >>> pa.array([1., 2., float("nan"), None]).to_numpy(zero_copy_only=False)
> array([ 1.,  2., nan, nan])
>
> Boolean arrays get converted into 'object'-dtyped numpy arrays, with
> 'True', 'False', and 'None', which is a little undesirable as well.
>
> One tool in numpy for dealing with nullable data is masked arrays
> (https://numpy.org/doc/stable/reference/maskedarray.html) which work
> somewhat like Arrow arrays' validity bitmap. I was thinking of writing some
> code that generates a numpy masked array from an arrow array, but I'd need
> to get the validity bitmap itself, and it doesn't seem to be accessible in
> any pyarrow APIs. Am I missing it?
>
> Or, am I thinking about this wrong, and there's some other way to pull
> nullable data out of arrow and into numpy?
>
> Thanks,
> Spencer