Or maybe you could add both `nan_is_null` and `null_is_nan`? The `is_null` compute fn [1] takes `nan_is_null` as an option: when it is set, NaN values return true (null); otherwise they return false (not null).
The opposite can be used by the `to_numpy` function to return nulls as masked (true) or as unmasked (false). This would require documentation to specify the resolution order (the compute fn resolves `nan_is_null` first, then the conversion function resolves `null_is_nan` second).

I think it'd probably be more usable to define a single flag that controls both options, but just throwing the possibility out there. Either way, if you open an issue and submit a PR then the various approaches can be discussed there also.

The implementation of the `is_null` compute function in C++ can be found at [2], just for future reference (I wanted to check that there isn't any repetitive work if it's called from the `to_numpy` function).

[1]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null
[2]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_validity.cc#LL105C1-L105C1

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene

Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, May 2nd, 2023 at 17:52, Aldrin <[email protected]> wrote:

> cool!
>
> > Is this something I should contribute back to pyarrow...
>
> probably!
>
> > ...as the default behavior... when presented with a fixed-width primitive
> > list that has nulls
>
> I am not sure about this. I would assume the use of maskedarray can be mostly
> hidden, so it's probably a good idea, but I would sometimes prefer something
> like that to be explicit, especially since it has different behavior as you
> mentioned before (e.g. mixes nulls with NaNs).
>
> So, my preference would be to contribute it, but somehow using a flag (e.g.
> 'drop_nulls' or 'use_validity') or something.
>
> Based on the way `to_numpy` is written ([1]), I think adding a flag and
> adding a condition after `ConvertArrayToPandas` is called seems like a
> reasonable approach.
>
> [1]: https://github.com/apache/arrow/blob/main/python/pyarrow/array.pxi#L1527
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
>
> Sent with Proton Mail secure email.
>
> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 17:27, Spencer Nelson <[email protected]> wrote:
>
> > Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I
> > was looking for.
> >
> > Masked arrays for fixed-width primitive types turn out to be reasonably
> > simple. I can call array.buffers() to get the underlying data buffer, and
> > use numpy.frombuffer on it. For the fixed-width primitives, it appears that
> > the memory layout is identical, so this works.
> > Then I can build the masked array with something like
> > `np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works fine.
> > The whole thing:
> > ```
> > import numpy as np
> > import pyarrow as pa
> > import pyarrow.compute as pc
> >
> > def to_masked_array(array):
> >     # buffers() returns [validity bitmap, data] for fixed-width primitives
> >     _, data_buf = array.buffers()
> >     data = np.frombuffer(data_buf, array.type.to_pandas_dtype())
> >     mask = pc.is_null(array)
> >     return np.ma.masked_array(data, mask)
> > ```
> >
> > "array.type.to_pandas_dtype()" is a bit odd, there. There's a
> > pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other
> > way. to_pandas_dtype seems to work despite the name, though.
> >
> > I don't think this could be made very simple for variable-length primitives
> > or complex arrow types, but I can live with that.
> >
> > I believe this whole thing works with zero copy. Is this something I should
> > contribute back to pyarrow as the default behavior of to_numpy() when
> > presented with a fixed-width primitive list that has nulls?
> >
> > On Tue, May 2, 2023 at 5:09 PM Steve Kim <[email protected]> wrote:
> >
> > > This Message Is From an Untrusted Sender
> > > You have not previously corresponded with this sender.
> > > See https://itconnect.uw.edu/email-tags for additional information.
> > > Please contact the UW-IT Service Center, [email protected] 206.221.5000, for
> > > assistance.
> > >
> > > Adding to Aldrin's very informative answer: the pyarrow.compute.is_null
> > > function
> > > (https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> > > returns a boolean array that can be converted to a mask for
> > > numpy.ma.MaskedArray
> > >
> > > On Tue, May 2, 2023, 18:26 Aldrin <[email protected]> wrote:
> > >
> > > > I think per [1] and [2], because your data has null values, there is no
> > > > good and supported approach to a zero-copy conversion to pandas or
> > > > numpy. So, I think [3] to drop nulls, then use to_numpy() is the path
> > > > of least resistance.
> > > >
> > > > If you want to try and do the masked array approach, you need to go
> > > > from: (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the
> > > > Buffer as appropriate.
> > > >
> > > > For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
> > > > fixed-width primitive data type, the first buffer is the validity
> > > > bitmap. I am not sure that floats are fixed width, but I think they
> > > > are. I know that Decimal types are a binary format.
> > > >
> > > > I think [7] will be helpful to see how the validity bitmap is used in
> > > > C++, not sure how familiar you are, but I'm not sure how far down the
> > > > rabbit hole you'd have to go to use the validity bitmap from python.
> > > >
> > > > [1]: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> > > > [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> > > > [3]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> > > > [4]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> > > > [5]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> > > > [6]: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> > > > [7]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
> > > >
> > > > # ------------------------------
> > > > # Aldrin
> > > >
> > > > https://github.com/drin/
> > > > https://gitlab.com/octalene
> > > >
> > > > Sent with Proton Mail secure email.
> > > >
> > > > ------- Original Message -------
> > > > On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]>
> > > > wrote:
> > > >
> > > > > What's the right way to convert Arrow arrays to numpy arrays in the
> > > > > presence of nulls?
> > > > > The first thing I reach for is array.to_numpy(zero_copy_only=False).
> > > > > But this has some behaviors that I found a little undesirable.
> > > > >
> > > > > For numeric data (or at least int64 and float64), nulls are converted
> > > > > to floating point NaNs and the resulting numpy array is recast from
> > > > > integer to floating point.
> > > > > For example:
> > > > >
> > > > > >>> a = pa.array([1, 2, 3, None, 5])
> > > > > >>> a
> > > > > <pyarrow.lib.Int64Array object at 0x111b970a0>
> > > > > [
> > > > >   1,
> > > > >   2,
> > > > >   3,
> > > > >   null,
> > > > >   5
> > > > > ]
> > > > > >>> a.to_numpy(False)
> > > > > array([ 1., 2., 3., nan, 5.])
> > > > >
> > > > > This can be problematic: actual floating point NaNs are mixed with
> > > > > nulls, which is lossy:
> > > > >
> > > > > >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> > > > > array([ 1., 2., nan, nan])
> > > > >
> > > > > Boolean arrays get converted into 'object'-dtyped numpy arrays, with
> > > > > 'True', 'False', and 'None', which is a little undesirable as well.
> > > > >
> > > > > One tool in numpy for dealing with nullable data is masked arrays
> > > > > (https://numpy.org/doc/stable/reference/maskedarray.html) which work
> > > > > somewhat like Arrow arrays' validity bitmap. I was thinking of
> > > > > writing some code that generates a numpy masked array from an arrow
> > > > > array, but I'd need to get the validity bitmap itself, and it doesn't
> > > > > seem to be accessible in any pyarrow APIs. Am I missing it?
> > > > >
> > > > > Or, am I thinking about this wrong, and there's some other way to
> > > > > pull nullable data out of arrow and into numpy?
> > > > >
> > > > > Thanks,
> > > > > Spencer
