Hmm, just to clarify: based on the initial message, `null_is_nan=True` would represent the current default behavior of `to_numpy`. By adding that as a flag, the current behavior can be preserved as the default when `to_numpy` is modified (if desired; if not, then my whole recommendation is moot).
On the other hand, the `is_null` compute function defaults to `nan_is_null=False`, and if we can set the option for that function, then it's possible to drop all NaN values when calling `to_numpy`. So, controlling both seems desirable, even if we want to capture that behavior in a single flag for usability (or differently named flags).

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene

Sent with Proton Mail secure email.

------- Original Message -------
On Tuesday, May 2nd, 2023 at 18:27, Aldrin <[email protected]> wrote:

> orrr maybe you can add both `nan_is_null` and `null_is_nan`?
>
> The compute fn takes `nan_is_null` as an option to either return true (null)
> for NaN values or return false (not null) for NaN values.
>
> The opposite can be used by the `to_numpy` function to return nulls as masked
> (true) or as unmasked (false).
>
> This would require documentation to specify the resolution order (compute fn
> resolves `nan_is_null` first, then conversion function resolves `null_is_nan`
> second). I think it'd probably be more usable to define a single flag that
> controls both options, but just throwing the possibility out there.
>
> Either way, if you open an issue and submit a PR then the various approaches
> can be discussed there also.
>
> The implementation of the `is_null` compute function in C++ can be found at
> [2], just for future reference (I wanted to check that there isn't any
> repetitive work if it's called from the `to_numpy` function).
>
> [1]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html#pyarrow.compute.is_null
>
> [2]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/scalar_validity.cc#LL105C1-L105C1
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
>
> Sent with Proton Mail secure email.
>
> ------- Original Message -------
> On Tuesday, May 2nd, 2023 at 17:52, Aldrin <[email protected]> wrote:
>
> > cool!
> >
> > > Is this something I should contribute back to pyarrow...
> >
> > probably!
> >
> > > ...as the default behavior... when presented with a fixed-width primitive
> > > list that has nulls
> >
> > I am not sure about this. I would assume the use of maskedarray can be
> > mostly hidden, so it's probably a good idea, but I would sometimes prefer
> > something like that to be explicit, especially since it has different
> > behavior as you mentioned before (e.g. mixes nulls with NaNs).
> >
> > So, my preference would be to contribute it, but somehow using a flag (e.g.
> > 'drop_nulls' or 'use_validity') or something.
> >
> > Based on the way `to_numpy` is written ([1]), I think adding a flag and
> > adding a condition after `ConvertArrayToPandas` is called seems like a
> > reasonable approach.
> >
> > [1]: https://github.com/apache/arrow/blob/main/python/pyarrow/array.pxi#L1527
> >
> > # ------------------------------
> > # Aldrin
> >
> > https://github.com/drin/
> > https://gitlab.com/octalene
> >
> > Sent with Proton Mail secure email.
> >
> > ------- Original Message -------
> > On Tuesday, May 2nd, 2023 at 17:27, Spencer Nelson <[email protected]> wrote:
> >
> > > Thanks, both - this is helpful. pyarrow.compute.is_null is exactly what I
> > > was looking for.
> > >
> > > Masked arrays for fixed-width primitive types turn out to be reasonably
> > > simple. I can call array.buffers() to get the underlying data buffer, and
> > > use numpy.frombuffer on it. For the fixed-width primitives, it appears
> > > that the memory layout is identical, so this works.
> > > Then I can build the masked array with something like
> > > `np.ma.masked_array(data_from_buffer, mask_from_is_null)` and it works
> > > fine.
> > > The whole thing:
> > > ```
> > > import numpy as np
> > > import pyarrow as pa
> > > import pyarrow.compute as pc
> > >
> > > def to_masked_array(array):
> > >     # buffers() returns [validity bitmap, data] for fixed-width primitives
> > >     _, data_buf = array.buffers()
> > >     data = np.frombuffer(data_buf, array.type.to_pandas_dtype())
> > >     mask = pc.is_null(array)
> > >     return np.ma.masked_array(data, mask)
> > > ```
> > >
> > > "array.type.to_pandas_dtype()" is a bit odd, there. There's a
> > > pyarrow.from_numpy_dtype, but no pyarrow.to_numpy_dtype to go the other
> > > way. to_pandas_dtype seems to work despite the name, though.
> > >
> > > I don't think this could be made very simple for variable-length
> > > primitives or complex arrow types, but I can live with that.
> > >
> > > I believe this whole thing works with zero copy. Is this something I
> > > should contribute back to pyarrow as the default behavior of to_numpy()
> > > when presented with a fixed-width primitive list that has nulls?
> > >
> > > On Tue, May 2, 2023 at 5:09 PM Steve Kim <[email protected]> wrote:
> > >
> > > > Adding to Aldrin's very informative answer: the pyarrow.compute.is_null
> > > > function
> > > > (https://arrow.apache.org/docs/python/generated/pyarrow.compute.is_null.html)
> > > > returns a boolean array that can be converted to a mask for
> > > > numpy.ma.MaskedArray
> > > >
> > > > On Tue, May 2, 2023, 18:26 Aldrin <[email protected]> wrote:
> > > >
> > > > > I think per [1] and [2], because your data has null values, there is
> > > > > no good and supported approach to a zero-copy conversion to pandas or
> > > > > numpy. So, I think [3] to drop nulls, then use to_numpy() is the path
> > > > > of least resistance.
> > > > >
> > > > > If you want to try and do the masked array approach, you need to go
> > > > > from: (1) Array -> ArrayData, (2) ArrayData -> Buffer, (3) use the
> > > > > Buffer as appropriate.
> > > > >
> > > > > For (1), see [4]. For (2), see [5]. Then, [6] explains that for a
> > > > > fixed-width primitive data type, the first buffer is the validity
> > > > > bitmap. I am not sure that floats are fixed width, but I think they
> > > > > are. I know that Decimal types are a binary format.
> > > > >
> > > > > I think [7] will be helpful to see how the validity bitmap is used in
> > > > > C++, not sure how familiar you are, but I'm not sure how far down the
> > > > > rabbit hole you'd have to go to use the validity bitmap from python.
> > > > >
> > > > > [1]: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions
> > > > >
> > > > > [2]: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
> > > > >
> > > > > [3]: https://arrow.apache.org/docs/python/generated/pyarrow.compute.drop_null.html#pyarrow.compute.drop_null
> > > > >
> > > > > [4]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L219
> > > > >
> > > > > [5]: https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L173
> > > > >
> > > > > [6]: https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout
> > > > >
> > > > > [7]: https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_selection.cc#L102
> > > > >
> > > > > # ------------------------------
> > > > > # Aldrin
> > > > >
> > > > > https://github.com/drin/
> > > > > https://gitlab.com/octalene
> > > > >
> > > > > Sent with Proton Mail secure email.
> > > > >
> > > > > ------- Original Message -------
> > > > > On Tuesday, May 2nd, 2023 at 15:38, Spencer Nelson <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > What's the right way to convert Arrow arrays to numpy arrays in the
> > > > > > presence of nulls?
> > > > > > The first thing I reach for is
> > > > > > array.to_numpy(zero_copy_only=False). But this has some behaviors
> > > > > > that I found a little undesirable.
> > > > > >
> > > > > > For numeric data (or at least int64 and float64), nulls are
> > > > > > converted to floating point NaNs and the resulting numpy array is
> > > > > > recast from integer to floating point. For example:
> > > > > >
> > > > > > >>> a = pa.array([1, 2, 3, None, 5])
> > > > > > >>> a
> > > > > > <pyarrow.lib.Int64Array object at 0x111b970a0>
> > > > > > [
> > > > > >   1,
> > > > > >   2,
> > > > > >   3,
> > > > > >   null,
> > > > > >   5
> > > > > > ]
> > > > > > >>> a.to_numpy(False)
> > > > > > array([ 1., 2., 3., nan, 5.])
> > > > > >
> > > > > > This can be problematic: actual floating point NaNs are mixed with
> > > > > > nulls, which is lossy:
> > > > > >
> > > > > > >>> pa.array([1., 2., float("nan"), None]).to_numpy(False)
> > > > > > array([ 1., 2., nan, nan])
> > > > > >
> > > > > > Boolean arrays get converted into 'object'-dtyped numpy arrays,
> > > > > > with 'True', 'False', and 'None', which is a little undesirable as
> > > > > > well.
> > > > > >
> > > > > > One tool in numpy for dealing with nullable data is masked arrays
> > > > > > (https://numpy.org/doc/stable/reference/maskedarray.html) which
> > > > > > work somewhat like Arrow arrays' validity bitmap. I was thinking of
> > > > > > writing some code that generates a numpy masked array from an arrow
> > > > > > array, but I'd need to get the validity bitmap itself, and it
> > > > > > doesn't seem to be accessible in any pyarrow APIs. Am I missing it?
> > > > > >
> > > > > > Or, am I thinking about this wrong, and there's some other way to
> > > > > > pull nullable data out of arrow and into numpy?
> > > > > >
> > > > > > Thanks,
> > > > > > Spencer
