hi Rhys,

On Mon, Dec 10, 2018 at 9:53 AM Rhys Ulerich <rhys.uler...@twosigma.com> wrote:
>
> 'Morning,
>
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should
> is_valid be interpreted for primitive types that have their own notions of
> is_valid?
>
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float
> NaN) versus an "invalid NaN" (is valid 0 with float NaN)? In RFC-ese, MUST
> individual NaNs be valid? Or, MUST floats all be valid by omitting the
> validity bitset?
>
In floating point types, NaN is a valid value. I think you're talking
about systems that use sentinel values to represent nulls. The Arrow
columnar format does not have any notion of sentinel values. So if you
want other Arrow systems to recognize your values as being null, then
you must construct the validity bitmap accordingly.

>
> I ask because otherwise I can see a bunch of different systems interpreting
> this detail in many different ways. That'd be an interop nightmare.
> Especially since understanding why NaNs sneak into large datasets is already
> quite a hassle.
>

It is up to applications to determine what NaN means. It would not be
appropriate for Arrow to assume anything, particularly since most
database systems (AFAIK) distinguish NaN and NULL.

For example, in Python interop, we recognize NaN as null when
converting to Arrow, but _only_ if the data originated from pandas:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/type_traits.h#L102

In [1]: import pyarrow as pa

In [2]: import numpy as np

In [3]: arr = np.array([1, np.nan])

In [4]: arr1 = pa.array(arr)

In [5]: arr2 = pa.array(arr, from_pandas=True)

In [6]: arr1
Out[6]:
<pyarrow.lib.DoubleArray object at 0x7ffa3c8a1188>
[
  1,
  nan
]

In [7]: arr2
Out[7]:
<pyarrow.lib.DoubleArray object at 0x7ffa1ef42bd8>
[
  1,
  null
]

In [8]: arr1.null_count
Out[8]: 0

In [9]: arr2.null_count
Out[9]: 1

In R, NaN and NA are distinct

https://github.com/apache/arrow/commit/3ab4a0f481211c5d115845519eb9398dc02e2e24#diff-4b43b0aee35624cd95b910189b3dc231

>
> Anyhow, it seems worth addressing this gap at the written specification level.
> What would you suggest?

We could add a statement to be explicit that no special / sentinel
values (which includes NaN) are recognized as null.

- Wes

>
> (Apologies if this has been discussed previously-- I've found no searchable
> mailing list archives under
> http://mail-archives.apache.org/mod_mbox/arrow-dev/ or
> https://cwiki.apache.org/confluence/display/ARROW.)
>
> Thanks,
> Rhys
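
To make the "valid NaN" versus "null" distinction concrete, here is a
minimal sketch in plain Python. It is a toy model of a validity bitmap,
not the pyarrow API: build_validity_bitmap and is_valid are hypothetical
helpers written only for this illustration. It assumes the
least-significant-bit-first bit order described in the memory layout
document, where a set bit means the slot is valid. Whether a NaN slot
gets a 0 bit is decided by the predicate the application passes in,
never by the float value alone.

```python
import math


def build_validity_bitmap(values, is_null):
    """Pack one validity bit per slot, least-significant bit first.

    Bit j is 1 when slot j is valid (non-null). Hypothetical helper,
    not part of any Arrow library.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    for j, value in enumerate(values):
        if not is_null(value):
            bitmap[j // 8] |= 1 << (j % 8)
    return bytes(bitmap)


def is_valid(bitmap, j):
    """Read the validity bit for slot j."""
    return bool(bitmap[j // 8] & (1 << (j % 8)))


values = [1.0, float("nan"), 3.0]

# No sentinel values: every slot, NaN included, is marked valid.
nan_is_value = build_validity_bitmap(values, lambda v: False)

# An application's own choice: treat NaN as a sentinel and clear those bits.
nan_is_null = build_validity_bitmap(values, math.isnan)

print([is_valid(nan_is_value, j) for j in range(len(values))])
print([is_valid(nan_is_null, j) for j in range(len(values))])
```

The values are identical in both cases; only the bitmap differs, which
is why a consumer has to read the bitmap rather than inspect the floats
to decide what is null.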