+1 on NaNs already being an interop nightmare, especially for those who work with multiple programming languages at the same time.
Issues regarding NaNs may be found at <https://issues.apache.org/jira/browse/ARROW-2806?jql=text%20~%20%22NaN%22>. The last issue I see was from July 2018, with Python, and marked resolved 17 July 2018. The description may be helpful.

Regards,

Donald E. Foss | @DonaldFoss <https://twitter.com/DonaldFoss>
Never Stop Learning!
------ __o
----_`\<,_
---(_)/ (_)

> On Dec 10, 2018, at 10:47 AM, Rhys Ulerich <rhys.uler...@twosigma.com> wrote:
>
> 'Morning,
>
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should
> is_valid be interpreted for primitive types that have their own notions of
> is_valid?
>
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float
> NaN) versus an "invalid NaN" (is_valid 0 with float NaN)? In RFC-ese, MUST
> individual NaNs be valid? Or, MUST floats all be valid by omitting the
> validity bitset?
>
> I ask because otherwise I can see a bunch of different systems interpreting
> this detail in many different ways. That'd be an interop nightmare.
> Especially since understanding why NaNs sneak into large datasets is already
> quite a hassle.
>
> Anyhow, it seems worth addressing this gap at the written specification level.
>
> (Apologies if this has been discussed previously-- I've found no searchable
> mailing list archives under
> http://mail-archives.apache.org/mod_mbox/arrow-dev/ or
> https://cwiki.apache.org/confluence/display/ARROW.)
>
> Thanks,
>
> Rhys
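
To make the "valid NaN" versus "invalid NaN" question concrete, here is a rough sketch in plain Python (not pyarrow). It models a float column as a list of values plus a validity bitmap with least-significant-bit numbering, as in the memory layout document. The helper names `is_valid` and `classify` are made up for illustration. The rule that a cleared validity bit means null no matter what the slot holds is only one possible interpretation, not something this thread has settled.

```python
import math


def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if slot i has its validity bit set (LSB bit numbering)."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def classify(values, bitmap):
    """Label each slot under the ASSUMED rule that validity wins over content.

    A cleared bit means "null" even if the slot happens to hold NaN;
    a set bit with NaN content is reported as a "valid NaN".
    """
    labels = []
    for i, v in enumerate(values):
        if not is_valid(bitmap, i):
            labels.append("null")
        elif math.isnan(v):
            labels.append("valid NaN")
        else:
            labels.append("value")
    return labels


# Hypothetical column: slots 1 and 2 both hold NaN, but only slot 2 is null.
values = [1.5, math.nan, math.nan, 2.0]
bitmap = bytes([0b00001011])  # bits 0, 1, 3 set; bit 2 cleared

print(classify(values, bitmap))
```

Under this reading, a reader that checks only `math.isnan` would conflate slots 1 and 2, while a reader that checks only the bitmap would miss the NaN in slot 1. That is the kind of divergence between systems the question is about.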