Re: valid NaNs versus invalid NaNs?

2018-12-10 Thread Donald Foss
Alternatively, Rhys, what Wes said. :)

Donald E. Foss | @DonaldFoss 
Never Stop Learning!
-- __o
_`\<,_
---(_)/ (_)




Re: valid NaNs versus invalid NaNs?

2018-12-10 Thread Donald Foss
+1 on NaNs already being an interop nightmare, especially for those who work 
with multiple programming languages at the same time.

Issues regarding NaNs may be found at 
https://issues.apache.org/jira/browse/ARROW-2806?jql=text%20~%20%22NaN%22. 
The most recent issue I see is from July 2018, involves Python, and was marked 
resolved on 17 July 2018. Its description may be helpful.

Regards,

Donald E. Foss | @DonaldFoss 

> On Dec 10, 2018, at 10:47 AM, Rhys Ulerich wrote:
> 
> 'Morning,
> 
> 
> 
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should 
> is_valid be interpreted for primitive types that have their own notions of 
> is_valid?
> 
> 
> 
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float 
> NaN) versus an "invalid NaN" (is_valid 0 with float NaN)?  In RFC-ese, MUST 
> individual NaNs be valid?  Or, MUST floats all be valid by omitting the 
> validity bitset?
> 
> 
> 
> I ask because otherwise I can see a bunch of different systems interpreting 
> this detail in many different ways.  That'd be an interop nightmare.  
> Especially since understanding why NaNs sneak into large datasets is already 
> quite a hassle.
> 
> 
> 
> Anyhow, it seems worth addressing this gap at the written specification level.
> 
> 
> 
> (Apologies if this has been discussed previously-- I've found no searchable 
> mailing list archives under 
> http://mail-archives.apache.org/mod_mbox/arrow-dev/ or 
> https://cwiki.apache.org/confluence/display/ARROW.)
> 
> 
> 
> Thanks,
> 
> Rhys



RE: valid NaNs versus invalid NaNs?

2018-12-10 Thread Rhys Ulerich
>> Anyhow, it seems worth addressing this gap at the written specification 
>> level.
> What would you suggest? We could add a statement to be explicit that no 
> special / sentinel values (which includes NaN) are recognized as null.

I like your suggestion, Wes.  Please consider making that amendment (or similar) 
in the next specification update.

Cheers,
Rhys


Re: valid NaNs versus invalid NaNs?

2018-12-10 Thread Wes McKinney
hi Rhys,

On Mon, Dec 10, 2018 at 9:53 AM Rhys Ulerich wrote:
>
> 'Morning,
>
>
>
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should 
> is_valid be interpreted for primitive types that have their own notions of 
> is_valid?
>
>
>
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float 
> NaN) versus an "invalid NaN" (is_valid 0 with float NaN)?  In RFC-ese, MUST 
> individual NaNs be valid?  Or, MUST floats all be valid by omitting the 
> validity bitset?
>

In floating point types, NaN is a valid value. I think you're talking
about systems that use sentinel values to represent nulls. The Arrow
columnar format does not have any notion of sentinel values. So if you
want other Arrow systems to recognize your values as being null, then
you must construct the validity bitmap accordingly.
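
To make "construct the validity bitmap accordingly" concrete, here is a
minimal, library-free sketch. It is not pyarrow code: the helper names
build_validity_bitmap and is_valid are hypothetical, and the only thing
assumed from the columnar format is one bit per slot, least-significant
bit first. Only None is treated as null; NaN stays an ordinary, valid float.

```python
def build_validity_bitmap(values):
    """Pack one validity bit per slot, least-significant bit first.

    Hypothetical helper for illustration: only None counts as null.
    A float NaN is a regular value, so its bit is set to 1.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Read back the validity bit for slot i."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


values = [1.0, float("nan"), None, 4.0]
bitmap = build_validity_bitmap(values)

# Slot 1 holds NaN and is valid; slot 2 is the only null.
print([is_valid(bitmap, i) for i in range(len(values))])
```

A producer that wants its NaNs read as nulls by other systems would have to
clear those bits itself; nothing in this sketch does so implicitly.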

>
>
> I ask because otherwise I can see a bunch of different systems interpreting 
> this detail in many different ways.  That'd be an interop nightmare.  
> Especially since understanding why NaNs sneak into large datasets is already 
> quite a hassle.
>

It is up to applications to determine what NaN means. It would not be
appropriate for Arrow to assume anything, particularly since most
database systems (AFAIK) distinguish NaN and NULL.

For example, in Python interop, we recognize NaN as null when
converting to Arrow, but _only_ if the data originated from pandas:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/type_traits.h#L102

In [1]: import pyarrow as pa

In [2]: import numpy as np

In [3]: arr = np.array([1, np.nan])

In [4]: arr1 = pa.array(arr)

In [5]: arr2 = pa.array(arr, from_pandas=True)

In [6]: arr1
Out[6]:

[
  1,
  nan
]

In [7]: arr2
Out[7]:

[
  1,
  null
]

In [8]: arr1.null_count
Out[8]: 0

In [9]: arr2.null_count
Out[9]: 1
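
The opt-in behaviour in the session above can be mimicked without pyarrow.
This is only a sketch of the idea, not how pyarrow implements it:
to_nullable, its nan_as_null flag, and null_count are hypothetical names
standing in for an application-level conversion policy.

```python
import math


def to_nullable(values, nan_as_null=False):
    """Split values into (data, validity), one validity flag per slot.

    Hypothetical helper: None is always null. NaN becomes null only
    when the caller opts in via nan_as_null=True.
    """
    data, validity = [], []
    for value in values:
        is_null = value is None or (
            nan_as_null and isinstance(value, float) and math.isnan(value)
        )
        validity.append(not is_null)
        # The content of a null slot is unspecified; 0.0 is a placeholder.
        data.append(0.0 if is_null else value)
    return data, validity


def null_count(validity):
    """Count the slots flagged as null."""
    return validity.count(False)


raw = [1.0, float("nan")]
_, strict = to_nullable(raw)                     # NaN kept as a value
_, lenient = to_nullable(raw, nan_as_null=True)  # NaN treated as null

print(null_count(strict), null_count(lenient))
```

The decision lives in the converting application, which is the point being
made here: the format itself attaches no null meaning to NaN.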

In R, NaN and NA are distinct:

https://github.com/apache/arrow/commit/3ab4a0f481211c5d115845519eb9398dc02e2e24#diff-4b43b0aee35624cd95b910189b3dc231

>
>
> Anyhow, it seems worth addressing this gap at the written specification level.
>

What would you suggest? We could add a statement to be explicit that
no special / sentinel values (which includes NaN) are recognized as
null.

- Wes

>
>
> (Apologies if this has been discussed previously-- I've found no searchable 
> mailing list archives under 
> http://mail-archives.apache.org/mod_mbox/arrow-dev/ or 
> https://cwiki.apache.org/confluence/display/ARROW.)
>
>
>
> Thanks,
>
> Rhys