Indeed it's not about sanitizing nulls; it's about how nulls should
interact with boolean (and other) expressions.

For purposes of discussion, I'm naming the current approach of propagating
null "NaN logic" (since all expressions involving NaN evaluate to NaN).

To give some context for this discussion, I'm currently working on support
for filter expressions (ARROW-6243).

As an example of when this would come into play, let there be a dataset
spanning several files. The older files have an IPV4 column while the newer
files have IPV6 as well.
With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields
null for all of the older files since they lack an IPV6 column (regardless
of their IPV4 column) which
seems undesirable.

Could you explain what you mean by "safest"?
Under NaN logic, the Kleene result can be recovered with
(coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false))
Under Kleene logic, the NaN result can be recovered with (case IPV4 is null
or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end)
I don't think we're losing information either way.

I'm not attached to either system but I'd like to understand and document
the rationale behind our choice.

On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou <anto...@python.org> wrote:

>
> IIUC it's not about sanitizing to false.  Ben explained it in more
> detail in private to me, perhaps he want to copy that explanation here ;-)
>
> Regards
>
> Antoine.
>
>
> Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> > hi Ben,
> >
> > My instinct is that always propagating null (at least by default) is
> > the safest choice. Applications can choose to sanitize null to false
> > if that's what they want semantically.
> >
> > - Wes
> >
> > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <ben.kietz...@rstudio.com>
> wrote:
> >>
> >> To my knowledge, there isn't explicit documentation on how null slots
> in an
> >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> >> explicitly an unknown rather than a special value. This yields for
> example
> >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> >> possible values of x. This is also the behavior of Gandiva's boolean
> >> expressions.
> >>
> >> By contrast the boolean kernels implement something closer to the
> behavior
> >> of NaN: `(null AND false) -> null`. I think this is simply an error in
> the
> >> boolean kernels but in any case I think explicit documentation should be
> >> added to prevent future confusion.
> >>
> >> https://issues.apache.org/jira/browse/ARROW-6386
>

Reply via email to