I created the ticket https://issues.apache.org/jira/browse/ARROW-6396,
I think we can offer both.

François


On Thu, Aug 29, 2019 at 5:10 PM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> Indeed it's not about sanitizing nulls; it's about how nulls should
> interact with boolean (and other) expressions.
>
> For purposes of discussion, I'm naming the current approach of propagating
> null "NaN logic" (since all expressions involving NaN evaluate to NaN).
>
> To give some context for this discussion, I'm currently working on support
> for filter expressions (ARROW-6243).
>
> As an example of when this would come into play, let there be a dataset
> spanning several files. The older files have an IPV4 column while the newer
> files have IPV6 as well.
> With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields
> null for all of the older files since they lack an IPV6 column (regardless
> of their IPV4 column) which
> seems undesirable.
>
> Could you explain what you mean by "safest"?
> Under NaN logic, the Kleene result can be recovered with
> (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false))
> Under Kleene logic, the NaN result can be recovered with (case IPV4 is null
> or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end)
> I don't think we're losing information either way.
>
> I'm not attached to either system but I'd like to understand and document
> the rationale behind our choice.
>
> On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > IIUC it's not about sanitizing to false.  Ben explained it in more
> > detail in private to me, perhaps he want to copy that explanation here ;-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> > > hi Ben,
> > >
> > > My instinct is that always propagating null (at least by default) is
> > > the safest choice. Applications can choose to sanitize null to false
> > > if that's what they want semantically.
> > >
> > > - Wes
> > >
> > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <ben.kietz...@rstudio.com>
> > wrote:
> > >>
> > >> To my knowledge, there isn't explicit documentation on how null slots
> > in an
> > >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> > >> explicitly an unknown rather than a special value. This yields for
> > example
> > >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> > >> possible values of x. This is also the behavior of Gandiva's boolean
> > >> expressions.
> > >>
> > >> By contrast the boolean kernels implement something closer to the
> > behavior
> > >> of NaN: `(null AND false) -> null`. I think this is simply an error in
> > the
> > >> boolean kernels but in any case I think explicit documentation should be
> > >> added to prevent future confusion.
> > >>
> > >> https://issues.apache.org/jira/browse/ARROW-6386
> >

Reply via email to