I created the ticket https://issues.apache.org/jira/browse/ARROW-6396, I think we can offer both.
François On Thu, Aug 29, 2019 at 5:10 PM Ben Kietzman <ben.kietz...@rstudio.com> wrote: > > Indeed it's not about sanitizing nulls; it's about how nulls should > interact with boolean (and other) expressions. > > For purposes of discussion, I'm naming the current approach of propagating > null "NaN logic" (since all expressions involving NaN evaluate to NaN). > > To give some context for this discussion, I'm currently working on support > for filter expressions (ARROW-6243). > > As an example of when this would come into play, let there be a dataset > spanning several files. The older files have an IPV4 column while the newer > files have IPV6 as well. > With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields > null for all of the older files since they lack an IPV6 column (regardless > of their IPV4 column) which > seems undesirable. > > Could you explain what you mean by "safest"? > Under NaN logic, the Kleene result can be recovered with > (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false)) > Under Kleene logic, the NaN result can be recovered with (case IPV4 is null > or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end) > I don't think we're losing information either way. > > I'm not attached to either system but I'd like to understand and document > the rationale behind our choice. > > On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou <anto...@python.org> wrote: > > > > > IIUC it's not about sanitizing to false. Ben explained it in more > > detail in private to me, perhaps he want to copy that explanation here ;-) > > > > Regards > > > > Antoine. > > > > > > Le 29/08/2019 à 19:05, Wes McKinney a écrit : > > > hi Ben, > > > > > > My instinct is that always propagating null (at least by default) is > > > the safest choice. Applications can choose to sanitize null to false > > > if that's what they want semantically. > > > > > > - Wes > > > > > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <ben.kietz...@rstudio.com> > > wrote: > > >> > > >> To my knowledge, there isn't explicit documentation on how null slots > > in an > > >> array should be interpreted. SQL uses Kleene logic, wherein a null is > > >> explicitly an unknown rather than a special value. This yields for > > example > > >> `(null AND false) -> false`, since `(x AND false) -> false` for all > > >> possible values of x. This is also the behavior of Gandiva's boolean > > >> expressions. > > >> > > >> By contrast the boolean kernels implement something closer to the > > behavior > > >> of NaN: `(null AND false) -> null`. I think this is simply an error in > > the > > >> boolean kernels but in any case I think explicit documentation should be > > >> added to prevent future confusion. > > >> > > >> https://issues.apache.org/jira/browse/ARROW-6386 > >