Re: [DISCUSS] Ternary logic
I created the ticket https://issues.apache.org/jira/browse/ARROW-6396, I think we can offer both. François On Thu, Aug 29, 2019 at 5:10 PM Ben Kietzman wrote: > > Indeed it's not about sanitizing nulls; it's about how nulls should > interact with boolean (and other) expressions. > > For purposes of discussion, I'm naming the current approach of propagating > null "NaN logic" (since all expressions involving NaN evaluate to NaN). > > To give some context for this discussion, I'm currently working on support > for filter expressions (ARROW-6243). > > As an example of when this would come into play, let there be a dataset > spanning several files. The older files have an IPV4 column while the newer > files have IPV6 as well. > With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields > null for all of the older files since they lack an IPV6 column (regardless > of their IPV4 column) which > seems undesirable. > > Could you explain what you mean by "safest"? > Under NaN logic, the Kleene result can be recovered with > (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false)) > Under Kleene logic, the NaN result can be recovered with (case IPV4 is null > or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end) > I don't think we're losing information either way. > > I'm not attached to either system but I'd like to understand and document > the rationale behind our choice. > > On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou wrote: > > > > > IIUC it's not about sanitizing to false. Ben explained it in more > > detail in private to me, perhaps he want to copy that explanation here ;-) > > > > Regards > > > > Antoine. > > > > > > Le 29/08/2019 à 19:05, Wes McKinney a écrit : > > > hi Ben, > > > > > > My instinct is that always propagating null (at least by default) is > > > the safest choice. Applications can choose to sanitize null to false > > > if that's what they want semantically. > > > > > > - Wes > > > > > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman > > wrote: > > >> > > >> To my knowledge, there isn't explicit documentation on how null slots > > in an > > >> array should be interpreted. SQL uses Kleene logic, wherein a null is > > >> explicitly an unknown rather than a special value. This yields for > > example > > >> `(null AND false) -> false`, since `(x AND false) -> false` for all > > >> possible values of x. This is also the behavior of Gandiva's boolean > > >> expressions. > > >> > > >> By contrast the boolean kernels implement something closer to the > > behavior > > >> of NaN: `(null AND false) -> null`. I think this is simply an error in > > the > > >> boolean kernels but in any case I think explicit documentation should be > > >> added to prevent future confusion. > > >> > > >> https://issues.apache.org/jira/browse/ARROW-6386 > >
Re: [DISCUSS] Ternary logic
Indeed it's not about sanitizing nulls; it's about how nulls should interact with boolean (and other) expressions. For purposes of discussion, I'm naming the current approach of propagating null "NaN logic" (since all expressions involving NaN evaluate to NaN). To give some context for this discussion, I'm currently working on support for filter expressions (ARROW-6243). As an example of when this would come into play, let there be a dataset spanning several files. The older files have an IPV4 column while the newer files have IPV6 as well. With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields null for all of the older files since they lack an IPV6 column (regardless of their IPV4 column) which seems undesirable. Could you explain what you mean by "safest"? Under NaN logic, the Kleene result can be recovered with (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false)) Under Kleene logic, the NaN result can be recovered with (case IPV4 is null or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end) I don't think we're losing information either way. I'm not attached to either system but I'd like to understand and document the rationale behind our choice. On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou wrote: > > IIUC it's not about sanitizing to false. Ben explained it in more > detail in private to me, perhaps he want to copy that explanation here ;-) > > Regards > > Antoine. > > > Le 29/08/2019 à 19:05, Wes McKinney a écrit : > > hi Ben, > > > > My instinct is that always propagating null (at least by default) is > > the safest choice. Applications can choose to sanitize null to false > > if that's what they want semantically. > > > > - Wes > > > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman > wrote: > >> > >> To my knowledge, there isn't explicit documentation on how null slots > in an > >> array should be interpreted. SQL uses Kleene logic, wherein a null is > >> explicitly an unknown rather than a special value. This yields for > example > >> `(null AND false) -> false`, since `(x AND false) -> false` for all > >> possible values of x. This is also the behavior of Gandiva's boolean > >> expressions. > >> > >> By contrast the boolean kernels implement something closer to the > behavior > >> of NaN: `(null AND false) -> null`. I think this is simply an error in > the > >> boolean kernels but in any case I think explicit documentation should be > >> added to prevent future confusion. > >> > >> https://issues.apache.org/jira/browse/ARROW-6386 >
Re: [DISCUSS] Ternary logic
IIUC it's not about sanitizing to false. Ben explained it in more detail in private to me, perhaps he want to copy that explanation here ;-) Regards Antoine. Le 29/08/2019 à 19:05, Wes McKinney a écrit : > hi Ben, > > My instinct is that always propagating null (at least by default) is > the safest choice. Applications can choose to sanitize null to false > if that's what they want semantically. > > - Wes > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman wrote: >> >> To my knowledge, there isn't explicit documentation on how null slots in an >> array should be interpreted. SQL uses Kleene logic, wherein a null is >> explicitly an unknown rather than a special value. This yields for example >> `(null AND false) -> false`, since `(x AND false) -> false` for all >> possible values of x. This is also the behavior of Gandiva's boolean >> expressions. >> >> By contrast the boolean kernels implement something closer to the behavior >> of NaN: `(null AND false) -> null`. I think this is simply an error in the >> boolean kernels but in any case I think explicit documentation should be >> added to prevent future confusion. >> >> https://issues.apache.org/jira/browse/ARROW-6386
Re: [DISCUSS] Ternary logic
hi Ben, My instinct is that always propagating null (at least by default) is the safest choice. Applications can choose to sanitize null to false if that's what they want semantically. - Wes On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman wrote: > > To my knowledge, there isn't explicit documentation on how null slots in an > array should be interpreted. SQL uses Kleene logic, wherein a null is > explicitly an unknown rather than a special value. This yields for example > `(null AND false) -> false`, since `(x AND false) -> false` for all > possible values of x. This is also the behavior of Gandiva's boolean > expressions. > > By contrast the boolean kernels implement something closer to the behavior > of NaN: `(null AND false) -> null`. I think this is simply an error in the > boolean kernels but in any case I think explicit documentation should be > added to prevent future confusion. > > https://issues.apache.org/jira/browse/ARROW-6386
[DISCUSS] Ternary logic
To my knowledge, there isn't explicit documentation on how null slots in an array should be interpreted. SQL uses Kleene logic, wherein a null is explicitly an unknown rather than a special value. This yields for example `(null AND false) -> false`, since `(x AND false) -> false` for all possible values of x. This is also the behavior of Gandiva's boolean expressions. By contrast the boolean kernels implement something closer to the behavior of NaN: `(null AND false) -> null`. I think this is simply an error in the boolean kernels but in any case I think explicit documentation should be added to prevent future confusion. https://issues.apache.org/jira/browse/ARROW-6386