In the SQL world, the second point isn't right. It is still the case that
not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
valued logic) in SQL is just strange relative to programming languages:

   - null *=* "x" -> null
   - null *is distinct from* "x" -> true
   - *not*(null) -> null
   - null *and* true -> null
   - null *or* false -> null

We absolutely need a null safe equals function (<=> or "is distinct from")
also, which is what we currently have as equals. So we really need to have
the logical operators also treat null as a special value.

.. Owen



On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> It would be nice to avoid the problem by changing the semantics of
> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
> reasons.
>
> First, I think that API users creating expressions directly expect the
> current behavior. It would be surprising to a user if a notEqual
> expression didn’t return nulls. People integrating Iceberg in SQL engines
> will be more aware of SQL semantics, especially if the behavior we choose
> is documented. I think for API uses, the current behavior is better.
>
> Second, some evaluations require expressions to be rewritten without not,
> so we have a utility that pushes not down an expression tree to the leaf
> predicates. Rewriting not(equal("col", "x")) will result in notEqual("col",
> "x"). If we were to change the semantics of notEqual, then this rewrite
> would no longer be valid. If col is null then it is not equal to x, and
> negating that result is true. But notEqual would give a different answer
> so we can’t rewrite it.
>
> We can work around the rewrite problem by adding Expressions.sqlNotEqual
> method for engines to call that has the SQL semantics by returning 
> and(notEqual("col",
> "x"), notNull("col")).
>
> For pushdown, we should add tests for these cases and rewrite expressions
> to account for the difference. Iceberg should push notEqual("col", "x")
> to ORC as SQL (col != 'x' or col is null). Presto can similarly translate col
> != 'x' to and(notEqual("col", "x"), notNull("col").
>
> rb
>
> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> I think that we should follow the SQL semantics to prevent surprises when
>> SQL engines integrate with Iceberg.
>>
>> .. Owen
>>
>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>> shardulsmaha...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I noticed that Iceberg's predicates are not compatible with SQL
>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>> operands of a scalar comparison predicate is NULL, then the resultant truth
>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` will
>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can return
>>> more rows than required when an engine pushes down these predicates,
>>> however the engines will filter out these additional rows, so everything
>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>> on Iceberg's expression evaluators for filtering will return the additional
>>> rows. Should we change the behavior of Iceberg expressions to be more
>>> SQL-like or should we keep this behavior and document the differences when
>>> compared with SQL?
>>>
>>> This also has some implications on predicate pushdown e.g. ORC follows
>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>> as it does not return NULLs contrary to what Iceberg expects.
>>>
>>> Thanks,
>>> Shardul
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Reply via email to