I think that we should follow the SQL semantics to prevent surprises when SQL engines integrate with Iceberg.
.. Owen On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <shardulsmaha...@gmail.com> wrote: > Hi all, > > I noticed that Iceberg's predicates are not compatible with SQL predicates > when it comes to handling NULL values. In SQL, if any of the operands of a > scalar comparison predicate is NULL, then the resultant truth value of the > predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a NULL in SQL and > not FALSE. If such predicates are used as filters, the resultant output > will be different for Iceberg v/s SQL. e.g. `.filter(notEqual(column, > 'x'))` in Iceberg will return rows excluding ‘x’ but including NULL. The > same thing in Presto SQL `WHERE column != 'x'` will return rows excluding > both ‘x’ and NULL. So essentially, Iceberg can return more rows than > required when an engine pushes down these predicates, however the engines > will filter out these additional rows, so everything seems good. But > modules like iceberg-data and iceberg-mr which rely solely on Iceberg's > expression evaluators for filtering will return the additional rows. Should > we change the behavior of Iceberg expressions to be more SQL-like or should > we keep this behavior and document the differences when compared with SQL? > > This also has some implications on predicate pushdown e.g. ORC follows SQL > semantics and if we try to push down Iceberg predicates, simply converting > Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient as it does > not return NULLs contrary to what Iceberg expects. > > Thanks, > Shardul >