I think that we should follow the SQL semantics to prevent surprises when
SQL engines integrate with Iceberg.

.. Owen

On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <shardulsmaha...@gmail.com>
wrote:

> Hi all,
>
> I noticed that Iceberg's predicates are not compatible with SQL predicates
> when it comes to handling NULL values. In SQL, if any of the operands of a
> scalar comparison predicate is NULL, then the resultant truth value of the
> predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a NULL in SQL and
> not FALSE. If such predicates are used as filters, the resultant output
> will be different for Iceberg v/s SQL. e.g. `.filter(notEqual(column,
> 'x'))` in Iceberg will return rows excluding ‘x’ but including NULL. The
> same thing in Presto SQL `WHERE column != 'x'` will return rows excluding
> both ‘x’ and NULL. So essentially, Iceberg can return more rows than
> required when an engine pushes down these predicates, however the engines
> will filter out these additional rows, so everything seems good. But
> modules like iceberg-data and iceberg-mr which rely solely on Iceberg's
> expression evaluators for filtering will return the additional rows. Should
> we change the behavior of Iceberg expressions to be more SQL-like or should
> we keep this behavior and document the differences when compared with SQL?
>
> This also has some implications on predicate pushdown e.g. ORC follows SQL
> semantics and if we try to push down Iceberg predicates, simply converting
> Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient as it does
> not return NULLs contrary to what Iceberg expects.
>
> Thanks,
> Shardul
>

Reply via email to