Re: SQL compatibility of Iceberg Expressions

Owen O'Malley Fri, 18 Sep 2020 16:38:57 -0700

No, you can translate these expressions, but you have to evaluate the
entire expression. For example:


"col1 = 'x' and col2 in (1,2)" becomes col1 = 'x' and col2 in (1,2)
"not(col1 = 'x' and col2 in (1,2))" becomes (col1 != 'x' or col2 not in
(1,2)) and col1 is not null and col2 is not null

Furthermore, ORC does (and Parquet should) already use these semantics.
Therefore, you'll end up translating on both sides:

hive        -\
presto      --+-> Iceberg -+--> ORC
spark sql   -/              \-> Parquet

Since the non-sql use cases have fewer pushdown predicates, having a
translation on that side seems less error-prone.

.. Owen


On Fri, Sep 18, 2020 at 10:54 PM Ryan Blue <[email protected]>
wrote:

> Are you saying that we can't fix this by rewriting expressions to
> translate from SQL to more natural semantics?
>
> On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <[email protected]>
> wrote:
>
>> In the SQL world, the second point isn't right. It is still the case that
>> not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, three
>> valued logic) in SQL is just strange relative to programming languages:
>>
>>    - null *=* "x" -> null
>>    - null *is distinct from* "x" -> true
>>    - *not*(null) -> null
>>    - null *and* true -> null
>>    - null *or* false -> null
>>
>> We absolutely need a null safe equals function (<=> or "is distinct
>> from") also, which is what we currently have as equals. So we really need
>> to have the logical operators also treat null as a special value.
>>
>> .. Owen
>>
>>
>>
>> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <[email protected]>
>> wrote:
>>
>>> It would be nice to avoid the problem by changing the semantics of
>>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main
>>> reasons.
>>>
>>> First, I think that API users creating expressions directly expect the
>>> current behavior. It would be surprising to a user if a notEqual
>>> expression didn’t return nulls. People integrating Iceberg in SQL engines
>>> will be more aware of SQL semantics, especially if the behavior we choose
>>> is documented. I think for API uses, the current behavior is better.
>>>
>>> Second, some evaluations require expressions to be rewritten without not,
>>> so we have a utility that pushes not down an expression tree to the
>>> leaf predicates. Rewriting not(equal("col", "x")) will result in 
>>> notEqual("col",
>>> "x"). If we were to change the semantics of notEqual, then this rewrite
>>> would no longer be valid. If col is null then it is not equal to x, and
>>> negating that result is true. But notEqual would give a different
>>> answer so we can’t rewrite it.
>>>
>>> We can work around the rewrite problem by adding Expressions.sqlNotEqual
>>> method for engines to call that has the SQL semantics by returning 
>>> and(notEqual("col",
>>> "x"), notNull("col")).
>>>
>>> For pushdown, we should add tests for these cases and rewrite
>>> expressions to account for the difference. Iceberg should push 
>>> notEqual("col",
>>> "x") to ORC as SQL (col != 'x' or col is null). Presto can similarly
>>> translate col != 'x' to and(notEqual("col", "x"), notNull("col").
>>>
>>> rb
>>>
>>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <[email protected]>
>>> wrote:
>>>
>>>> I think that we should follow the SQL semantics to prevent surprises
>>>> when SQL engines integrate with Iceberg.
>>>>
>>>> .. Owen
>>>>
>>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I noticed that Iceberg's predicates are not compatible with SQL
>>>>> predicates when it comes to handling NULL values. In SQL, if any of the
>>>>> operands of a scalar comparison predicate is NULL, then the resultant 
>>>>> truth
>>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a
>>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the
>>>>> resultant output will be different for Iceberg v/s SQL. e.g.
>>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding ‘x’
>>>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` 
>>>>> will
>>>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can 
>>>>> return
>>>>> more rows than required when an engine pushes down these predicates,
>>>>> however the engines will filter out these additional rows, so everything
>>>>> seems good. But modules like iceberg-data and iceberg-mr which rely solely
>>>>> on Iceberg's expression evaluators for filtering will return the 
>>>>> additional
>>>>> rows. Should we change the behavior of Iceberg expressions to be more
>>>>> SQL-like or should we keep this behavior and document the differences when
>>>>> compared with SQL?
>>>>>
>>>>> This also has some implications on predicate pushdown e.g. ORC follows
>>>>> SQL semantics and if we try to push down Iceberg predicates, simply
>>>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be insufficient
>>>>> as it does not return NULLs contrary to what Iceberg expects.
>>>>>
>>>>> Thanks,
>>>>> Shardul
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: SQL compatibility of Iceberg Expressions

Reply via email to