If we can translate to an equivalent expression, then I think it is better to do that. Iceberg expressions don't support SQL semantics and it would be a big project to convert them. Afterward, we would just have more complicated expressions that are harder to maintain because SQL semantics are counter-intuitive. If we can translate into expressions that are easier to maintain and have understandable semantics, then I think we will be better off in the long term. And we won't have to rewrite so much and find a work around for the notEqual <-> not(equal) rewrite problem.
We should still make it easy to choose the SQL semantics with new factory methods. I think we will need `sqlNotNull`, `sqlIn`, and `isDistinctFrom`. Any others? For ORC, we will want to rewrite. But for Parquet we don't currently push any filters down through the Parquet API. Instead we have our own filter implementations in Iceberg that use Iceberg expressions directly. On Fri, Sep 18, 2020 at 4:38 PM Owen O'Malley <owen.omal...@gmail.com> wrote: > No, you can translate these expressions, but you have to evaluate the > entire expression. For example: > > "col1 = 'x' and col2 in (1,2)" becomes col1 = 'x' and col2 in (1,2) > "not(col1 = 'x' and col2 in (1,2))" becomes (col1 != 'x' or col2 not in > (1,2)) and col1 is not null and col2 is not null > > Furthermore, ORC does (and Parquet should) already use these semantics. > Therefore, you'll end up translating on both sides: > > hive -\ > presto --+-> Iceberg -+--> ORC > spark sql -/ \-> Parquet > > Since the non-sql use cases have fewer pushdown predicates, having a > translation on that side seems less error-prone. > > .. Owen > > > On Fri, Sep 18, 2020 at 10:54 PM Ryan Blue <rb...@netflix.com.invalid> > wrote: > >> Are you saying that we can't fix this by rewriting expressions to >> translate from SQL to more natural semantics? >> >> On Fri, Sep 18, 2020 at 3:28 PM Owen O'Malley <owen.omal...@gmail.com> >> wrote: >> >>> In the SQL world, the second point isn't right. It is still the case >>> that not(equal("col", "x")) is notEqual("col", "x"). Boolean logic (well, >>> three valued logic) in SQL is just strange relative to programming >>> languages: >>> >>> - null *=* "x" -> null >>> - null *is distinct from* "x" -> true >>> - *not*(null) -> null >>> - null *and* true -> null >>> - null *or* false -> null >>> >>> We absolutely need a null safe equals function (<=> or "is distinct >>> from") also, which is what we currently have as equals. So we really need >>> to have the logical operators also treat null as a special value. >>> >>> .. Owen >>> >>> >>> >>> On Fri, Sep 18, 2020 at 5:54 PM Ryan Blue <rb...@netflix.com.invalid> >>> wrote: >>> >>>> It would be nice to avoid the problem by changing the semantics of >>>> Iceberg’s notNull, but I don’t think that’s a good idea for 2 main >>>> reasons. >>>> >>>> First, I think that API users creating expressions directly expect the >>>> current behavior. It would be surprising to a user if a notEqual >>>> expression didn’t return nulls. People integrating Iceberg in SQL engines >>>> will be more aware of SQL semantics, especially if the behavior we choose >>>> is documented. I think for API uses, the current behavior is better. >>>> >>>> Second, some evaluations require expressions to be rewritten without >>>> not, so we have a utility that pushes not down an expression tree to >>>> the leaf predicates. Rewriting not(equal("col", "x")) will result in >>>> notEqual("col", >>>> "x"). If we were to change the semantics of notEqual, then this >>>> rewrite would no longer be valid. If col is null then it is not equal >>>> to x, and negating that result is true. But notEqual would give a >>>> different answer so we can’t rewrite it. >>>> >>>> We can work around the rewrite problem by adding >>>> Expressions.sqlNotEqual method for engines to call that has the SQL >>>> semantics by returning and(notEqual("col", "x"), notNull("col")). >>>> >>>> For pushdown, we should add tests for these cases and rewrite >>>> expressions to account for the difference. Iceberg should push >>>> notEqual("col", >>>> "x") to ORC as SQL (col != 'x' or col is null). Presto can similarly >>>> translate col != 'x' to and(notEqual("col", "x"), notNull("col"). >>>> >>>> rb >>>> >>>> On Fri, Sep 18, 2020 at 9:29 AM Owen O'Malley <owen.omal...@gmail.com> >>>> wrote: >>>> >>>>> I think that we should follow the SQL semantics to prevent surprises >>>>> when SQL engines integrate with Iceberg. >>>>> >>>>> .. Owen >>>>> >>>>> On Thu, Sep 17, 2020 at 9:08 PM Shardul Mahadik < >>>>> shardulsmaha...@gmail.com> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> I noticed that Iceberg's predicates are not compatible with SQL >>>>>> predicates when it comes to handling NULL values. In SQL, if any of the >>>>>> operands of a scalar comparison predicate is NULL, then the resultant >>>>>> truth >>>>>> value of the predicate is UNKNOWN. e.g. `SELECT NULL != 1` will return a >>>>>> NULL in SQL and not FALSE. If such predicates are used as filters, the >>>>>> resultant output will be different for Iceberg v/s SQL. e.g. >>>>>> `.filter(notEqual(column, 'x'))` in Iceberg will return rows excluding >>>>>> ‘x’ >>>>>> but including NULL. The same thing in Presto SQL `WHERE column != 'x'` >>>>>> will >>>>>> return rows excluding both ‘x’ and NULL. So essentially, Iceberg can >>>>>> return >>>>>> more rows than required when an engine pushes down these predicates, >>>>>> however the engines will filter out these additional rows, so everything >>>>>> seems good. But modules like iceberg-data and iceberg-mr which rely >>>>>> solely >>>>>> on Iceberg's expression evaluators for filtering will return the >>>>>> additional >>>>>> rows. Should we change the behavior of Iceberg expressions to be more >>>>>> SQL-like or should we keep this behavior and document the differences >>>>>> when >>>>>> compared with SQL? >>>>>> >>>>>> This also has some implications on predicate pushdown e.g. ORC >>>>>> follows SQL semantics and if we try to push down Iceberg predicates, >>>>>> simply >>>>>> converting Iceberg's 'NOT EQUAL' to ORC's 'NOT EQUAL' will be >>>>>> insufficient >>>>>> as it does not return NULLs contrary to what Iceberg expects. >>>>>> >>>>>> Thanks, >>>>>> Shardul >>>>>> >>>>> >>>> >>>> -- >>>> Ryan Blue >>>> Software Engineer >>>> Netflix >>>> >>> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > -- Ryan Blue Software Engineer Netflix