Hi all,

Recently we started an effort to achieve feature parity between Spark and
PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764

This has gone very well. We've added many missing features (parser rules,
built-in functions, etc.) to Spark, and also corrected several
inappropriate Spark behaviors to follow the SQL standard and PostgreSQL.
Many thanks to all the people who have contributed to it!

There are several cases to consider when adding a PostgreSQL feature:
1. Spark doesn't have this feature: just add it.
2. Spark has this feature, but the behavior is different:
    2.1 Spark's behavior doesn't make sense: change it to follow the SQL
standard and PostgreSQL, with a legacy config to restore the old behavior.
    2.2 Spark's behavior makes sense but violates the SQL standard: change
the behavior to follow the SQL standard and PostgreSQL when ANSI mode is
enabled (default: false).
    2.3 Spark's behavior makes sense and doesn't violate the SQL standard:
add the PostgreSQL behavior under the PostgreSQL dialect (the default is
the Spark native dialect); see the sketch after this list.
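
A minimal sketch of what the two opt-in switches (cases 2.2 and 2.3) look
like from the user side, assuming a SparkSession named `spark`. The config
key names here are illustrative and may not match the final ones:

    // Case 2.2: opt in to SQL-standard behavior via ANSI mode
    // (off by default).
    spark.conf.set("spark.sql.ansi.enabled", "true")

    // Case 2.3: opt in to PostgreSQL-specific behavior via the dialect
    // config (the default is the Spark native dialect).
    spark.conf.set("spark.sql.dialect", "PostgreSQL")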

The PostgreSQL dialect itself is a good idea. It can help users migrate
PostgreSQL workloads to Spark. Other databases have this strategy too. For
example, DB2 provides an Oracle dialect
<https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>.

However, there are so many differences between Spark and PostgreSQL,
including SQL parsing, type coercion, function/operator behavior, data
types, etc. I'm afraid that we may spend a lot of effort on it and make
the Spark codebase pretty complicated, yet still be unable to provide a
usable PostgreSQL dialect.

Furthermore, it's not clear to me how many users actually need to migrate
PostgreSQL workloads. I think it's much more important to make Spark
ANSI-compliant first, which doesn't require nearly as much work.

Recently I've seen multiple PRs adding PostgreSQL cast functions, while our
own cast function is not ANSI-compliant yet. This makes me think that we
should do something to properly prioritize ANSI mode over other dialects.
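
To make the gap concrete, here is a sketch of the kind of behavior I mean
(assuming a SparkSession named `spark`):

    // Spark's default cast silently returns NULL for malformed input:
    spark.sql("SELECT CAST('abc' AS INT)").show()  // NULL

    // The SQL standard requires a runtime error for this cast, which is
    // what an ANSI-compliant cast would raise instead.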

Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it from
the codebase before it's too late. Currently we only have 3 features under
the PostgreSQL dialect (illustrated in the sketch after the list):
1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
allowed as true strings.
2. `date - date` returns an interval in Spark (the SQL standard behavior),
but returns an int in PostgreSQL.
3. `int / int` returns a double in Spark, but returns an int in PostgreSQL.
(the SQL standard doesn't specify this one)
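
For reference, a sketch of how the three behaviors differ, with the
Spark-native results shown (again assuming a SparkSession named `spark`):

    // 1. String-to-boolean cast: Spark native returns NULL for prefix
    //    forms like 'tru'; the PostgreSQL dialect accepts them as true.
    spark.sql("SELECT CAST('tru' AS BOOLEAN)").show()  // NULL

    // 2. date - date: Spark returns an interval (SQL standard behavior);
    //    PostgreSQL returns an integer number of days.
    spark.sql("SELECT DATE'2020-01-02' - DATE'2020-01-01'").show()  // 1 day interval

    // 3. int / int: Spark returns a double; PostgreSQL truncates to int.
    spark.sql("SELECT 7 / 2").show()  // 3.5 in Spark; 3 in PostgreSQL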

We should still add PostgreSQL features that Spark doesn't have, or where
Spark's behavior violates the SQL standard. But for the others, let's just
update the answer files of the PostgreSQL tests.

Any comments are welcome!

Thanks,
Wenchen
