I think the general guideline is to promote Spark's own CREATE TABLE syntax
instead of the Hive one. Previously the two parser rules were mutually
exclusive because the native syntax required the USING clause, while the
Hive syntax makes the ROW FORMAT and STORED AS clauses optional.

It's a good move to make the USING clause optional, which makes it easier
to write the native CREATE TABLE syntax. Unfortunately, it leads to some
conflicts with the Hive CREATE TABLE syntax, but I don't see a serious
problem here. If a user just writes CREATE TABLE without USING, ROW FORMAT,
or STORED AS, does it matter which kind of table we create? Internally the
parser rules conflict and the native syntax wins because its rule comes
first, but the user-facing behavior looks fine.
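
To make the rule-order point concrete, here is a toy sketch in Python. It is
not the ANTLR grammar: it only checks which clauses appear in a statement,
following the description in this thread. The helper names (has_clause,
rule_in_2_4, rule_in_3_0) are made up for illustration, and statements that
mix USING with Hive-only clauses are not modeled.

```python
def has_clause(stmt: str, clause: str) -> bool:
    """Return True if the clause keyword(s) appear in the statement (toy check)."""
    normalized = " ".join(stmt.upper().split())
    return f" {clause} " in f" {normalized} "


def rule_in_2_4(stmt: str) -> str:
    # branch-2.4 as described in the thread: USING selects the native rule,
    # everything else creates a Hive table.
    return "native" if has_clause(stmt, "USING") else "hive"


def rule_in_3_0(stmt: str) -> str:
    # branch-3.0 as described in the thread: the native rule is tried first
    # and USING is optional, so only ROW FORMAT / STORED AS push a statement
    # into the second (Hive) rule.
    hive_only = has_clause(stmt, "ROW FORMAT") or has_clause(stmt, "STORED AS")
    return "hive" if hive_only else "native"


statements = [
    "CREATE TABLE t (id INT)",
    "CREATE TABLE t (id INT) USING parquet",
    "CREATE TABLE t (id INT) STORED AS orc",
]
for stmt in statements:
    print(f"{stmt!r}: 2.4 -> {rule_in_2_4(stmt)}, 3.0 -> {rule_in_3_0(stmt)}")
```

In this model only the bare CREATE TABLE changes rules between the two
branches, which is the case I am asking about above.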

CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0. Shall
we simply remove EXTERNAL from the native CREATE TABLE syntax? Then CREATE
EXTERNAL TABLE would create a Hive table, as in 2.4.
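
A rough sketch of what that change would do, again as a toy Python model
rather than the real grammar. The function outcome_in_3_0 and its
native_accepts_external flag are hypothetical names; the model assumes the
native rule currently accepts EXTERNAL and then rejects the statement, and it
does not cover statements that combine EXTERNAL with USING.

```python
def outcome_in_3_0(stmt: str, native_accepts_external: bool = True) -> str:
    """Toy model: which rule handles the statement, or 'error' if rejected."""
    s = " ".join(stmt.upper().split())
    external = s.startswith("CREATE EXTERNAL TABLE ")
    hive_only = " ROW FORMAT " in s or " STORED AS " in s

    if hive_only:
        # Hive-only clauses cannot match the native rule, so the Hive rule wins.
        return "hive"
    if external and native_accepts_external:
        # Assumed current 3.0 behavior: the first (native) rule matches
        # EXTERNAL and then reports an error.
        return "error"
    if external:
        # Proposal: the native rule no longer matches EXTERNAL, so the
        # statement falls through to the Hive rule, as in 2.4.
        return "hive"
    return "native"


stmt = "CREATE EXTERNAL TABLE t (id INT) LOCATION '/tmp/t'"
print("current 3.0:", outcome_in_3_0(stmt))
print("EXTERNAL removed from native rule:",
      outcome_in_3_0(stmt, native_accepts_external=False))
```

Statements without EXTERNAL are unaffected by the flag in this model.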

On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Hi devs,
>
> I'd like to start a discussion and hear opinions on resolving the
> ambiguity between the two "create table" parser rules introduced by
> SPARK-30098 [1].
>
> Previously, the "create table" parser rules were clearly distinguished by
> "USING provider", which was intuitive and deterministic: a DDL query
> created a "Hive" table unless "USING provider" was specified. (Please
> refer to the parser rule in branch-2.4 [2].)
>
> After SPARK-30098, the "create table" parser rules became ambiguous (please
> refer to the parser rule in branch-3.0 [3]): the only factors that
> differentiate the two rules are "ROW FORMAT" and "STORED AS", both of which
> are defined as "optional". The outcome now relies on the "order" of the
> parser rules, which end users have no way to reason about and which is
> very unintuitive.
>
> Furthermore, the undocumented EXTERNAL rule (added to the first rule to
> provide a better message) brought more confusion (I've described an
> existing query that it breaks in SPARK-30436 [4]).
>
> Personally, I'd like to see the two rules be mutually exclusive, instead of
> trying to document the difference and telling end users to be careful with
> their queries. I see two ways to make the rules mutually exclusive:
>
> 1. Add an identifier to the create Hive table rule, like `CREATE ... "HIVE"
> TABLE ...`.
>
> pros. This is the simplest way to distinguish between the two rules.
> cons. This would require end users to change their queries if they intend
> to create a Hive table. (Given that we will also provide a legacy option,
> I feel this is acceptable.)
>
> 2. Make "ROW FORMAT" or "STORED AS" mandatory.
>
> pros. Less invasive for existing queries.
> cons. Less intuitive, because these clauses have been optional and would
> become mandatory for a query to fall into the second rule.
>
> I'd like to hear everyone's opinions; better ideas are welcome!
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> https://issues.apache.org/jira/browse/SPARK-30098
> 2.
> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
> 3.
> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
> 4. https://issues.apache.org/jira/browse/SPARK-30436
>
>
