A small correction: the example I provided for "vice versa" is not actually a vice versa case. It's the same case (intended to use rule 2, which is not the default) but with a different result.
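For reference, the statement in question looks like this (a sketch - the
table name, columns and path are made up; the behavior is as described in
the mails below):

  -- Intended to bind to rule 2 (Hive syntax) and worked in 2.4; in 3.0 it
  -- is captured by rule 1 instead, and fails with "CREATE EXTERNAL TABLE
  -- is not supported" rather than creating a different table:
  CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING)
  LOCATION '/user/hive/warehouse/ext_logs';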
On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <[email protected]> wrote:

> My concern is that although we think of the change as simply marking
> "USING provider" optional in rule 1, in reality the change most likely
> swaps the default rule for CREATE TABLE from "rule 2" to "rule 1" (that
> would be the happy case for the migration doc, if the swap happens as
> intended), and there are still a couple of things which make a query fall
> into rule 2, which is non-trivial to reason about and not easy to explain.
>
> I only mentioned ROW FORMAT and STORED AS as the cases that fall into
> rule 2, to simplify the problem statement, but they're not the only ones -
> using col_name1:col_type1 makes the query fall into rule 2 regardless of
> any other properties.
>
> Why does it matter? In Spark 2.x, if end users wanted to use rule 1
> (which was not the default) and specify the parameters only available in
> rule 1, the query clearly required "USING provider" - the parser would
> throw an error on any mistake in the parameters. End users could state
> their intention clearly as to which rule the query should be bound to. If
> the query failed to bind to the rule as intended, it was simply denied.
>
> In Spark 3.x, the parser may not help in the case where end users intend
> to use rule 2 (which is not the default) but make a mistake in the
> parameters - the query could just "silently" bind to rule 1 and may even
> execute without any error. Vice versa happens too, but in an odd way -
> e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message
> "CREATE EXTERNAL TABLE is not supported".
>
> It's deterministic for end users only if they fully understand the
> difference between the two rules and also understand how Spark applies
> them to make a query fall into one of them. I'm not sure how we can make
> things clear by improving documentation alone, but if the approach is to
> explain the difference between the rules and give tips on binding a query
> to a specific rule, the same could be applied to the parser rules to
> address the root cause.
>
> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <[email protected]> wrote:
>
>> Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both
>> the native and the Hive syntax. I'm fine with some changes to make it
>> less confusing, as long as the user-facing behavior doesn't change. For
>> example, define "ROW FORMAT" or "STORED AS" as mandatory only if the
>> legacy config is false.
>>
>> I still don't get your point about what the real problem is for end
>> users. There is no ambiguity, as the behavior is deterministic, although
>> we rely on optional fields and rule order, which is bad. It's hard to
>> document, but I don't think that's a big problem for end users.
>>
>> As for the legacy config, it does make the implementation more
>> complicated, but it's invisible to most end users (we don't document it)
>> and can be super useful to users who want their queries to keep working
>> in 3.0 without rewriting.
>>
>> If your only concern is documentation, I totally agree that we should
>> improve it.
>>
>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Thanks for sharing your view.
>>>
>>> I agree with you that it's good for Spark to promote Spark's own CREATE
>>> TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax
>>> unchanged - it's said to be there for "convenience", but I'm not sure I
>>> can agree with that.
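>>>
>>> For example (a minimal sketch assuming default configs - the table names
>>> are made up, and this is how I read the parser rules, not tested output):
>>>
>>>   -- Matches rule 1 (native); provider comes from spark.sql.sources.default:
>>>   CREATE TABLE t1 (a INT);
>>>
>>>   -- The colon after the column name is only valid in the Hive syntax, so
>>>   -- this nearly identical statement silently falls into rule 2 (Hive):
>>>   CREATE TABLE t2 (a: INT);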
>>>
>>> I'll quote my comments in SPARK-31136 here again to make the problem
>>> statement clearer:
>>>
>>> I think the parser implementation around CREATE TABLE brings ambiguity
>>> which is not documented anywhere. It wasn't ambiguous before because we
>>> forced users to specify "USING provider" if it's not a Hive table. Now
>>> it's either the default provider or Hive according to which options are
>>> provided, which seems non-trivial to reason about. (End users would
>>> never know, as it comes entirely from the parser rules.)
>>>
>>> I see this as an issue of "not breaking old behavior". The parser rules
>>> get pretty complicated due to supporting the legacy config. Not breaking
>>> anything will make us stuck eventually.
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>
>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the
>>> value of spark.sql.sources.default as its provider. In Spark version 2.4
>>> and earlier, it was hive. To restore the behavior before Spark 3.0, you
>>> can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>
>>> This is not true if "ROW FORMAT" / "STORED AS" are provided, and we
>>> don't describe that anywhere.
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>
>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>   [ USING data_source ]
>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>     INTO num_buckets BUCKETS ]
>>>   [ LOCATION path ]
>>>   [ COMMENT table_comment ]
>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>   [ AS select_statement ]
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>
>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>   [ COMMENT table_comment ]
>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>     | ( col_name1, col_name2, ... ) ]
>>>   [ ROW FORMAT row_format ]
>>>   [ STORED AS file_format ]
>>>   [ LOCATION path ]
>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>   [ AS select_statement ]
>>>
>>> At a minimum we should describe that the parser will try to match the
>>> first rule (CREATE TABLE ~ USING data_source) and fall back to the
>>> second one; and even if we describe this, it's not intuitive to reason
>>> about which rule a DDL query will fall into. As I commented earlier,
>>> "ROW FORMAT" and "STORED AS" are the options which make a DDL query fall
>>> into the second rule, but they're documented as "optional", so the
>>> gotcha is hard to catch.
>>>
>>> Furthermore, while we document the syntax as above, in reality we allow
>>> "EXTERNAL" in the first rule (and throw an error), which ends up
>>> breaking existing "CREATE EXTERNAL TABLE ~ LOCATION" queries. Even if we
>>> add "USING hive", the parser throws an error; it now requires "ROW
>>> FORMAT" or "STORED AS".
>>>
>>> Simply put, do we really think end users should stop and match their
>>> query against the parser rules (or the rule order, when we explain it in
>>> the doc) by themselves to understand which provider the table will use?
>>> I'm sorry, but I think we are making a bad assumption about end users,
>>> which is a serious problem.
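>>>
>>> To make the EXTERNAL breakage above concrete (a sketch - the table name,
>>> column, and path are illustrative):
>>>
>>>   -- Worked in 2.4 (Hive rule); in 3.0 it is captured by the first rule,
>>>   -- which accepts EXTERNAL only to throw an error:
>>>   CREATE EXTERNAL TABLE t (a INT) LOCATION '/tmp/t';
>>>
>>>   -- Adding USING hive does not help; the parser still throws an error:
>>>   CREATE EXTERNAL TABLE t (a INT) USING hive LOCATION '/tmp/t';
>>>
>>>   -- Only ROW FORMAT or STORED AS moves the query into the second rule:
>>>   CREATE EXTERNAL TABLE t (a INT) STORED AS parquet LOCATION '/tmp/t';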
>>>
>>> If we really want to promote Spark's own CREATE TABLE syntax, does it
>>> really matter if we treat the Hive CREATE TABLE as the "exceptional" one
>>> and isolate the two from each other? What's the point of providing a
>>> legacy config to go back to the old behavior if we fear breaking
>>> something while making things better or clearer? If we do think the
>>> table provider is important (hence the change was made), is it still a
>>> trivial problem that the provider is decided by specifying "optional"
>>> fields?
>>>
>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>> syntax instead of the Hive one. Previously these two rules were
>>>> mutually exclusive, because the native syntax required the USING clause
>>>> while the Hive syntax made the ROW FORMAT and STORED AS clauses
>>>> optional.
>>>>
>>>> It's a good move to make the USING clause optional, which makes it
>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads
>>>> to some conflicts with the Hive CREATE TABLE syntax, but I don't see a
>>>> serious problem here. If a user just writes CREATE TABLE without USING,
>>>> ROW FORMAT, or STORED AS, does it matter what table we create?
>>>> Internally the parser rules conflict and we pick the native syntax
>>>> depending on the rule order, but the user-facing behavior looks fine.
>>>>
>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0.
>>>> Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>> Then CREATE EXTERNAL TABLE creates a Hive table, like in 2.4.
>>>>
>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi devs,
>>>>>
>>>>> I'd like to initiate a discussion and hear your voices on resolving
>>>>> the ambiguity between the two "create table" parser rules brought by
>>>>> SPARK-30098 [1].
>>>>>
>>>>> Previously, the two "create table" parser rules were clearly
>>>>> distinguished via "USING provider", which was very intuitive and
>>>>> deterministic: a DDL query creates a "Hive" table unless "USING
>>>>> provider" is specified. (Please refer to the parser rules in
>>>>> branch-2.4 [2].)
>>>>>
>>>>> After SPARK-30098, the "create table" parser rules became ambiguous
>>>>> (please refer to the parser rules in branch-3.0 [3]) - the only
>>>>> factors differentiating the two rules are "ROW FORMAT" and "STORED
>>>>> AS", which are both defined as "optional". Now the outcome relies on
>>>>> the "order" of the parser rules, which end users have no way to
>>>>> reason about, and which is very unintuitive.
>>>>>
>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>> first rule to provide a better message) brought more confusion (I've
>>>>> described the broken existing query in SPARK-30436 [4]).
>>>>>
>>>>> Personally, I'd like to see the two rules made mutually exclusive,
>>>>> instead of trying to document the difference and telling end users to
>>>>> be careful about their queries. I see two ways to make the rules
>>>>> mutually exclusive (a sketch of both follows the references below):
>>>>>
>>>>> 1. Add some identifier to the create Hive table rule, like `CREATE
>>>>> ... "HIVE" TABLE ...`.
>>>>>
>>>>> pros. This is the simplest way to distinguish between the two rules.
>>>>> cons. This would require end users to change their query if they
>>>>> intend to create a Hive table. (Given we will also provide a legacy
>>>>> option, I feel this is acceptable.)
>>>>>
>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>
>>>>> pros. Less invasive for existing queries.
>>>>> cons. Less intuitive, because these clauses have been optional and
>>>>> would now become mandatory in order to fall into the second rule.
>>>>>
>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>> syntax
>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>> 2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>> 3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
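>>>>>
>>>>> A sketch of how the same Hive-table DDL could look under each option
>>>>> (hypothetical syntax - neither option is implemented):
>>>>>
>>>>>   -- Option 1: an explicit identifier selects the Hive rule:
>>>>>   CREATE HIVE TABLE t (a INT) LOCATION '/tmp/t';
>>>>>
>>>>>   -- Option 2: ROW FORMAT / STORED AS becomes mandatory, so its
>>>>>   -- presence alone unambiguously selects the Hive rule:
>>>>>   CREATE TABLE t (a INT) STORED AS parquet LOCATION '/tmp/t';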
