Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Wenchen Fan Wed, 18 Mar 2020 04:50:41 -0700

The fact that we have 2 CREATE TABLE syntax is already confusing many
users. Shall we only document the native syntax? Then users don't need to
worry about which rule their query fits and they don't need to spend a lot
of time understanding the subtle difference between these 2 syntaxes.


On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> A bit correction: the example I provided for vice versa is not really a
> correct case for vice versa. It's actually same case (intended to use rule
> 2 which is not default) but different result.
>
> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> My concern is that although we simply think about the change to
>> mark "USING provider" to be optional in rule 1, but in reality the change
>> is most likely swapping the default rule for CREATE TABLE, which was "rule
>> 2", and now "rule 1" (it would be the happy case of migration doc if the
>> swap happens as intended), and there're still couple of things which make
>> the query still fall into rule 2 which is non-trivial to reason about and
>> also not easy to explain.
>>
>> I only mentioned ROW FORMAT and STORED AS as the case to fall into rule 2
>> to simplify the problem statement, but they're not only the case - using
>> col_name1:col_type1 would make the query fall into rule 2 regardless of any
>> other properties.
>>
>> What's the matter? In Spark 2.x, if end users want to use rule 1 (which
>> is not default) and specify the parameters which are only available in rule
>> 1, it clearly requires "USING provider" - parser will throw error if
>> there're any mistakes on specifying parameters. End users could set
>> intention clearly which rule the query should be bound. If the query fails
>> to bind the rule as intended, it's simply denied.
>>
>> In Spark 3.x, parser may not help figuring out the case where end users
>> intend to use rule 2 (which is not default) but some mistake on specifying
>> parameters - it could just "silently" bound into rule 1 and it may be even
>> executed without any error. Vice versa happens, but in odd way - e.g.
>> CREATE EXTERNAL TABLE ... LOCATION fails with weird message CREATE EXTERNAL
>> TABLE is not supported.
>>
>> It's deterministic for end users only if they're fully understand the
>> difference between the twos and also understand how Spark would apply the
>> rules to make the query fall into one of the rule. I'm not sure how we can
>> only improve documentation to make things be clear, but if the approach
>> would be explaining the difference of rules and guide the tip to make the
>> query be bound to the specific rule, the same could be applied to parser
>> rule to address the root cause.
>>
>>
>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Document-wise, yes, it's confusing as a simple CREATE TABLE fits both
>>> native and Hive syntax. I'm fine with some changes to make it less
>>> confusing, as long as the user-facing behavior doesn't change. For example,
>>> define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config
>>> is false.
>>>
>>> I still don't get your point about what's the real problem to end users.
>>> There is no ambiguity as the behavior is deterministic, although we rely on
>>> optional fields and rule order which is bad. It's hard to document, but I
>>> don't think that's a big problem to end users.
>>>
>>> For the legacy config, it does make the implementation more complicated,
>>> but it's invisible to most end users (we don't document it) and can be
>>> super useful to some users who want the queries to keep working in 3.0
>>> without rewriting.
>>>
>>> If your only concern is documentation, I totally agree that we should
>>> improve it.
>>>
>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Thanks for sharing your view.
>>>>
>>>> I agree with you it's good for Spark to promote Spark's own CREATE
>>>> TABLE syntax. The thing is, we still leave Hive CREATE TABLE syntax
>>>> unchanged - it's being said as "convenience" but I'm not sure I can agree
>>>> with.
>>>>
>>>> I'll quote my comments in SPARK-31136 here again to make the problem
>>>> statement be clearer:
>>>>
>>>> I think the parser implementation around CREATE TABLE brings ambiguity
>>>> which is not documented anywhere. It wasn't ambiguous because we forced to
>>>> specify USE provider if it's not a Hive table. Now it's either default
>>>> provider or Hive according to which options are provided, which seems to be
>>>> non-trivial to reason about. (End users would never know, as it's
>>>> completely from parser rule.)
>>>>
>>>> I feel this as the issue of "not breaking old behavior". The parser
>>>> rule gets pretty much complicated due to support legacy config. Not
>>>> breaking anything would make us be stuck eventually.
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>
>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the
>>>> value of spark.sql.sources.default as its provider. In Spark version 2.4
>>>> and earlier, it was hive. To restore the behavior before Spark 3.0, you can
>>>> set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>
>>>>
>>>> It's not true if "ROW FORMAT" / "STORED AS" are provided, and we didn't
>>>> describe anything for this.
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>
>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier [ ( col_name1 col_type1
>>>> [ COMMENT col_comment1 ], ... ) ] [USING data_source] [ OPTIONS (
>>>> key1=val1, key2=val2, ... ) ] [ PARTITIONED BY ( col_name1, col_name2, ...
>>>> ) ] [ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [
>>>> ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] [ LOCATION path ] [
>>>> COMMENT table_comment ] [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ] [
>>>> AS select_statement ]
>>>>
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>
>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier [ (
>>>> col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] [ COMMENT
>>>> table_comment ] [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT
>>>> col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] [ ROW FORMAT
>>>> row_format ] [ STORED AS file_format ] [ LOCATION path ] [ TBLPROPERTIES (
>>>> key1=val1, key2=val2, ... ) ] [ AS select_statement ]
>>>>
>>>>
>>>> At least we should describe that parser will try to match the first
>>>> case (create table ~ using data source), and fail back to second case; even
>>>> though we describe this it's not intuitive to reason about which rule the
>>>> DDL query will fall into. As I commented earlier, "ROW FORMAT" and "STORED
>>>> AS" are the options which make DDL query fall into the second case, but
>>>> they're described as "optional" so it's hard to catch the gotcha.
>>>>
>>>> Furthermore, while we document the syntax as above, in reality we allow
>>>> "EXTERNAL" in first rule (and throw error), which ends up existing DDL
>>>> query "CREATE EXTERNAL TABLE ~ LOCATION" be broken. Even we add "USING
>>>> hive" parser will throw error. It now requires "ROW FORMAT" or "STORED AS".
>>>>
>>>>
>>>> Simply saying, do we really think end users should stop and try to
>>>> match their query against the parser rules (or orders when we explain in
>>>> the doc) by themselves to understand which provider the table will
>>>> leverage? I'm sorry but I think we are making bad assumption on end users
>>>> which is a serious problem.
>>>>
>>>> If we really want to promote Spark's one for CREATE TABLE, then would
>>>> it really matter to treat Hive CREATE TABLE be "exceptional" one and try to
>>>> isolate each other? What's the point of providing a legacy config to go
>>>> back to the old one even we fear about breaking something to make it better
>>>> or clearer? We do think that table provider is important (hence the change
>>>> was done), then is it still a trivial problem whether the provider is
>>>> affected by specifying the "optional" fields?
>>>>
>>>>
>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>>> syntax instead of the Hive one. Previously these two rules are mutually
>>>>> exclusive because the native syntax requires the USING clause while the
>>>>> Hive syntax makes ROW FORMAT or STORED AS clause optional.
>>>>>
>>>>> It's a good move to make the USING clause optional, which makes it
>>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads to
>>>>> some conflicts with the Hive CREATE TABLE syntax, but I don't see a 
>>>>> serious
>>>>> problem here. If a user just writes CREATE TABLE without USING or ROW
>>>>> FORMAT or STORED AS, does it matter what table we create? Internally the
>>>>> parser rules conflict and we pick the native syntax depending on the rule
>>>>> order. But the user-facing behavior looks fine.
>>>>>
>>>>> CREATE EXTERNAL TABLE is a problem as it works in 2.4 but not in 3.0.
>>>>> Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then
>>>>> CREATE EXTERNAL TABLE creates Hive table like 2.4.
>>>>>
>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi devs,
>>>>>>
>>>>>> I'd like to initiate discussion and hear the voices for resolving
>>>>>> ambiguous parser rule between two "create table"s being brought by
>>>>>> SPARK-30098 [1].
>>>>>>
>>>>>> Previously, "create table" parser rules were clearly distinguished
>>>>>> via "USING provider", which was very intuitive and deterministic. Say, 
>>>>>> DDL
>>>>>> query creates "Hive" table unless "USING provider" is specified,
>>>>>> (Please refer the parser rule in branch-2.4 [2])
>>>>>>
>>>>>> After SPARK-30098, "create table" parser rules became ambiguous
>>>>>> (please refer the parser rule in branch-3.0 [3]) - the factors
>>>>>> differentiating two rules are only "ROW FORMAT" and "STORED AS" which are
>>>>>> all defined as "optional". Now it relies on the "order" of parser rule
>>>>>> which end users would have no idea to reason about, and very unintuitive.
>>>>>>
>>>>>> Furthermore, undocumented rule of EXTERNAL (added in the first rule
>>>>>> to provide better message) brought more confusion (I've described the
>>>>>> broken existing query via SPARK-30436 [4]).
>>>>>>
>>>>>> Personally I'd like to see two rules mutually exclusive, instead of
>>>>>> trying to document the difference and talk end users to be careful about
>>>>>> their query. I'm seeing two ways to make rules be mutually exclusive:
>>>>>>
>>>>>> 1. Add some identifier in create Hive table rule, like `CREATE ...
>>>>>> "HIVE" TABLE ...`.
>>>>>>
>>>>>> pros. This is the simplest way to distinguish between two rules.
>>>>>> cons. This would lead end users to change their query if they intend
>>>>>> to create Hive table. (Given we will also provide legacy option I'm 
>>>>>> feeling
>>>>>> this is acceptable.)
>>>>>>
>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory one.
>>>>>>
>>>>>> pros. Less invasive for existing queries.
>>>>>> cons. Less intuitive, because they have been optional and now become
>>>>>> mandatory to fall into the second rule.
>>>>>>
>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>> syntax
>>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>>> 2.
>>>>>> https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>> 3.
>>>>>> https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>>>>
>>>>>>

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

Reply via email to