A small correction: the example I provided for "vice versa" is not actually a vice versa case. It's the same case (intended to use rule 2, which is not the default) but with a different result.
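For reference, the statement in question looks like this (a sketch - the
table name, columns and path are made up; the behavior is as described in
the mails below):

  -- Intended to bind to rule 2 (Hive syntax) and worked in 2.4; in 3.0 it
  -- is captured by rule 1 instead, and fails with "CREATE EXTERNAL TABLE
  -- is not supported" rather than creating a different table:
  CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING)
  LOCATION '/user/hive/warehouse/ext_logs';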
On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <[email protected]> wrote:

> My concern is that although we think of the change as simply marking
> "USING provider" optional in rule 1, in reality the change most likely
> swaps the default rule for CREATE TABLE from "rule 2" to "rule 1" (that
> would be the happy case for the migration doc, if the swap happens as
> intended), and there are still a couple of things which make a query fall
> into rule 2, which is non-trivial to reason about and not easy to explain.
>
> I only mentioned ROW FORMAT and STORED AS as the cases that fall into
> rule 2, to simplify the problem statement, but they're not the only ones -
> using col_name1:col_type1 makes the query fall into rule 2 regardless of
> any other properties.
>
> Why does it matter? In Spark 2.x, if end users wanted to use rule 1
> (which was not the default) and specify the parameters only available in
> rule 1, the query clearly required "USING provider" - the parser would
> throw an error on any mistake in the parameters. End users could state
> their intention clearly as to which rule the query should be bound to. If
> the query failed to bind to the rule as intended, it was simply denied.
>
> In Spark 3.x, the parser may not help in the case where end users intend
> to use rule 2 (which is not the default) but make a mistake in the
> parameters - the query could just "silently" bind to rule 1 and may even
> execute without any error. Vice versa happens too, but in an odd way -
> e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message
> "CREATE EXTERNAL TABLE is not supported".
>
> It's deterministic for end users only if they fully understand the
> difference between the two rules and also understand how Spark applies
> them to make a query fall into one of them. I'm not sure how we can make
> things clear by improving documentation alone, but if the approach is to
> explain the difference between the rules and give tips on binding a query
> to a specific rule, the same could be applied to the parser rules to
> address the root cause.
>
> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <[email protected]> wrote:
>
>> Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both
>> the native and the Hive syntax. I'm fine with some changes to make it
>> less confusing, as long as the user-facing behavior doesn't change. For
>> example, define "ROW FORMAT" or "STORED AS" as mandatory only if the
>> legacy config is false.
>>
>> I still don't get your point about what the real problem is for end
>> users. There is no ambiguity, as the behavior is deterministic, although
>> we rely on optional fields and rule order, which is bad. It's hard to
>> document, but I don't think that's a big problem for end users.
>>
>> As for the legacy config, it does make the implementation more
>> complicated, but it's invisible to most end users (we don't document it)
>> and can be super useful to users who want their queries to keep working
>> in 3.0 without rewriting.
>>
>> If your only concern is documentation, I totally agree that we should
>> improve it.
>>
>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Thanks for sharing your view.
>>>
>>> I agree with you that it's good for Spark to promote Spark's own CREATE
>>> TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax
>>> unchanged - it's said to be there for "convenience", but I'm not sure I
>>> can agree with that.
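>>>
>>> For example (a minimal sketch assuming default configs - the table names
>>> are made up, and this is how I read the parser rules, not tested output):
>>>
>>>   -- Matches rule 1 (native); provider comes from spark.sql.sources.default:
>>>   CREATE TABLE t1 (a INT);
>>>
>>>   -- The colon after the column name is only valid in the Hive syntax, so
>>>   -- this nearly identical statement silently falls into rule 2 (Hive):
>>>   CREATE TABLE t2 (a: INT);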
>>>
>>> I'll quote my comments in SPARK-31136 here again to make the problem
>>> statement clearer:
>>>
>>> I think the parser implementation around CREATE TABLE brings ambiguity
>>> which is not documented anywhere. It wasn't ambiguous before because we
>>> forced users to specify "USING provider" if it's not a Hive table. Now
>>> it's either the default provider or Hive according to which options are
>>> provided, which seems non-trivial to reason about. (End users would
>>> never know, as it comes entirely from the parser rules.)
>>>
>>> I see this as an issue of "not breaking old behavior". The parser rules
>>> get pretty complicated due to supporting the legacy config. Not breaking
>>> anything will make us stuck eventually.
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>
>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the
>>> value of spark.sql.sources.default as its provider. In Spark version 2.4
>>> and earlier, it was hive. To restore the behavior before Spark 3.0, you
>>> can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>
>>> This is not true if "ROW FORMAT" / "STORED AS" are provided, and we
>>> don't describe that anywhere.
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>
>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>   [ USING data_source ]
>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>     INTO num_buckets BUCKETS ]
>>>   [ LOCATION path ]
>>>   [ COMMENT table_comment ]
>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>   [ AS select_statement ]
>>>
>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>
>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>   [ COMMENT table_comment ]
>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>     | ( col_name1, col_name2, ... ) ]
>>>   [ ROW FORMAT row_format ]
>>>   [ STORED AS file_format ]
>>>   [ LOCATION path ]
>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>   [ AS select_statement ]
>>>
>>> At a minimum we should describe that the parser will try to match the
>>> first rule (CREATE TABLE ~ USING data_source) and fall back to the
>>> second one; and even if we describe this, it's not intuitive to reason
>>> about which rule a DDL query will fall into. As I commented earlier,
>>> "ROW FORMAT" and "STORED AS" are the options which make a DDL query fall
>>> into the second rule, but they're documented as "optional", so the
>>> gotcha is hard to catch.
>>>
>>> Furthermore, while we document the syntax as above, in reality we allow
>>> "EXTERNAL" in the first rule (and throw an error), which ends up
>>> breaking existing "CREATE EXTERNAL TABLE ~ LOCATION" queries. Even if we
>>> add "USING hive", the parser throws an error; it now requires "ROW
>>> FORMAT" or "STORED AS".
>>>
>>> Simply put, do we really think end users should stop and match their
>>> query against the parser rules (or the rule order, when we explain it in
>>> the doc) by themselves to understand which provider the table will use?
>>> I'm sorry, but I think we are making a bad assumption about end users,
>>> which is a serious problem.
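>>>
>>> To make the EXTERNAL breakage above concrete (a sketch - the table name,
>>> column, and path are illustrative):
>>>
>>>   -- Worked in 2.4 (Hive rule); in 3.0 it is captured by the first rule,
>>>   -- which accepts EXTERNAL only to throw an error:
>>>   CREATE EXTERNAL TABLE t (a INT) LOCATION '/tmp/t';
>>>
>>>   -- Adding USING hive does not help; the parser still throws an error:
>>>   CREATE EXTERNAL TABLE t (a INT) USING hive LOCATION '/tmp/t';
>>>
>>>   -- Only ROW FORMAT or STORED AS moves the query into the second rule:
>>>   CREATE EXTERNAL TABLE t (a INT) STORED AS parquet LOCATION '/tmp/t';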
>>>
>>> If we really want to promote Spark's own CREATE TABLE syntax, does it
>>> really matter if we treat the Hive CREATE TABLE as the "exceptional" one
>>> and isolate the two from each other? What's the point of providing a
>>> legacy config to go back to the old behavior if we fear breaking
>>> something while making things better or clearer? If we do think the
>>> table provider is important (hence the change was made), is it still a
>>> trivial problem that the provider is decided by specifying "optional"
>>> fields?
>>>
>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <[email protected]> wrote:
>>>
>>>> I think the general guideline is to promote Spark's own CREATE TABLE
>>>> syntax instead of the Hive one. Previously these two rules were
>>>> mutually exclusive, because the native syntax required the USING clause
>>>> while the Hive syntax made the ROW FORMAT and STORED AS clauses
>>>> optional.
>>>>
>>>> It's a good move to make the USING clause optional, which makes it
>>>> easier to write the native CREATE TABLE syntax. Unfortunately, it leads
>>>> to some conflicts with the Hive CREATE TABLE syntax, but I don't see a
>>>> serious problem here. If a user just writes CREATE TABLE without USING,
>>>> ROW FORMAT, or STORED AS, does it matter what table we create?
>>>> Internally the parser rules conflict and we pick the native syntax
>>>> depending on the rule order, but the user-facing behavior looks fine.
>>>>
>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0.
>>>> Shall we simply remove EXTERNAL from the native CREATE TABLE syntax?
>>>> Then CREATE EXTERNAL TABLE creates a Hive table, like in 2.4.
>>>>
>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi devs,
>>>>>
>>>>> I'd like to initiate a discussion and hear your voices on resolving
>>>>> the ambiguity between the two "create table" parser rules brought by
>>>>> SPARK-30098 [1].
>>>>>
>>>>> Previously, the two "create table" parser rules were clearly
>>>>> distinguished via "USING provider", which was very intuitive and
>>>>> deterministic: a DDL query creates a "Hive" table unless "USING
>>>>> provider" is specified. (Please refer to the parser rules in
>>>>> branch-2.4 [2].)
>>>>>
>>>>> After SPARK-30098, the "create table" parser rules became ambiguous
>>>>> (please refer to the parser rules in branch-3.0 [3]) - the only
>>>>> factors differentiating the two rules are "ROW FORMAT" and "STORED
>>>>> AS", which are both defined as "optional". Now the outcome relies on
>>>>> the "order" of the parser rules, which end users have no way to
>>>>> reason about, and which is very unintuitive.
>>>>>
>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the
>>>>> first rule to provide a better message) brought more confusion (I've
>>>>> described the broken existing query in SPARK-30436 [4]).
>>>>>
>>>>> Personally, I'd like to see the two rules made mutually exclusive,
>>>>> instead of trying to document the difference and telling end users to
>>>>> be careful about their queries. I see two ways to make the rules
>>>>> mutually exclusive (a sketch of both follows the references below):
>>>>>
>>>>> 1. Add some identifier to the create Hive table rule, like `CREATE
>>>>> ... "HIVE" TABLE ...`.
>>>>>
>>>>> pros. This is the simplest way to distinguish between the two rules.
>>>>> cons. This would require end users to change their query if they
>>>>> intend to create a Hive table. (Given we will also provide a legacy
>>>>> option, I feel this is acceptable.)
>>>>>
>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>
>>>>> pros. Less invasive for existing queries.
>>>>> cons. Less intuitive, because these clauses have been optional and
>>>>> would now become mandatory in order to fall into the second rule.
>>>>>
>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>> syntax
>>>>> https://issues.apache.org/jira/browse/SPARK-30098
>>>>> 2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>> 3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
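>>>>>
>>>>> A sketch of how the same Hive-table DDL could look under each option
>>>>> (hypothetical syntax - neither option is implemented):
>>>>>
>>>>>   -- Option 1: an explicit identifier selects the Hive rule:
>>>>>   CREATE HIVE TABLE t (a INT) LOCATION '/tmp/t';
>>>>>
>>>>>   -- Option 2: ROW FORMAT / STORED AS becomes mandatory, so its
>>>>>   -- presence alone unambiguously selects the Hive rule:
>>>>>   CREATE TABLE t (a INT) STORED AS parquet LOCATION '/tmp/t';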
