Big +1 for a single unified CREATE TABLE syntax. In general, there are 2 ways to specify the table provider: the USING clause and the ROW FORMAT/STORED AS clauses. These 2 ways are mutually exclusive. If neither is specified, it implicitly means USING defaultSource.
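For illustration, a minimal sketch of the three cases (table and column names are made up):

    CREATE TABLE t1 (id INT) USING parquet;    -- explicit provider
    CREATE TABLE t2 (id INT) STORED AS ORC;    -- Hive Serde table
    CREATE TABLE t3 (id INT);                  -- neither given: implicitly USING defaultSource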
I'm fine with a few special cases that can indicate the table provider, like EXTERNAL indicating a Hive Serde table. A few thoughts:

1. SKEWED BY ...: We support it in the Hive syntax just to fail it with a nice error message. We can support it in the unified syntax as well, and fail it there too.
2. PARTITIONED BY colTypeList: I think we can support it in the unified syntax. Just make sure it doesn't appear together with PARTITIONED BY transformList.
3. OPTIONS: We can either map it to Hive Serde properties, or let it indicate non-Hive tables.
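To make these concrete, a sketch of how the three cases might look under the unified rule (hypothetical behavior per this proposal, not current Spark semantics; names are made up):

    CREATE TABLE t4 (id INT) SKEWED BY (id) ON (1, 2);   -- case 1: parses, then fails with a nice error
    CREATE TABLE t5 (id INT) PARTITIONED BY (p STRING);  -- case 2: colTypeList form adds a partition column
    CREATE TABLE t6 (id INT) STORED AS ORC
      OPTIONS ('k'='v');                                 -- case 3: OPTIONS mapped to Hive Serde properties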
On Thu, Mar 19, 2020 at 1:00 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Thanks Nicholas for the side comment; you'd have to read "CREATE TABLE USING HIVE FORMAT" as CREATE TABLE using "HIVE FORMAT", but yes, it may add to the confusion.
>
> Ryan, thanks for the detailed analysis and proposal. That's what I'd like to see in a discussion thread.
>
> I'm open to solutions that enable end users to state their intention properly - my main concern with SPARK-30098 is that it becomes unclear which provider a CREATE TABLE query will use unless USING provider is explicitly specified. If the new proposal makes this clear, it would be better than what we have now.
>
> Replying inline:
>
> On Thu, Mar 19, 2020 at 11:06 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Side comment: The current docs for CREATE TABLE <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table.md#description> add to the confusion by describing the Hive-compatible command as "CREATE TABLE USING HIVE FORMAT", but neither "USING" nor "HIVE FORMAT" is actually part of the syntax <https://github.com/apache/spark/blob/4237251861c79f3176de7cf5232f0388ec5d946e/docs/sql-ref-syntax-ddl-create-table-hiveformat.md>.
>>
>> On Wed, Mar 18, 2020 at 8:31 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Jungtaek, it sounds like you consider the two rules to be separate syntaxes with their own consistency rules. For example, if I am using the Hive syntax rule, then the PARTITIONED BY clause adds new (partition) columns and requires types for those columns; if I'm using the Spark syntax rule with USING, then PARTITIONED BY must reference existing columns and cannot include types.
>>>
>>> I agree that this is confusing to users! We should fix it, but I don't think the right solution is to continue to have two rules with divergent syntax.
>>>
>>> This is confusing to users because they don't know anything about separate parser rules. All the user sees is that sometimes PARTITIONED BY requires types and sometimes it doesn't. Yes, we could add a keyword, HIVE, to signal that the syntax is borrowed from Hive for that case, but that actually breaks queries that run in Hive.
>
> That might matter less, because SPARK-30098 (and I guess your proposal as well) forces end users to add "USING HIVE" to their queries to get the Hive provider anyway, even when the query only matches rule 1 (conditionally). Once they decide to create a Hive table, the query may have to be changed, or they have to change the default provider, or they have to enable the legacy config.
>
>>> I think the right solution is to unify the two syntaxes. I don't think they are so different that it isn't possible. Here are the differences I see:
>>>
>>> - Only in Hive:
>>>   - EXTERNAL
>>>   - skewSpec: SKEWED BY ...
>>>   - rowFormat: ROW FORMAT DELIMITED ..., ROW FORMAT SERDE ...
>>>   - createFileFormat: STORED AS ...
>>> - Only in Spark:
>>>   - OPTIONS property list
>>> - Different syntax/interpretation:
>>>   - PARTITIONED BY transformList / PARTITIONED BY colTypeList
>
> The ":" after a column name is another thing only supported in Hive, though that's relatively minor to support in a unified syntax.
>
>>> For the clauses that are supported in one but not the other, we can add them to a unified rule as optional clauses. The AST builder would then validate what makes sense or not (e.g., STORED AS with USING, or ROW FORMAT DELIMITED) and finally pass the remaining data on using the CreateTableStatement. That statement would be handled like we do for the Spark rule today, but with extra metadata to pass along. This is also a step toward being able to put Hive behind the DSv2 API, because we'd be able to pass all of the Hive metadata clauses to the v2 catalog.
>>>
>>> The only difficult part is handling PARTITIONED BY. But in that case, we can use two different syntaxes from the same CREATE TABLE rule. If types are included, we use the Hive PARTITIONED BY syntax and convert in the AST builder to normalize to a single representation.
>
> The proposal looks promising - it may add some complexity, but it sounds worth doing.
>
> One thing to make clear - in the unified syntax we would only rely on the explicit provider or the default provider, right? I would be concerned if the proposal automatically used the Hive provider when Hive-specific clauses are present. Yes, as I said earlier, it may require end users to change their queries, but that's better than uncertainty.
>
> Btw, if the main purpose of adding the native syntax and making it the default is to stop supporting the Hive create table rule sooner, simply dropping rule 2 while providing the legacy config is still a valid option, I think.
>
>>> What do you both think? This would make the behavior more clear and take a step toward getting rid of Hive-specific code.
>>>
>>> On Wed, Mar 18, 2020 at 4:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> I'm trying to understand why you have been suggesting to keep the actual behavior unchanged but change the doc instead. Could you please elaborate? End users would blame us when they hit a case where their query doesn't work as intended (1), find that the behavior is undocumented (2), and find it hard to understand even from the Spark codebase (3).
>>>>
>>>> For me, addressing the root issue while adopting your suggestion would mean dropping rule 2 and supporting it only with the legacy config on. We would tell end users: enable the legacy config to use the Hive create table syntax, or just use beeline with Hive connected.
>>>>
>>>> But since we are already thinking about the native syntax as first class and dropping the Hive one implicitly (hiding it in the doc) or explicitly, does it really matter if we require a marker (like "HIVE") in rule 2 to isolate it? That would cause even less confusion than Spark 2.x, since we would require end users to add the Hive marker when creating a Hive table, which is easier to classify than "USING provider".
>>>>
>>>> If we think the native syntax would cover most cases where end users have been creating Hive tables in Spark (say, USING hive would simply work for them), I'm OK with dropping rule 2 and leading end users to enable the legacy config if really needed.
>>>> If not, let's continue "fixing" the issue.
>>>>
>>>> (Another valid approach would be consolidating the two rules into one, and defining the supported parameters per provider, e.g. EXTERNAL, STORED AS, ROW FORMAT, etc. are only supported by the Hive provider.)
>>>>
>>>> On Wed, Mar 18, 2020 at 8:47 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> The fact that we have 2 CREATE TABLE syntaxes is already confusing many users. Shall we only document the native syntax? Then users wouldn't need to worry about which rule their query fits, and they wouldn't need to spend a lot of time understanding the subtle differences between the 2 syntaxes.
>>>>>
>>>>> On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> A small correction: the example I provided for the reverse case is not really the reverse. It's actually the same case (intending to use rule 2, which is not the default) but with a different result.
>>>>>>
>>>>>> On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> My concern is that although we think of the change as simply making "USING provider" optional in rule 1, in reality the change most likely swaps the default rule for CREATE TABLE, which was rule 2 and is now rule 1 (that would be the happy case in the migration doc, if the swap happens as intended), and there are still a couple of things that make a query fall into rule 2, which is non-trivial to reason about and not easy to explain.
>>>>>>>
>>>>>>> I only mentioned ROW FORMAT and STORED AS as triggers for rule 2 to simplify the problem statement, but they're not the only ones - using col_name1:col_type1 makes the query fall into rule 2 regardless of any other properties.
>>>>>>>
>>>>>>> Why does this matter? In Spark 2.x, if end users wanted to use rule 1 (which was not the default) and specified parameters that are only available in rule 1, the query clearly required "USING provider" - the parser would throw an error on any mistake in the parameters. End users could state clearly which rule the query should bind to, and if the query failed to bind to the intended rule, it was simply rejected.
>>>>>>>
>>>>>>> In Spark 3.x, the parser may not help in the case where end users intend to use rule 2 (which is not the default) but make a mistake in the parameters - the query could just "silently" bind to rule 1 and may even execute without any error. The reverse also happens, in an odd way - e.g. CREATE EXTERNAL TABLE ... LOCATION fails with the weird message "CREATE EXTERNAL TABLE is not supported".
>>>>>>>
>>>>>>> It's deterministic for end users only if they fully understand the difference between the two rules and also understand how Spark applies them to make a query fall into one of them. I'm not sure documentation alone can make this clear, but if the approach is to explain the difference between the rules and give tips for binding a query to a specific rule, the same could be applied to the parser rules themselves to address the root cause.
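To make the silent-binding case concrete, a sketch of the kind of queries under discussion (hypothetical table names; 3.0 behavior with the legacy config off, as described above):

    -- Intended as a Hive table, but with no ROW FORMAT/STORED AS it silently
    -- binds to rule 1 and creates a default-provider table:
    CREATE TABLE t7 (id INT) LOCATION '/tmp/t7';

    -- Worked in 2.4 (Hive rule); in 3.0 it matches rule 1 and fails with a
    -- message like "CREATE EXTERNAL TABLE is not supported":
    CREATE EXTERNAL TABLE t8 (id INT) LOCATION '/tmp/t8';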
>>>>>>> On Wed, Mar 18, 2020 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Document-wise, yes, it's confusing, as a simple CREATE TABLE fits both the native and the Hive syntax. I'm fine with some changes to make it less confusing, as long as the user-facing behavior doesn't change. For example, define "ROW FORMAT" or "STORED AS" as mandatory only if the legacy config is false.
>>>>>>>>
>>>>>>>> I still don't get what the real problem is for end users. There is no ambiguity, as the behavior is deterministic, although we rely on optional fields and rule order, which is bad. It's hard to document, but I don't think that's a big problem for end users.
>>>>>>>>
>>>>>>>> As for the legacy config, it does make the implementation more complicated, but it's invisible to most end users (we don't document it) and can be super useful to users who want their queries to keep working in 3.0 without rewriting.
>>>>>>>>
>>>>>>>> If your only concern is documentation, I totally agree that we should improve it.
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2020 at 4:36 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for sharing your view.
>>>>>>>>>
>>>>>>>>> I agree that it's good for Spark to promote its own CREATE TABLE syntax. The thing is, we still leave the Hive CREATE TABLE syntax unchanged - it's described as a "convenience", but I'm not sure I can agree with that.
>>>>>>>>>
>>>>>>>>> I'll quote my comments from SPARK-31136 here again to make the problem statement clearer:
>>>>>>>>>
>>>>>>>>> I think the parser implementation around CREATE TABLE brings ambiguity which is not documented anywhere. It wasn't ambiguous before because we forced users to specify USING provider if it's not a Hive table. Now it's either the default provider or Hive depending on which options are provided, which seems non-trivial to reason about. (End users would never know, as it comes entirely from the parser rules.)
>>>>>>>>>
>>>>>>>>> I see this as the issue of "not breaking old behavior". The parser rules get pretty complicated due to supporting the legacy config. Never breaking anything will leave us stuck eventually.
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md
>>>>>>>>>
>>>>>>>>> Since Spark 3.0, CREATE TABLE without a specific provider will use the value of spark.sql.sources.default as its provider. In Spark version 2.4 and earlier, it was hive. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.
>>>>>>>>>
>>>>>>>>> That's not true if "ROW FORMAT" / "STORED AS" is provided, and we don't describe this anywhere.
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-datasource.md
>>>>>>>>> CREATE TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>   [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>   [ USING data_source ]
>>>>>>>>>   [ OPTIONS ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>   [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
>>>>>>>>>   [ CLUSTERED BY ( col_name3, col_name4, ... )
>>>>>>>>>     [ SORTED BY ( col_name [ ASC | DESC ], ... ) ]
>>>>>>>>>     INTO num_buckets BUCKETS ]
>>>>>>>>>   [ LOCATION path ]
>>>>>>>>>   [ COMMENT table_comment ]
>>>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>   [ AS select_statement ]
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/docs/sql-ref-syntax-ddl-create-table-hiveformat.md
>>>>>>>>>
>>>>>>>>> CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
>>>>>>>>>   [ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
>>>>>>>>>   [ COMMENT table_comment ]
>>>>>>>>>   [ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... )
>>>>>>>>>     | ( col_name1, col_name2, ... ) ]
>>>>>>>>>   [ ROW FORMAT row_format ]
>>>>>>>>>   [ STORED AS file_format ]
>>>>>>>>>   [ LOCATION path ]
>>>>>>>>>   [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
>>>>>>>>>   [ AS select_statement ]
>>>>>>>>>
>>>>>>>>> At least we should describe that the parser will try to match the first case (create table ~ using data source) and fall back to the second case; and even if we describe this, it's not intuitive to reason about which rule a DDL query will fall into. As I commented earlier, "ROW FORMAT" and "STORED AS" are the options that make a DDL query fall into the second case, but they're documented as "optional", so it's hard to catch the gotcha.
>>>>>>>>>
>>>>>>>>> Furthermore, while we document the syntax as above, in reality we allow "EXTERNAL" in the first rule (and throw an error), which ends up breaking the existing DDL query "CREATE EXTERNAL TABLE ~ LOCATION". Even if we add "USING hive", the parser will throw an error; it now requires "ROW FORMAT" or "STORED AS".
>>>>>>>>>
>>>>>>>>> Simply put, do we really think end users should stop and match their query against the parser rules (or their order, once we explain it in the doc) by themselves to understand which provider the table will use? I'm sorry, but I think we are making a bad assumption about end users, which is a serious problem.
>>>>>>>>>
>>>>>>>>> If we really want to promote Spark's own CREATE TABLE, does it really matter if we treat the Hive CREATE TABLE as the "exceptional" one and isolate the two from each other? What's the point of providing a legacy config to go back to the old behavior if we fear breaking anything in order to make things better or clearer? If we do think the table provider is important (hence the change was made), is it still a trivial problem that the provider is affected by specifying "optional" fields?
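As a concrete illustration of the fall-through being described (hypothetical table names; behavior as discussed in this thread for 3.0):

    CREATE TABLE t9 (id INT);                   -- matches rule 1: default-provider table
    CREATE TABLE t10 (id INT) STORED AS ORC;    -- STORED AS makes it fall into rule 2: Hive table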
>>>>>>>>> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think the general guideline is to promote Spark's own CREATE TABLE syntax instead of the Hive one. Previously these two rules were mutually exclusive because the native syntax required the USING clause while the Hive syntax makes the ROW FORMAT and STORED AS clauses optional.
>>>>>>>>>>
>>>>>>>>>> It's a good move to make the USING clause optional, which makes it easier to write the native CREATE TABLE syntax. Unfortunately, it leads to some conflicts with the Hive CREATE TABLE syntax, but I don't see a serious problem here. If a user just writes CREATE TABLE without USING or ROW FORMAT or STORED AS, does it matter what table we create? Internally the parser rules conflict and we pick the native syntax depending on the rule order, but the user-facing behavior looks fine.
>>>>>>>>>>
>>>>>>>>>> CREATE EXTERNAL TABLE is a problem, as it works in 2.4 but not in 3.0. Shall we simply remove EXTERNAL from the native CREATE TABLE syntax? Then CREATE EXTERNAL TABLE would create a Hive table as in 2.4.
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 16, 2020 at 10:55 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi devs,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to initiate a discussion and hear your voices on resolving the ambiguous parser rules between the two "create table"s brought by SPARK-30098 [1].
>>>>>>>>>>>
>>>>>>>>>>> Previously, the "create table" parser rules were clearly distinguished via "USING provider", which was very intuitive and deterministic: a DDL query creates a "Hive" table unless "USING provider" is specified. (Please refer to the parser rules in branch-2.4 [2].)
>>>>>>>>>>>
>>>>>>>>>>> After SPARK-30098, the "create table" parser rules became ambiguous (please refer to the parser rules in branch-3.0 [3]) - the only factors differentiating the two rules are "ROW FORMAT" and "STORED AS", which are both defined as "optional". Now it relies on the "order" of the parser rules, which end users have no way to reason about and which is very unintuitive.
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, the undocumented handling of EXTERNAL (added to the first rule to provide a better error message) brought more confusion (I've described the broken existing query in SPARK-30436 [4]).
>>>>>>>>>>>
>>>>>>>>>>> Personally, I'd like to see the two rules be mutually exclusive, instead of trying to document the difference and telling end users to be careful with their queries. I see two ways to make the rules mutually exclusive (sketched after this message):
>>>>>>>>>>>
>>>>>>>>>>> 1. Add some identifier to the create Hive table rule, like `CREATE ... "HIVE" TABLE ...`.
>>>>>>>>>>>
>>>>>>>>>>> pros: This is the simplest way to distinguish between the two rules.
>>>>>>>>>>> cons: This would lead end users to change their query if they intend to create a Hive table. (Given that we will also provide a legacy option, I feel this is acceptable.)
>>>>>>>>>>>
>>>>>>>>>>> 2. Define "ROW FORMAT" or "STORED AS" as mandatory.
>>>>>>>>>>>
>>>>>>>>>>> pros: Less invasive for existing queries.
>>>>>>>>>>> cons: Less intuitive, because these have been optional and would now become mandatory to fall into the second rule.
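A sketch of what the two options might look like (hypothetical syntax, only to make the proposals concrete):

    -- Option 1: an explicit marker selects the Hive rule
    CREATE HIVE TABLE t11 (id INT) STORED AS ORC;

    -- Option 2: the Hive rule matches only when ROW FORMAT or STORED AS is present
    CREATE TABLE t12 (id INT) STORED AS ORC;    -- Hive rule
    CREATE TABLE t13 (id INT);                  -- always the native rule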
>>>>>>>>>>> Would like to hear everyone's voices; better ideas are welcome!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>
>>>>>>>>>>> 1. SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
>>>>>>>>>>>    https://issues.apache.org/jira/browse/SPARK-30098
>>>>>>>>>>> 2. https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>> 3. https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
>>>>>>>>>>> 4. https://issues.apache.org/jira/browse/SPARK-30436
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix