[ https://issues.apache.org/jira/browse/SPARK-31136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058333#comment-17058333 ]
Jungtaek Lim edited comment on SPARK-31136 at 3/18/20, 8:14 AM:
----------------------------------------------------------------

This reminds me of my previous PR: [https://github.com/apache/spark/pull/27107]

Please go through the comments in the PR again. I'm quoting the key point here:

{quote}The parts differentiating between two syntaxes are skewSpec, rowFormat, and createFileFormat (using any of them would make create statement go into 2nd syntax), and all of them are optional. We're not enforcing to specify it but rely on the parser.
{quote}

I think the parser implementation around CREATE TABLE introduces ambiguity which is not documented anywhere. It wasn't ambiguous before, because we forced users to specify USING provider if the table is not a Hive table. Now the statement resolves to either the default provider or Hive depending on which options are provided, which is non-trivial to reason about. (End users would never know, as it's decided entirely by the parser rule.)

I see this as the downside of "not breaking old behavior". The parser rule has become quite complicated in order to support the legacy config. Never breaking anything will eventually leave us stuck.
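To make the ambiguity concrete, here is a hypothetical sketch (not Spark's actual parser code) of the decision the parser rule effectively makes. The function and parameter names (skew_spec, row_format, create_file_format) simply mirror the grammar rules quoted above:

```python
# Hypothetical sketch of how the CREATE TABLE parser rule picks a provider.
# This is NOT Spark's implementation; names mirror the quoted grammar rules.

def resolve_create_table_provider(skew_spec=None, row_format=None,
                                  create_file_format=None,
                                  default_provider="parquet"):
    """Return which provider a CREATE TABLE statement resolves to.

    If any Hive-only clause is present, the statement falls into the
    second (Hive) syntax; otherwise it silently gets the default
    datasource provider.
    """
    if skew_spec or row_format or create_file_format:
        return "hive"
    return default_provider

# A plain CREATE TABLE t(a STRING) now gets the default provider...
assert resolve_create_table_provider() == "parquet"
# ...while adding e.g. STORED AS (a createFileFormat clause) flips it to Hive.
assert resolve_create_table_provider(create_file_format="STORED AS orc") == "hive"
```

The point of the sketch: because all three clauses are optional, the user never states which syntax they intend, and the outcome falls out of the parser rule alone.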
> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-31136
>                 URL: https://issues.apache.org/jira/browse/SPARK-31136
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Blocker
>              Labels: correctness
>
> We need to consider the behavior change of SPARK-30098.
> This is a placeholder to keep the discussion and the final decision.
> `CREATE TABLE` syntax changes its behavior silently.
> The following is one example of breaking existing user data pipelines.
>
> *Apache Spark 2.4.5*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> spark-sql> SELECT * FROM t LIMIT 1;
> # Apache Spark
> Time taken: 2.05 seconds, Fetched 1 row(s)
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a	3
> {code}
>
> *Apache Spark 3.0.0-preview2*
> {code}
> spark-sql> CREATE TABLE t(a STRING);
> spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
> Error in query: LOAD DATA is not supported for datasource tables: `default`.`t`;
> {code}
> {code}
> spark-sql> CREATE TABLE t(a CHAR(3));
> spark-sql> INSERT INTO TABLE t SELECT 'a ';
> spark-sql> SELECT a, length(a) FROM t;
> a	2
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
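The second pair of snippets in the issue description hinges on CHAR(N) padding semantics: a Hive table right-pads the inserted value to the declared length, while a datasource table that silently treats CHAR(3) as STRING stores the value as-is. A rough illustrative sketch of that semantic difference (an assumption-labeled model, not Spark code):

```python
# Illustrative sketch (assumption: NOT Spark's implementation) of why
# length(a) differs between a Hive CHAR(3) column and a datasource table
# that treats CHAR(3) as STRING.

def store_as_hive_char(value: str, n: int) -> str:
    """Hive CHAR(N): truncate beyond N, then right-pad with spaces to N."""
    return value[:n].ljust(n)

def store_as_string(value: str) -> str:
    """Datasource table treating CHAR(N) as STRING: store the value as-is."""
    return value

inserted = "a "  # the literal from the JIRA example: 'a' plus one trailing space
assert len(store_as_hive_char(inserted, 3)) == 3   # Spark 2.4.5 result: 3
assert len(store_as_string(inserted)) == 2         # Spark 3.0.0-preview2 result: 2
```

This is why the same INSERT/SELECT pair returns 3 on 2.4.5 and 2 on 3.0.0-preview2: the table silently changed from a Hive table to a datasource table.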