Re: [DISCUSS] Syntax for table DDL

Felix Cheung Tue, 02 Oct 2018 08:36:22 -0700

I think it has been an important “selling point” that Spark is “mostly 
compatible“ with Hive DDL.


I have see a lot of teams suffering from switching between Presto and Hive 
dialects.

So one question I have is, we are at a point of switch from Hive compatible to 
ANSI SQL, say?

Perhaps a more critical question, what does it take to get the platform to 
support both, by making the ANTLR extensible?



________________________________
From: Alessandro Solimando <alessandro.solima...@gmail.com>
Sent: Tuesday, October 2, 2018 12:35 AM
To: rb...@netflix.com
Cc: Xiao Li; dev
Subject: Re: [DISCUSS] Syntax for table DDL

I agree with Ryan, a "standard" and more widely adopted syntax is usually a 
good idea, with possibly some slight improvements like "bulk deletion" of 
columns (especially because both the syntax and the semantics are clear), 
rather than stay with Hive syntax at any cost.

I am personally following this PR with a lot of interest, thanks for all the 
work along this direction.

Best regards,
Alessandro

On Mon, 1 Oct 2018 at 20:21, Ryan Blue <rb...@netflix.com.invalid> wrote:

What do you mean by consistent with the syntax in SqlBase.g4? These aren’t 
currently defined, so we need to decide what syntax to support. There are more 
details below, but the syntax I’m proposing is more standard across databases 
than Hive, which uses confusing and non-standard syntax.

I doubt that we want to support Hive syntax for a few reasons. Hive uses the 
same column CHANGE statement for multiple purposes, so it ends up with strange 
patterns for simple tasks, like updating the column’s type:

ALTER TABLE t CHANGE a1 a1 INT;


The column name is doubled because old name, new name, and type are always 
required. So you have to know the type of a column to change its name and you 
have to double up the name to change its type. Hive also allows a couple other 
oddities:

  *   Column reordering with FIRST and AFTER keywords. Column reordering is 
tricky to get right so I’m not sure we want to add it.
  *   RESTRICT and CASCADE to signal whether to change all partitions or not. 
Spark doesn’t support partition-level schemas except through Hive, and even 
then I’m not sure how reliable it is.

I know that we wouldn’t necessarily have to support these features from Hive, 
but I’m pointing them out to ask the question: why copy Hive’s syntax if it is 
unlikely that Spark will implement all of the “features”? I’d rather go with 
SQL syntax from databases like PostgreSQL or others that are more standard and 
common.

The more “standard” versions of these statements are like what I’ve proposed:

  *   ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is used 
by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and Oracle. COLUMN 
is optional in Oracle and TYPE is omitted by databases other than PosgreSQL. I 
think we could easily add MODIFY as an alternative to the second ALTER (and 
maybe alternatives like UPDATE and CHANGE) and make both TYPE and COLUMN 
optional.
  *   ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This 
syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the same syntax 
as Hive and it appears that SQL server doesn’t have this statement. This also 
match the table rename syntax, which uses TO.
  *   ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This matches 
PostgreSQL, Oracle, DB2, and SQL server. MySQL makes COLUMN optional. Most 
don’t allow deleting multiple columns, but it’s a reasonable extension.

While we’re on the subject of ALTER TABLE DDL, I should note that all of the 
databases use ADD COLUMN syntax that differs from Hive (and currently, Spark):

  *   ALTER TABLE ident ADD COLUMN qualifiedName dataType (',' qualifiedName 
dataType)*: All other databases I looked at use ADD COLUMN, but not all of them 
support adding multiple columns at the same time. Hive requires ( and ) 
enclosing the columns and uses the COLUMNS keyword instead of COLUMN. I think 
that Spark should be updated to make the parens optional and to support both 
keywords, COLUMN and COLUMNS.

What does everyone think? Is it reasonable to use the more standard syntax 
instead of using Hive as a base?

rb

On Fri, Sep 28, 2018 at 11:07 PM Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
Are they consistent with the current syntax defined in SqlBase.g4? I think we 
are following the Hive DDL syntax: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column

Ryan Blue <rb...@netflix.com.invalid> 于2018年9月28日周五 下午3:47写道：

Hi everyone,

I’m currently working on new table DDL statements for v2 tables. For context, 
the new logical plans for DataSourceV2 require a catalog interface so that 
Spark can create tables for operations like CTAS. The proposed TableCatalog API 
also includes an API for altering those tables so we can make ALTER TABLE 
statements work. I’m implementing those DDL statements, which will make it into 
upstream Spark when the TableCatalog PR is merged.

Since I’m adding new SQL statements that don’t yet exist in Spark, I want to 
make sure that the syntax I’m using in our branch will match the syntax we add 
to Spark later. I’m basing this proposed syntax on 
PostgreSQL<https://www.postgresql.org/docs/current/static/ddl-alter.html>.

  *   Update data type: ALTER TABLE tableIdentifier ALTER COLUMN qualifiedName 
TYPE dataType.
  *   Rename column: ALTER TABLE tableIdentifier RENAME COLUMN qualifiedName TO 
qualifiedName
  *   Drop column: ALTER TABLE tableIdentifier DROP (COLUMN | COLUMNS) 
qualifiedNameList

A few notes:

  *   Using qualifiedName in these rules allows updating nested types, like 
point.x.
  *   Updates and renames can only alter one column, but drop can drop a list.
  *   Rename can’t move types and will validate that if the TO name is 
qualified, that the prefix matches the original field.
  *   I’m also changing ADD COLUMN to support adding fields to nested columns 
by using qualifiedName instead of identifier.

Please reply to this thread if you have suggestions based on a different SQL 
engine or want this syntax to be different for another reason. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix


--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Syntax for table DDL

Reply via email to