Re: [Structured Streaming SPARK-23966] Why non-atomic rename is a problem in State Store ?

2018-10-02 Thread Jungtaek Lim
Thanks, Steve, for answering in detail. I had the same impression as Chandan
from that line as well: it went against my knowledge, since the rename operation
itself in HDFS is atomic, and I didn't realize it was aimed at tackling object
stores.

I learned a lot about object stores from your answer. Thanks again.

Jungtaek Lim (HeartSaVioR)

On Wed, Oct 3, 2018 at 2:48 AM, chandan prakash wrote:

> Thanks a lot Steve and Jungtaek for your answers.
> Steve,
> You explained really well in depth.
>
>  I understood that the existing old implementation was not correct for
> object stores like S3, and the new implementation will address that. For
> better performance we should choose a Direct Write based checkpoint
> rather than a Rename based one (which we can implement using the new
> CheckpointFileManager abstraction).
> My confusion was because of this line in the PR:
> *This is incorrect as rename is not atomic in HDFS FileSystem
> implementation*
> I thought the above line meant that the existing old implementation is not
> correct for the HDFS file system as well.
> So I wanted to understand if there is something I am missing. The new
> implementation is for addressing issues of object stores like S3, not HDFS.
> Thanks again for your explanation, I am sure it will help a lot of other
> code readers as well.
>
> Regards,
> Chandan
>
>
>
> On Mon, Oct 1, 2018 at 5:37 PM Steve Loughran 
> wrote:
>
>>
>>
>> On 11 Aug 2018, at 17:33, chandan prakash 
>> wrote:
>>
>> Hi All,
>> I was going through this pull request about new CheckpointFileManager
>> abstraction in structured streaming coming in 2.4 :
>> https://issues.apache.org/jira/browse/SPARK-23966
>> https://github.com/apache/spark/pull/21048
>>
>> I went through the code in detail and found it will introduce a very
>> nice abstraction which is much cleaner and extensible for direct-write
>> file systems like S3 (in addition to the current HDFS file system).
>>
>> *But I am unable to understand: is it really solving some problem in the
>> existing State Store code in Spark 2.3?*
>>
>> *My questions related to the statements in the State Store code:*
>>  *PR description*: "Checkpoint files must be written atomically such
>> that *no partial files are generated*."
>> *QUESTION*: When are partial files generated in the current code? I can
>> see that data is first written to a temp-delta file and then renamed to
>> the version.delta file. If something bad happens, the task will fail due to
>> the thrown exception and abort() will be called on the store to close and
>> delete tempDeltaFileStream. I think it is quite clean; in what case might
>> partial files be generated?
>>
>>
>> I suspect the issue is that as files are written to a "classic" POSIX
>> store, flush/sync operations can result in the intermediate data being
>> visible to others. Which is why the convention for checkpoint/commit
>> operations is: write to a temp file & rename. Which is not what you want
>> for object stores, especially S3.
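
For illustration, a minimal sketch of that write-to-temp-and-rename convention
against the plain Hadoop FileSystem API (this is not the actual
HDFSBackedStateStore code; the helper name, paths and payload are made up):

import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: write the payload to a hidden temp file, then rename it into
// place. Readers polling finalFile never observe the partially written bytes.
def writeCheckpointViaRename(fs: FileSystem, finalFile: Path, payload: Array[Byte]): Unit = {
  val tempFile = new Path(finalFile.getParent, s".${finalFile.getName}.tmp")
  val out = fs.create(tempFile, true)            // overwrite any stale temp file
  try out.write(payload) finally out.close()
  // On HDFS this rename is a single atomic metadata operation; on S3 it is
  // implemented as copy + delete, i.e. O(data) and not atomic.
  if (!fs.rename(tempFile, finalFile)) {
    fs.delete(tempFile, false)
    throw new java.io.IOException(s"Could not commit checkpoint file $finalFile")
  }
}
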
>>
>>
>>
>>  *PR description*: *State Store behavior is incorrect - HDFS FileSystem
>> implementation does not have atomic rename*"
>> *QUESTION*: The HDFS filesystem rename operation is atomic. I think the above
>> line takes into account checking whether the file already exists and then
>> taking appropriate action, which together makes the file renaming operation
>> multi-step and hence non-atomic. But why is this behaviour incorrect?
>> Even if multiple executors try to write to the same version.delta file,
>> only the first of them will succeed; the second one will see the file exists
>> and will delete its temp-delta file. Looks good.
>>
>>
>> HDFS single file and dir rename is atomic; it grabs a lock on the
>> metadata store, does the change, unlocks it. If you are doing any FS op
>> which explicitly renames more than one file in your commit, you lose
>> atomicity. If there's a check + rename then yes, it's a two-step operation,
>> unless you can use create(path, overwrite=false) to create some lease file
>> where you know that the creation is exclusive & atomic for HDFS + POSIX,
>> and generally not at all for the object stores, especially S3, which can
>> actually cache the 404 in its load balancers for a few tens of milliseconds.
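
A rough sketch of that create(path, overwrite=false) trick, again with the plain
Hadoop FileSystem API (names are illustrative; HDFS raises
FileAlreadyExistsException on an exclusive-create clash, though some FileSystem
implementations throw a plain IOException instead):

import org.apache.hadoop.fs.{FileAlreadyExistsException, FileSystem, Path}

// Sketch only: the first caller to create the marker file "wins" the commit.
def tryClaimVersion(fs: FileSystem, versionFile: Path): Boolean =
  try {
    val out = fs.create(versionFile, false)      // overwrite = false => exclusive create
    out.close()
    true                                         // we own this version
  } catch {
    case _: FileAlreadyExistsException => false  // someone else committed first
  }
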
>>
>> For object stores, you are in a different world of pain:
>>
>> S3: nope; O(files + data), observable, with partial failures. Plus list
>> inconsistency and caching of negative GET/HEAD to defend against DoS.
>> wasb: no, except for bits of the tree where you enable leases, something
>> which increases the cost of operations. O(files), with the odd pause if some
>> shard movement has to take place.
>> Google GCS: not sure, but it is O(files).
>> Azure ABFS: not atomic yet. As the code says:
>>
>> if (isAtomicRenameKey(source.getName())) {
>>   LOG.warn("The atomic rename feature is not supported by the ABFS scheme; however rename,"
>>       + " create and delete operations are atomic if Namespace is enabled for your Azure Storage account.");
>> }
>>
>> From my reading of the SPARK-23966 PR, it's the object store problem which
>> is being addressed, both correctness and performance.

Re: [Structured Streaming SPARK-23966] Why non-atomic rename is a problem in State Store ?

2018-10-02 Thread chandan prakash
Thanks a lot Steve and Jungtaek for your answers.
Steve,
You explained really well in depth.

 I understood that the existing old implementation was not correct for
object stores like S3, and the new implementation will address that. For
better performance we should choose a Direct Write based checkpoint
rather than a Rename based one (which we can implement using the new
CheckpointFileManager abstraction).
My confusion was because of this line in the PR:
*This is incorrect as rename is not atomic in HDFS FileSystem
implementation*
I thought the above line meant that the existing old implementation is not
correct for the HDFS file system as well.
So I wanted to understand if there is something I am missing. The new
implementation is for addressing issues of object stores like S3, not HDFS.
Thanks again for your explanation, I am sure it will help a lot of other
code readers as well.

Regards,
Chandan



On Mon, Oct 1, 2018 at 5:37 PM Steve Loughran 
wrote:

>
>
> On 11 Aug 2018, at 17:33, chandan prakash 
> wrote:
>
> Hi All,
> I was going through this pull request about new CheckpointFileManager
> abstraction in structured streaming coming in 2.4 :
> https://issues.apache.org/jira/browse/SPARK-23966
> https://github.com/apache/spark/pull/21048
>
> I went through the code in detail and found it will introduce a very nice
> abstraction which is much cleaner and extensible for direct-write file
> systems like S3 (in addition to the current HDFS file system).
>
> *But I am unable to understand: is it really solving some problem in the
> existing State Store code in Spark 2.3?*
>
> *My questions related to the statements in the State Store code:*
>  *PR description*: "Checkpoint files must be written atomically such
> that *no partial files are generated*."
> *QUESTION*: When are partial files generated in the current code? I can see
> that data is first written to a temp-delta file and then renamed to
> the version.delta file. If something bad happens, the task will fail due to
> the thrown exception and abort() will be called on the store to close and
> delete tempDeltaFileStream. I think it is quite clean; in what case might
> partial files be generated?
>
>
> I suspect the issue is that as files are written to a "classic" POSIX
> store, flush/sync operations can result in the intermediate data being
> visible to others. Which is why the convention for checkpoint/commit
> operations is: write to a temp file & rename. Which is not what you want
> for object stores, especially S3.
>
>
>
>  *PR description*: *State Store behavior is incorrect - HDFS FileSystem
> implementation does not have atomic rename*"
> *QUESTION*: The HDFS filesystem rename operation is atomic. I think the above
> line takes into account checking whether the file already exists and then
> taking appropriate action, which together makes the file renaming operation
> multi-step and hence non-atomic. But why is this behaviour incorrect?
> Even if multiple executors try to write to the same version.delta file,
> only the first of them will succeed; the second one will see the file exists
> and will delete its temp-delta file. Looks good.
>
>
> HDFS single file and dir rename is atomic; it grabs a lock on the
> metadata store, does the change, unlocks it. If you are doing any FS op
> which explicitly renames more than one file in your commit, you lose
> atomicity. If there's a check + rename then yes, it's a two-step operation,
> unless you can use create(path, overwrite=false) to create some lease file
> where you know that the creation is exclusive & atomic for HDFS + POSIX,
> and generally not at all for the object stores, especially S3, which can
> actually cache the 404 in its load balancers for a few tens of milliseconds.
>
> For object stores, you are in a different world of pain:
>
> S3: nope; O(files + data), observable, with partial failures. Plus list
> inconsistency and caching of negative GET/HEAD to defend against DoS.
> wasb: no, except for bits of the tree where you enable leases, something
> which increases the cost of operations. O(files), with the odd pause if some
> shard movement has to take place.
> Google GCS: not sure, but it is O(files).
> Azure ABFS: not atomic yet. As the code says:
>
> if (isAtomicRenameKey(source.getName())) {
>   LOG.warn("The atomic rename feature is not supported by the ABFS scheme; however rename,"
>       + " create and delete operations are atomic if Namespace is enabled for your Azure Storage account.");
> }
>
> From my reading of the SPARK-23966 PR, it's the object store problem which
> is being addressed, both correctness and performance.
>
>
> Anything I am missing here?
> Really curious to know which corner cases we are trying to solve with this
> new pull request?
>
>
>
> Object stores as the back end. For S3 in particular, where that rename is
> O(data) and a direct PUT to the destination gives you that atomicity.
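
As a sketch of what that direct write looks like (illustrative only, not the new
CheckpointFileManager API): against an s3a:// path the object only becomes
visible when the output stream is closed (a single PUT or multipart completion),
so writing straight to the final path already behaves atomically and skips the
O(data) rename.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch only: write directly to the final destination; on S3A the object
// appears in full only after close(), so no partial file is ever visible.
def writeCheckpointDirect(finalFile: Path, payload: Array[Byte], conf: Configuration): Unit = {
  val fs = finalFile.getFileSystem(conf)
  val out = fs.create(finalFile, true)
  try out.write(payload) finally out.close()
}
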
>
>
> Someone needs to sit down and write that reference implementation.
>
> Whoever does want to 

Re: [DISCUSS] Syntax for table DDL

2018-10-02 Thread Ryan Blue
I'd say that it was important to be compatible with Hive in the past, but
that's becoming less important over time. Spark is well established with
Hadoop users and I think the focus moving forward should be to make Spark
more predictable as a SQL engine for people coming from more traditional
databases.

That said, I think there is no problem supporting the alter syntax for both
Hive/MySQL and the more standard versions.

On Tue, Oct 2, 2018 at 8:35 AM Felix Cheung 
wrote:

> I think it has been an important “selling point” that Spark is “mostly
> compatible” with Hive DDL.
>
> I have seen a lot of teams suffering from switching between Presto and Hive
> dialects.
>
> So one question I have is: are we at the point of switching from Hive
> compatible to ANSI SQL, say?
>
> Perhaps a more critical question: what does it take to get the platform to
> support both, by making the ANTLR grammar extensible?
>
>
>
> --
> *From:* Alessandro Solimando 
> *Sent:* Tuesday, October 2, 2018 12:35 AM
> *To:* rb...@netflix.com
> *Cc:* Xiao Li; dev
> *Subject:* Re: [DISCUSS] Syntax for table DDL
>
> I agree with Ryan, a "standard" and more widely adopted syntax is usually
> a good idea, with possibly some slight improvements like "bulk deletion" of
> columns (especially because both the syntax and the semantics are clear),
> rather than staying with Hive syntax at any cost.
>
> I am personally following this PR with a lot of interest, thanks for all
> the work along this direction.
>
> Best regards,
> Alessandro
>
> On Mon, 1 Oct 2018 at 20:21, Ryan Blue  wrote:
>
>> What do you mean by consistent with the syntax in SqlBase.g4? These
>> aren’t currently defined, so we need to decide what syntax to support.
>> There are more details below, but the syntax I’m proposing is more standard
>> across databases than Hive, which uses confusing and non-standard syntax.
>>
>> I doubt that we want to support Hive syntax for a few reasons. Hive uses
>> the same column CHANGE statement for multiple purposes, so it ends up
>> with strange patterns for simple tasks, like updating the column’s type:
>>
>> ALTER TABLE t CHANGE a1 a1 INT;
>>
>> The column name is doubled because old name, new name, and type are
>> always required. So you have to know the type of a column to change its
>> name and you have to double up the name to change its type. Hive also
>> allows a couple other oddities:
>>
>>- Column reordering with FIRST and AFTER keywords. Column reordering
>>is tricky to get right so I’m not sure we want to add it.
>>- RESTRICT and CASCADE to signal whether to change all partitions or
>>not. Spark doesn’t support partition-level schemas except through Hive, and
>>even then I’m not sure how reliable it is.
>>
>> I know that we wouldn’t necessarily have to support these features from
>> Hive, but I’m pointing them out to ask the question: why copy Hive’s syntax
>> if it is unlikely that Spark will implement all of the “features”? I’d
>> rather go with SQL syntax from databases like PostgreSQL or others that are
>> more standard and common.
>>
>> The more “standard” versions of these statements are like what I’ve
>> proposed:
>>
>>- ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER
>>is used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL
>>and Oracle. COLUMN is optional in Oracle and TYPE is omitted by
>>databases other than PostgreSQL. I think we could easily add MODIFY as
>>an alternative to the second ALTER (and maybe alternatives like UPDATE
>>and CHANGE) and make both TYPE and COLUMN optional.
>>- ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName:
>>This syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the
>>same syntax as Hive and it appears that SQL Server doesn’t have this
>>statement. This also matches the table rename syntax, which uses TO.
>>- ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
>>matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN
>>optional. Most don’t allow deleting multiple columns, but it’s a reasonable
>>extension.
>>
>> While we’re on the subject of ALTER TABLE DDL, I should note that all of
>> the databases use ADD COLUMN syntax that differs from Hive (and
>> currently, Spark):
>>
>>- ALTER TABLE ident ADD COLUMN qualifiedName dataType (','
>>qualifiedName dataType)*: All other databases I looked at use ADD
>>COLUMN, but not all of them support adding multiple columns at the
>>same time. Hive requires ( and ) enclosing the columns and uses the
>>COLUMNS keyword instead of COLUMN. I think that Spark should be
>>updated to make the parens optional and to support both keywords,
>>COLUMN and COLUMNS.
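
To make that concrete, here is what the proposed statements could look like on a
hypothetical table (the syntax is only proposed at this point, so these will not
parse on current Spark; table and column names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical examples of the proposed, more standard ALTER TABLE forms.
// These will not parse until the proposed syntax lands; illustration only.
spark.sql("ALTER TABLE events ALTER COLUMN point.x TYPE double")           // update data type
spark.sql("ALTER TABLE events RENAME COLUMN ts TO event_ts")               // rename column
spark.sql("ALTER TABLE events DROP COLUMNS tmp_a, tmp_b")                  // drop columns
spark.sql("ALTER TABLE events ADD COLUMN session_id string, city string")  // add columns, no parens
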
>>
>> What does everyone think? Is it reasonable to use the more standard
>> syntax instead of using Hive as a base?
>>
>> rb
>>
>> On Fri, Sep 28, 2018 at 11:07 PM Xiao Li  wrote:
>>
>>> Are they consistent with the current syntax defined in SqlBase.g4? I think
>>> we are following the Hive DDL syntax:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column

Re: [DISCUSS] Syntax for table DDL

2018-10-02 Thread Felix Cheung
I think it has been an important “selling point” that Spark is “mostly
compatible” with Hive DDL.

I have seen a lot of teams suffering from switching between Presto and Hive
dialects.

So one question I have is: are we at the point of switching from Hive
compatible to ANSI SQL, say?

Perhaps a more critical question: what does it take to get the platform to
support both, by making the ANTLR grammar extensible?




From: Alessandro Solimando 
Sent: Tuesday, October 2, 2018 12:35 AM
To: rb...@netflix.com
Cc: Xiao Li; dev
Subject: Re: [DISCUSS] Syntax for table DDL

I agree with Ryan, a "standard" and more widely adopted syntax is usually a 
good idea, with possibly some slight improvements like "bulk deletion" of 
columns (especially because both the syntax and the semantics are clear), 
rather than staying with Hive syntax at any cost.

I am personally following this PR with a lot of interest, thanks for all the 
work along this direction.

Best regards,
Alessandro

On Mon, 1 Oct 2018 at 20:21, Ryan Blue  wrote:

What do you mean by consistent with the syntax in SqlBase.g4? These aren’t 
currently defined, so we need to decide what syntax to support. There are more 
details below, but the syntax I’m proposing is more standard across databases 
than Hive, which uses confusing and non-standard syntax.

I doubt that we want to support Hive syntax for a few reasons. Hive uses the 
same column CHANGE statement for multiple purposes, so it ends up with strange 
patterns for simple tasks, like updating the column’s type:

ALTER TABLE t CHANGE a1 a1 INT;


The column name is doubled because old name, new name, and type are always 
required. So you have to know the type of a column to change its name and you 
have to double up the name to change its type. Hive also allows a couple other 
oddities:

  *   Column reordering with FIRST and AFTER keywords. Column reordering is 
tricky to get right so I’m not sure we want to add it.
  *   RESTRICT and CASCADE to signal whether to change all partitions or not. 
Spark doesn’t support partition-level schemas except through Hive, and even 
then I’m not sure how reliable it is.

I know that we wouldn’t necessarily have to support these features from Hive, 
but I’m pointing them out to ask the question: why copy Hive’s syntax if it is 
unlikely that Spark will implement all of the “features”? I’d rather go with 
SQL syntax from databases like PostgreSQL or others that are more standard and 
common.

The more “standard” versions of these statements are like what I’ve proposed:

  *   ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is used 
by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and Oracle. COLUMN 
is optional in Oracle and TYPE is omitted by databases other than PostgreSQL. I 
think we could easily add MODIFY as an alternative to the second ALTER (and 
maybe alternatives like UPDATE and CHANGE) and make both TYPE and COLUMN 
optional.
  *   ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This 
syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the same syntax 
as Hive and it appears that SQL Server doesn’t have this statement. This also 
matches the table rename syntax, which uses TO.
  *   ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This matches 
PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN optional. Most 
don’t allow deleting multiple columns, but it’s a reasonable extension.

While we’re on the subject of ALTER TABLE DDL, I should note that all of the 
databases use ADD COLUMN syntax that differs from Hive (and currently, Spark):

  *   ALTER TABLE ident ADD COLUMN qualifiedName dataType (',' qualifiedName 
dataType)*: All other databases I looked at use ADD COLUMN, but not all of them 
support adding multiple columns at the same time. Hive requires ( and ) 
enclosing the columns and uses the COLUMNS keyword instead of COLUMN. I think 
that Spark should be updated to make the parens optional and to support both 
keywords, COLUMN and COLUMNS.

What does everyone think? Is it reasonable to use the more standard syntax 
instead of using Hive as a base?

rb

On Fri, Sep 28, 2018 at 11:07 PM Xiao Li <gatorsm...@gmail.com> wrote:
Are they consistent with the current syntax defined in SqlBase.g4? I think we 
are following the Hive DDL syntax: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column

Ryan Blue wrote on Fri, Sep 28, 2018 at 3:47 PM:

Hi everyone,

I’m currently working on new table DDL statements for v2 tables. For context, 
the new logical plans for DataSourceV2 require a catalog interface so that 
Spark can create tables for operations like CTAS. The proposed TableCatalog API 
also includes an API for altering those tables so we can make ALTER TABLE 
statements work. I’m implementing those DDL statements, which will make it into 
upstream Spark when the TableCatalog PR is merged.

Since I’m adding new SQL statements that don’t yet exist in Spark, I want to 
make sure that the syntax I’m using in our branch will match the syntax we add 
to Spark later.

Re: [Discuss] Datasource v2 support for Kerberos

2018-10-02 Thread Steve Loughran


On 2 Oct 2018, at 04:44, tigerquoll <tigerqu...@outlook.com> wrote:

Hi Steve,
I think that passing a Kerberos keytab around is one of those bad ideas that
is entirely appropriate to re-question every single time you come across it.
It has already been used in Spark when interacting with Kerberos systems
that do not support delegation tokens. Any such system will eventually stop
talking to Spark once the passed Kerberos tickets expire and are unable to
be renewed.

It is one of those "best bad idea we have" type situations that has arisen,
been discussed to death, and finally, grudgingly, an interim-only solution was
settled on: passing the keytab to the workers to renew Kerberos tickets.

The Spark AM generally pushes out the tickets to the workers; I don't
believe the workers get to see the keytab, do they?

Gabor's illustration in the Kafka SPIP is probably the best I've ever seen:
https://docs.google.com/document/d/1ouRayzaJf_N5VQtGhVq9FURXVmRpXzEEWYHob0ne3NY/edit#


A long-time notable offender in this area is secure Kafka. Thankfully Kafka
delegation tokens are soon to be supported in Spark, removing the need to
pass keytabs around when interacting with Kafka.

This particular thread could probably be better renamed as "Generic
Datasource v2 support for Kerberos configuration" - I would like to steer away
from conversation on alternate architectures that could handle a lack of
delegation tickets (it is a worthwhile conversation, but a long and involved
one that would distract from this particular narrowly defined topic), and
focus just on configuration information. A very quick look through
various client code has identified at least the following configuration
information that could potentially be of use to a datasource that uses
Kerberos:

* krb5ConfPath
* kerberos debugging flags

mmm. 
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/secrets.html

FWIW, Hadoop 2.8+ has the KDiag entry point which can also be run inside an
application, though there's always the risk that going near UGI too early can
"collapse" Kerberos state prematurely:

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/KDiag.java

If Spark needs something like that for 2.7.x too, copying & repackaging that
class would be a place to start.


* spark.security.credentials.${service}.enabled
* JAAS config
* ZKServerPrincipal ??
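
As a rough sketch of where some of these knobs already live today (the JVM
properties and the spark.security.credentials.<service>.enabled switch are
existing mechanisms; nothing below is a proposed DSv2 API, and the paths are
made up):

import org.apache.spark.sql.SparkSession

object KerberosConfigSketch {
  def main(args: Array[String]): Unit = {
    // JVM-level knobs: custom krb5.conf location and Kerberos debug logging
    System.setProperty("java.security.krb5.conf", "/etc/alt/krb5.conf")
    System.setProperty("sun.security.krb5.debug", "true")

    val spark = SparkSession.builder()
      .appName("kerberos-config-sketch")
      .master("local[2]")  // local master only so the sketch runs standalone
      // existing per-service switch for Spark's delegation token providers
      .config("spark.security.credentials.hbase.enabled", "false")
      .getOrCreate()

    // A JAAS config is usually passed as a JVM option rather than a Spark conf, e.g.
    //   --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/path/jaas.conf"

    spark.stop()
  }
}
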

It is entirely feasible that each datasource may require its own unique
Kerberos configuration (e.g. you are pulling from an external datasource that
has a different KDC than the YARN cluster you are running on).

This is a use case I've never encountered; instead, everyone relies on cross-AD
trust. That's complex enough as it is.


Re: [DISCUSS] Syntax for table DDL

2018-10-02 Thread Alessandro Solimando
I agree with Ryan, a "standard" and more widely adopted syntax is usually a
good idea, with possibly some slight improvements like "bulk deletion" of
columns (especially because both the syntax and the semantics are clear),
rather than staying with Hive syntax at any cost.

I am personally following this PR with a lot of interest, thanks for all
the work along this direction.

Best regards,
Alessandro

On Mon, 1 Oct 2018 at 20:21, Ryan Blue  wrote:

> What do you mean by consistent with the syntax in SqlBase.g4? These aren’t
> currently defined, so we need to decide what syntax to support. There are
> more details below, but the syntax I’m proposing is more standard across
> databases than Hive, which uses confusing and non-standard syntax.
>
> I doubt that we want to support Hive syntax for a few reasons. Hive uses
> the same column CHANGE statement for multiple purposes, so it ends up
> with strange patterns for simple tasks, like updating the column’s type:
>
> ALTER TABLE t CHANGE a1 a1 INT;
>
> The column name is doubled because old name, new name, and type are always
> required. So you have to know the type of a column to change its name and
> you have to double up the name to change its type. Hive also allows a
> couple other oddities:
>
>- Column reordering with FIRST and AFTER keywords. Column reordering
>is tricky to get right so I’m not sure we want to add it.
>- RESTRICT and CASCADE to signal whether to change all partitions or
>not. Spark doesn’t support partition-level schemas except through Hive, and
>even then I’m not sure how reliable it is.
>
> I know that we wouldn’t necessarily have to support these features from
> Hive, but I’m pointing them out to ask the question: why copy Hive’s syntax
> if it is unlikely that Spark will implement all of the “features”? I’d
> rather go with SQL syntax from databases like PostgreSQL or others that are
> more standard and common.
>
> The more “standard” versions of these statements are like what I’ve
> proposed:
>
>- ALTER TABLE ident ALTER COLUMN qualifiedName TYPE dataType: ALTER is
>used by SQL Server, Access, DB2, and PostgreSQL; MODIFY by MySQL and
>Oracle. COLUMN is optional in Oracle and TYPE is omitted by databases
>other than PostgreSQL. I think we could easily add MODIFY as an
>alternative to the second ALTER (and maybe alternatives like UPDATE
>and CHANGE) and make both TYPE and COLUMN optional.
>- ALTER TABLE ident RENAME COLUMN qualifiedName TO qualifiedName: This
>syntax is supported by PostgreSQL, Oracle, and DB2. MySQL uses the same
>syntax as Hive and it appears that SQL Server doesn’t have this statement.
>This also matches the table rename syntax, which uses TO.
>- ALTER TABLE ident DROP (COLUMN | COLUMNS) qualifiedNameList: This
>matches PostgreSQL, Oracle, DB2, and SQL Server. MySQL makes COLUMN
>optional. Most don’t allow deleting multiple columns, but it’s a reasonable
>extension.
>
> While we’re on the subject of ALTER TABLE DDL, I should note that all of
> the databases use ADD COLUMN syntax that differs from Hive (and
> currently, Spark):
>
>- ALTER TABLE ident ADD COLUMN qualifiedName dataType (','
>qualifiedName dataType)*: All other databases I looked at use ADD
>COLUMN, but not all of them support adding multiple columns at the
>same time. Hive requires ( and ) enclosing the columns and uses the
>COLUMNS keyword instead of COLUMN. I think that Spark should be
>updated to make the parens optional and to support both keywords,
>COLUMN and COLUMNS.
>
> What does everyone think? Is it reasonable to use the more standard syntax
> instead of using Hive as a base?
>
> rb
>
> On Fri, Sep 28, 2018 at 11:07 PM Xiao Li  wrote:
>
>> Are they consistent with the current syntax defined in SqlBase.g4? I
>> think we are following the Hive DDL syntax:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
>>
>> Ryan Blue wrote on Fri, Sep 28, 2018 at 3:47 PM:
>>
>>> Hi everyone,
>>>
>>> I’m currently working on new table DDL statements for v2 tables. For
>>> context, the new logical plans for DataSourceV2 require a catalog interface
>>> so that Spark can create tables for operations like CTAS. The proposed
>>> TableCatalog API also includes an API for altering those tables so we can
>>> make ALTER TABLE statements work. I’m implementing those DDL statements,
>>> which will make it into upstream Spark when the TableCatalog PR is merged.
>>>
>>> Since I’m adding new SQL statements that don’t yet exist in Spark, I
>>> want to make sure that the syntax I’m using in our branch will match the
>>> syntax we add to Spark later. I’m basing this proposed syntax on
>>> PostgreSQL.
>>>
>>>- *Update data type*: ALTER TABLE tableIdentifier ALTER COLUMN
>>>qualifiedName TYPE dataType.
>>>- *Rename column*: ALTER TABLE