[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r635099518

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +99,19 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite && !SQLConf.get.allowNonEmptyLocationInCTAS) {

Review comment:
For SQL CTAS, the mode is always `ErrorIfExists`, and Spark just checks whether the table exists or not; it was not checking for a non-empty data path. In that case, I'll change the condition to `if (saveMode == SaveMode.ErrorIfExists && !SQLConf.get.allowNonEmptyLocationInCTAS)`.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r635076302

## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala

@@ -235,10 +235,10 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSparkSession {
   test("create table using as select - with overridden max number of buckets") {
-    def createTableSql(numBuckets: Int): String =
+    def createTableSql(tablePath: String, numBuckets: Int): String =
       s"""
          |CREATE TABLE t USING PARQUET
-         |OPTIONS (PATH '${path.toURI}')
+         |OPTIONS (PATH '$tablePath')

Review comment:
Yes, in this test case, the function `createTableSql` will be called twice inside `Seq(11, maxNrBuckets).foreach(...)`. As this PR doesn't allow CTAS with a non-empty location, the second call of `createTableSql` would fail if the same `path` were used. That is why a new table location is created for each iteration.
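The per-iteration fresh location can be illustrated outside the suite. This is a hedged Python sketch, not the actual Scala test: `create_table_sql` and its `CLUSTERED BY` clause are hypothetical stand-ins for the suite's `createTableSql` helper.

```python
import tempfile
from pathlib import Path

def create_table_sql(table_path: str, num_buckets: int) -> str:
    # Hypothetical stand-in for the suite's createTableSql helper.
    return (f"CREATE TABLE t USING PARQUET "
            f"OPTIONS (PATH '{table_path}') "
            f"CLUSTERED BY (a) SORTED BY (b) INTO {num_buckets} BUCKETS "
            f"AS SELECT 1 AS a, 2 AS b")

# Each iteration gets its own empty directory, so the second CTAS never
# targets a location already populated by the first run.
statements = []
for n in (11, 200000):
    with tempfile.TemporaryDirectory() as d:
        statements.append(create_table_sql(Path(d).as_uri(), n))
```

Reusing one directory across both iterations would trip the new non-empty-location check; a fresh temp dir per iteration keeps each CTAS target empty.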
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632289902

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +99,20 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+        !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(hadoopConf)
+      if (fs.exists(filePath) &&
+          fs.getFileStatus(filePath).isDirectory &&
+          fs.listStatus(filePath).length != 0) {
+        throw new AnalysisException(
+          s"CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory " +
+            s"${tablePath} . To disable the behavior, " +
+            s"set 'spark.sql.legacy.allowNonEmptyLocationInCTAS to true.")

Review comment:
Done
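The three-part check in the diff above (exists, is a directory, has entries) maps naturally onto local-filesystem primitives. A minimal Python sketch of the same logic, assuming boolean `overwrite`/`allow_non_empty` parameters in place of `SaveMode` and `SQLConf` (names are illustrative, not Spark's API):

```python
from pathlib import Path

def assert_empty_root_path(table_path: str, overwrite: bool,
                           allow_non_empty: bool) -> None:
    # Mirrors the three Hadoop FS checks: exists, getFileStatus(...).isDirectory,
    # and listStatus(...).length != 0.
    if not overwrite and not allow_non_empty:
        p = Path(table_path)
        if p.exists() and p.is_dir() and any(p.iterdir()):
            raise ValueError(
                "CREATE-TABLE-AS-SELECT cannot create table with location "
                f"to a non-empty directory {table_path}. To allow it, set "
                "'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.")
```

A non-existent or empty target passes silently; only an existing directory with at least one entry raises, and overwrite mode skips the check entirely.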
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632289297

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -598,6 +598,37 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
   }

+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION '$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"""CREATE TABLE ctas_with_existing_location stored as rcfile LOCATION
+           |'$tempLocation' AS SELECT key k, value FROM src ORDER BY k, value""".stripMargin)
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>
+      Seq("false", "true").foreach { allowNonEmptyDirFlag =>
+        withSQLConf(
+          SQLConf.CONVERT_CTAS.key -> convertCTASFlag,
+          SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS.key -> allowNonEmptyDirFlag) {
+          withTempDir { dir =>
+            val tempLocation = dir.toURI.toString
+            withTable("ctas1", "ctas_with_existing_location") {
+              if (!spark.conf.get(SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS)) {

Review comment:
Done, changed the type of val `allowNonEmptyDirFlag` to Boolean.
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632288726

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -598,6 +598,37 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
   }

+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION '$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"""CREATE TABLE ctas_with_existing_location stored as rcfile LOCATION
+           |'$tempLocation' AS SELECT key k, value FROM src ORDER BY k, value""".stripMargin)
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632288384

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +99,20 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+        !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(hadoopConf)
+      if (fs.exists(filePath) &&
+          fs.getFileStatus(filePath).isDirectory &&

Review comment:
Done

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +99,20 @@ object DataWritingCommand {
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, hadoopConf: Configuration) {
+    if (saveMode != SaveMode.Overwrite &&
+        !SQLConf.get.allowCreateTableAsSelectNonEmptyLocation) {

Review comment:
Done

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +99,20 @@ object DataWritingCommand {
+      if (fs.exists(filePath) &&
+          fs.getFileStatus(filePath).isDirectory &&
+          fs.listStatus(filePath).length != 0) {
+        throw new AnalysisException(
+          s"CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory " +
+            s"${tablePath} . To disable the behavior, " +

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r632288269

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -598,6 +598,37 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
   }

+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION '$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"""CREATE TABLE ctas_with_existing_location stored as rcfile LOCATION
+           |'$tempLocation' AS SELECT key k, value FROM src ORDER BY k, value""".stripMargin)
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>
+      Seq("false", "true").foreach { allowNonEmptyDirFlag =>
+        withSQLConf(
+          SQLConf.CONVERT_CTAS.key -> convertCTASFlag,
+          SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS.key -> allowNonEmptyDirFlag) {
+          withTempDir { dir =>
+            val tempLocation = dir.toURI.toString
+            withTable("ctas1", "ctas_with_existing_location") {
+              if (!spark.conf.get(SQLConf.ALLOW_NON_EMPTY_LOCATION_IN_CTAS)) {
+                intercept[AnalysisException] {

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631758038

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -598,6 +598,38 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
   }

+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION " +
+        s"'file:$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"CREATE TABLE ctas_with_existing_location stored as rcfile " +
+        s"LOCATION 'file:$tempLocation' " +
+        s"AS SELECT key k, value FROM src ORDER BY k, value")
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>

Review comment:
`withSQLConf` accepts pairs of `(String, String)`; passing `(String, Boolean)` will fail to compile.
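The constraint described above can be mimicked in a small sketch: a conf helper that, like `withSQLConf`, only accepts string values, so boolean flags must be stringified before being passed in. `with_sql_conf` here is a hypothetical Python analog, not a Spark API:

```python
def with_sql_conf(pairs, body):
    # Hypothetical analog of withSQLConf's (String, String) signature:
    # every conf value must already be a string.
    for key, value in pairs.items():
        if not isinstance(value, str):
            raise TypeError(
                f"conf value for {key} must be a string, got {type(value).__name__}")
    body(pairs)

# Booleans are stringified first, mirroring the Scala test's Seq("false", "true").
for flag in (False, True):
    with_sql_conf(
        {"spark.sql.legacy.allowNonEmptyLocationInCTAS": str(flag).lower()},
        lambda conf: None)
```

Where Scala rejects `(String, Boolean)` at compile time, this sketch can only fail at runtime, which is why the conversion is done at the call site in both cases.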
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631706750

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -1604,6 +1604,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val ALLOW_CTAS_NON_EMPTY_LOCATION =
+    buildConf("spark.sql.legacy.ctas.allowNonEmptyLocation")
+      .internal()
+      .doc("When false, Create Table As Location Select will throw Analysis Exception, " +
+        "if the location is non empty.")

Review comment:
Done

## File path: docs/sql-migration-guide.md

@@ -78,7 +78,9 @@ license: |
 - In Spark 3.2, the timestamps subtraction expression such as `timestamp '2021-03-31 23:48:00' - timestamp '2021-01-01 00:00:00'` returns values of `DayTimeIntervalType`. In Spark 3.1 and earlier, the type of the same expression is `CalendarIntervalType`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.

 - In Spark 3.2, `CREATE TABLE .. LIKE ..` command can not use reserved properties. You need their specific clauses to specify them, for example, `CREATE TABLE test1 LIKE test LOCATION 'some path'`. You can set `spark.sql.legacy.notReserveProperties` to `true` to ignore the `ParseException`, in this case, these properties will be silently removed, for example: `TBLPROPERTIES('owner'='yao')` will have no effect. In Spark version 3.1 and below, the reserved properties can be used in `CREATE TABLE .. LIKE ..` command but have no side effects, for example, `TBLPROPERTIES('location'='/tmp')` does not change the location of the table but only create a headless property just like `'a'='b'`.
-
+
+ - In Spark 3.2, `CREATE TABLE AS SELECT` with non-empty `LOCATION` will throw `AnalysisException`. To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.ctas.allowNonEmptyLocation` to `true`.
+

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631706649

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -1604,6 +1604,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val ALLOW_CTAS_NON_EMPTY_LOCATION =
+    buildConf("spark.sql.legacy.ctas.allowNonEmptyLocation")

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631706428

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +98,23 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+        !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {

Review comment:
Removed the null check.
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631705687

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +98,23 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+        !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {

Review comment:
Combined the 3 `if` conditions into 1.
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631705217

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +98,23 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+        !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)
+          if (lStats != null && lStats.length != 0) {

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631622569

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +98,23 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+        !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)
+          if (lStats != null && lStats.length != 0) {
+            throw new AnalysisException(
+              s"CREATE-TABLE-AS-SELECT cannot create table" +

Review comment:
In Hive (https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java#L408), the error message is `CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory.` To be consistent, shall we show this message?

"CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory. To disable the behavior, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true"
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r631550114

## File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -598,6 +598,38 @@ abstract class SQLQuerySuiteBase extends QueryTest with SQLTestUtils with TestHi
   }

+  test("SPARK-28551: CTAS Hive Table should be with non-existent or empty location") {
+    def executeCTASWithNonEmptyLocation(tempLocation: String) {
+      sql(s"CREATE TABLE ctas1(id string) stored as rcfile LOCATION " +
+        s"'file:$tempLocation/ctas1'")
+      sql("INSERT INTO TABLE ctas1 SELECT 'A' ")
+      sql(s"CREATE TABLE ctas_with_existing_location stored as rcfile " +
+        s"LOCATION 'file:$tempLocation' " +
+        s"AS SELECT key k, value FROM src ORDER BY k, value")
+    }
+
+    Seq("false", "true").foreach { convertCTASFlag =>
+      Seq("false", "true").foreach { allowNonEmptyDirFlag =>

Review comment:
`withSQLConf` accepts pairs of `(String, String)`; passing `(String, Boolean)` will fail to compile.
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL]CTAS with LOCATION , should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r630301266

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -1604,6 +1604,15 @@ object SQLConf {
     .booleanConf
     .createWithDefault(false)

+  val ALLOW_CTAS_NON_EMPTY_LOCATION =
+    buildConf("spark.sql.legacy.ctas.allowNonEmptyLocation")
+      .internal()
+      .doc("When false, Create Table As Location Select will throw Analysis Exception, " +
+        "if the location is non empty.")
+      .version("3.2")

Review comment:
Done
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL]CTAS with LOCATION , should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411:
URL: https://github.com/apache/spark/pull/32411#discussion_r630278821

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

@@ -96,4 +98,23 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY), metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+        !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)

Review comment:
Yes, Hive does the same thing (HIVE-11319); ref: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L13615

The `listStatus` call will be slow when the location has data, which is a failure condition as per this PR. So in the normal case, the location directory will be empty and `listStatus` won't be slow. Later, another JIRA, HIVE-15367, added one more level of check for a target location that contains only the staging dirs.
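As noted above, `listStatus` materializes the full directory listing, which is only slow in the failure case. For a standalone check, one alternative (not what Hive or this PR does) is to stop at the first entry; a minimal Python sketch:

```python
import os

def dir_is_nonempty(path: str) -> bool:
    # Stops at the first entry instead of materializing the whole listing,
    # so the check stays cheap even for a huge directory.
    with os.scandir(path) as it:
        return next(it, None) is not None
```

This trades the exact entry count (which the assertion never needs) for a constant-cost emptiness test.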