[GitHub] spark pull request #14210: [SPARK-16556] [SPARK-16559] [SQL] Fix Two Bugs in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14210

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71828372

Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala

@@ -212,7 +213,23 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSQLContext with
       )
       val table = catalog.getTableMetadata(TableIdentifier("t"))
       assert(DDLUtils.getBucketSpecFromTableProperties(table) ==
-        Some(BucketSpec(5, Seq("a"), Seq("b"
+        Option(BucketSpec(5, Seq("a"), Seq("b"
+      }
+  }
+
+  test("create table using as select - with zero buckets") {
+    withTable("t") {
+      val e = intercept[AnalysisException] {
+        sql(
+          s"""
+             |CREATE TABLE t USING PARQUET
+             |OPTIONS (PATH '${path.toString}')
+             |CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS

--- End diff --

Hm? So we allow `CLUSTERED BY` when the schema is not specified for a data source table?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71826808

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

@@ -344,6 +344,11 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
          table, provider, partitionColumnNames, bucketSpec, mode, options, query)
      } else {
        val struct = Option(ctx.colTypeList()).map(createStructType)
+       if (struct.isEmpty && bucketSpec.nonEmpty) {
+         throw new ParseException(
+           "Expected explicit specification of table schema when using CLUSTERED BY clause.", ctx)

--- End diff --

`DataFrameWriter` does not allow it.
- First, the `save()` API has an [`assertNotBucketed("save")`](https://github.com/apache/spark/blob/864b764eafa57a1418b683ccf6899b01bab28fba/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L203)
- Second, the `insertInto(tableName: String)` API also has an [`assertNotBucketed("insertInto")`](https://github.com/apache/spark/blob/864b764eafa57a1418b683ccf6899b01bab28fba/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L243)
- Third, the `saveAsTable(tableName: String)` API relies on `CreateTableUsingAsSelect`, which is converted to `CreateDataSourceTableAsSelectCommand`. It also specifies the schema [here](https://github.com/apache/spark/blob/ce3b98bae28af72299722f56e4e4ef831f471ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L259)

I think we are pretty safe now. Please let me know if we should still move it. Thanks!
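The three guarded write paths gatorsmile lists can be sketched with a toy writer. This is a hypothetical simplified stand-in, not Spark's actual `DataFrameWriter`: `bucketBy` only records the spec, and the terminal actions `save`/`insertInto` call an `assertNotBucketed` guard before doing any work.

```scala
// Hypothetical miniature of the guard pattern described above; MiniWriter is
// not Spark's DataFrameWriter, just a sketch of how assertNotBucketed blocks
// the bucketed write paths.
class MiniWriter {
  private var numBuckets: Option[Int] = None

  // Records the bucket spec; validation is deferred to the terminal action.
  def bucketBy(n: Int, cols: String*): this.type = {
    numBuckets = Some(n)
    this
  }

  private def assertNotBucketed(operation: String): Unit =
    if (numBuckets.isDefined) {
      throw new IllegalArgumentException(
        s"'$operation' does not support bucketing right now")
    }

  // Terminal actions reject any recorded bucket spec up front.
  def save(path: String): Unit = assertNotBucketed("save")
  def insertInto(tableName: String): Unit = assertNotBucketed("insertInto")
}
```

With this shape, `new MiniWriter().bucketBy(5, "a").save("/tmp/t")` fails at the guard, while a plain `save` passes it.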
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71822711

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

@@ -344,6 +344,11 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
          table, provider, partitionColumnNames, bucketSpec, mode, options, query)
      } else {
        val struct = Option(ctx.colTypeList()).map(createStructType)
+       if (struct.isEmpty && bucketSpec.nonEmpty) {
+         throw new ParseException(
+           "Expected explicit specification of table schema when using CLUSTERED BY clause.", ctx)

--- End diff --

Yeah, let me find a place. Thanks!
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71815552

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

@@ -344,6 +344,11 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
          table, provider, partitionColumnNames, bucketSpec, mode, options, query)
      } else {
        val struct = Option(ctx.colTypeList()).map(createStructType)
+       if (struct.isEmpty && bucketSpec.nonEmpty) {
+         throw new ParseException(
+           "Expected explicit specification of table schema when using CLUSTERED BY clause.", ctx)

--- End diff --

Should we put this check somewhere else? `DataFrameWriter` also has this issue.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71746939

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

@@ -75,7 +75,12 @@ private[sql] case class InsertIntoHadoopFsRelationCommand(
        case (x, ys) if ys.length > 1 => "\"" + x + "\""
      }.mkString(", ")
      throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
-        s"cannot save to file.")
+        "cannot save to file.")
+    }
+
+    if (bucketSpec.exists(_.numBuckets <= 0)) {

--- End diff --

Yeah, currently this is the only place we support bucketed writing. Your idea is pretty good; let me move it to `BucketSpec`. Thanks!
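The agreed-upon move can be sketched like this (a simplified mirror of `BucketSpec`, not the actual Spark source): with a `require` in the constructor, every code path that builds a spec is validated once at construction time, and call sites such as `InsertIntoHadoopFsRelationCommand` no longer need their own check.

```scala
// Simplified sketch of BucketSpec with the validation moved into the
// constructor; any non-positive bucket count fails as soon as the spec
// is constructed, instead of at write time.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  require(numBuckets > 0,
    s"Expected positive number of buckets, but got `$numBuckets`.")
}
```

A valid spec like `BucketSpec(5, Seq("a"), Seq("b"))` constructs normally; `BucketSpec(0, ...)` throws immediately.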
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71667341

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

@@ -75,7 +75,12 @@ private[sql] case class InsertIntoHadoopFsRelationCommand(
        case (x, ys) if ys.length > 1 => "\"" + x + "\""
      }.mkString(", ")
      throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
-        s"cannot save to file.")
+        "cannot save to file.")
+    }
+
+    if (bucketSpec.exists(_.numBuckets <= 0)) {

--- End diff --

Is this the only place we need to check? What if we put this check inside the constructor of `BucketSpec`?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71667256

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

@@ -344,6 +344,11 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
          table, provider, partitionColumnNames, bucketSpec, mode, options, query)
      } else {
        val struct = Option(ctx.colTypeList()).map(createStructType)
+       if (struct.isEmpty && bucketSpec.nonEmpty) {

--- End diff --

Oh I see, makes sense.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71657222

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

@@ -344,6 +344,11 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
          table, provider, partitionColumnNames, bucketSpec, mode, options, query)
      } else {
        val struct = Option(ctx.colTypeList()).map(createStructType)
+       if (struct.isEmpty && bucketSpec.nonEmpty) {

--- End diff --

In that case, partition columns are inferred at runtime. Thus, although we ignore the user-specified partition columns, we still treat the table as a partitioned table. Here, we would really ignore the bucket spec, because bucket columns are not inferred at runtime; that means we would not treat it as a bucketed table. That is the major reason why this PR throws the exception. Let me know if a log message is still better here. Thanks!
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71651476

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala

@@ -344,6 +344,11 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
          table, provider, partitionColumnNames, bucketSpec, mode, options, query)
      } else {
        val struct = Option(ctx.colTypeList()).map(createStructType)
+       if (struct.isEmpty && bucketSpec.nonEmpty) {

--- End diff --

IIRC, we do a similar check for partition columns: if the schema is inferred, we warn users via a log message that the specified partition columns are ignored. Should we follow that here?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71622060

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

@@ -1264,6 +1265,29 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
      }
  }

+  test("create table using cluster by without schema specification") {
+    import testImplicits._
+    withTempPath { tempDir =>
+      withTable("jsonTable") {
+        (("a", "b") :: Nil).toDF().toJSON.rdd.saveAsTextFile(tempDir.getCanonicalPath)

--- End diff --

Sure, will change it. Thanks!
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71621951

Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala

@@ -199,7 +200,7 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSQLContext with
      }
  }

-  test("create table using as select - with bucket") {
+  test("create table using as select - with non-zero bucket") {

--- End diff --

Sure, will do. Thanks!
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71621984

Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala

@@ -212,7 +213,23 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSQLContext with
      )
      val table = catalog.getTableMetadata(TableIdentifier("t"))
      assert(DDLUtils.getBucketSpecFromTableProperties(table) ==
-        Some(BucketSpec(5, Seq("a"), Seq("b"
+        Option(BucketSpec(5, Seq("a"), Seq("b"
+      }
+  }
+
+  test("create table using as select - with zero bucket") {

--- End diff --

Sure, will do. Thanks!
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71621929

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

@@ -1264,6 +1265,29 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
      }
  }

+  test("create table using cluster by without schema specification") {

--- End diff --

Sure, will do. Thanks!
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71072892

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

@@ -1264,6 +1265,29 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
      }
  }

+  test("create table using cluster by without schema specification") {

--- End diff --

s/cluster/clustered? Or even the uppercase variant `CLUSTERED BY`, as in the other test descriptions.
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71072862

Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala

@@ -199,7 +200,7 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSQLContext with
      }
  }

-  test("create table using as select - with bucket") {
+  test("create table using as select - with non-zero bucket") {

--- End diff --

s/bucket/buckets?
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71072854

Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala

@@ -212,7 +213,23 @@ class CreateTableAsSelectSuite extends DataSourceTest with SharedSQLContext with
      )
      val table = catalog.getTableMetadata(TableIdentifier("t"))
      assert(DDLUtils.getBucketSpecFromTableProperties(table) ==
-        Some(BucketSpec(5, Seq("a"), Seq("b"
+        Option(BucketSpec(5, Seq("a"), Seq("b"
+      }
+  }
+
+  test("create table using as select - with zero bucket") {

--- End diff --

s/bucket/buckets?
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71072842

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

@@ -1264,6 +1265,29 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
      }
  }

+  test("create table using cluster by without schema specification") {
+    import testImplicits._
+    withTempPath { tempDir =>
+      withTable("jsonTable") {
+        (("a", "b") :: Nil).toDF().toJSON.rdd.saveAsTextFile(tempDir.getCanonicalPath)

--- End diff --

Why don't you use `toDF.write.json` instead?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r71015752

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

@@ -75,7 +75,14 @@ private[sql] case class InsertIntoHadoopFsRelationCommand(
        case (x, ys) if ys.length > 1 => "\"" + x + "\""
      }.mkString(", ")
      throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
-        s"cannot save to file.")
+        "cannot save to file.")
+    }
+
+    bucketSpec.foreach { spec =>

--- End diff --

Sure, let me fix it. Thanks!
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14210#discussion_r70931496

Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

@@ -75,7 +75,14 @@ private[sql] case class InsertIntoHadoopFsRelationCommand(
        case (x, ys) if ys.length > 1 => "\"" + x + "\""
      }.mkString(", ")
      throw new AnalysisException(s"Duplicate column(s) : $duplicateColumns found, " +
-        s"cannot save to file.")
+        "cannot save to file.")
+    }
+
+    bucketSpec.foreach { spec =>

--- End diff --

```
if (bucketSpec.exists(_.numBuckets <= 0)) {
  ...
}
```
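The suggested `exists` form works because, on an `Option`, `exists` is true only when a value is present and it satisfies the predicate, so the `foreach` plus an inner `if` collapses into a single expression. A small self-contained illustration, with `Spec` as a stand-in for Spark's `BucketSpec`:

```scala
// Spec stands in for Spark's BucketSpec here; only numBuckets matters for
// this check.
case class Spec(numBuckets: Int)

object BucketCheck {
  // None.exists(...) is always false, so an absent spec is never flagged
  // as invalid; only a present spec with numBuckets <= 0 is rejected.
  def hasInvalidBuckets(spec: Option[Spec]): Boolean =
    spec.exists(_.numBuckets <= 0)
}
```

So `hasInvalidBuckets(None)` and `hasInvalidBuckets(Some(Spec(5)))` are false, while `hasInvalidBuckets(Some(Spec(0)))` is true.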