[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-07-14 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r669676724



##
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##
@@ -2158,6 +2158,46 @@ abstract class AvroSuite
       }
     }
   }
+
+  test("SPARK-33865: Hive DDL with avro should check col name") {
+    withTable("test_ddl") {
+      withView("v") {
+        spark.range(1).createTempView("v")
+        withTempDir { dir =>
+          val e1 = intercept[SchemaParseException] {
+            spark.sql(
+              s"""
+                 |CREATE TABLE test_ddl USING AVRO
+                 |LOCATION '${dir}'
+                 |AS SELECT ID, ABS(ID) FROM v""".stripMargin)
+          }.getMessage
+          assert(e1.contains("Illegal character in: abs(ID)"))
+        }
+
+        withTempDir { dir =>
+          val e2 = intercept[SchemaParseException] {
+            spark.sql(
+              s"""
+                 |CREATE TABLE test_ddl USING AVRO
+                 |LOCATION '${dir}'
+                 |AS SELECT ID, IF(ID=1,1,0) FROM v""".stripMargin)
+          }.getMessage
+          assert(e2.contains("Illegal initial character: (IF((ID = 1), 1, 0))"))

Review comment:
   And also Hive tables?







[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-07-14 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r669676538



##
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##
@@ -2158,6 +2158,46 @@ abstract class AvroSuite
       }
     }
   }
+
+  test("SPARK-33865: Hive DDL with avro should check col name") {
+    withTable("test_ddl") {
+      withView("v") {
+        spark.range(1).createTempView("v")
+        withTempDir { dir =>
+          val e1 = intercept[SchemaParseException] {
+            spark.sql(
+              s"""
+                 |CREATE TABLE test_ddl USING AVRO
+                 |LOCATION '${dir}'
+                 |AS SELECT ID, ABS(ID) FROM v""".stripMargin)
+          }.getMessage
+          assert(e1.contains("Illegal character in: abs(ID)"))
+        }
+
+        withTempDir { dir =>
+          val e2 = intercept[SchemaParseException] {
+            spark.sql(
+              s"""
+                 |CREATE TABLE test_ddl USING AVRO
+                 |LOCATION '${dir}'
+                 |AS SELECT ID, IF(ID=1,1,0) FROM v""".stripMargin)
+          }.getMessage
+          assert(e2.contains("Illegal initial character: (IF((ID = 1), 1, 0))"))

Review comment:
   Can we also test direct writing like `df.write.parquet(path)`?
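   A minimal sketch of that direct-write variant, adapted to avro since that is the source under test here (`withTempPath` is the standard test helper; the exact error message is an assumption):
   ```scala
   withTempPath { path =>
     // selectExpr yields a column literally named "abs(id)", which avro rejects
     val e = intercept[SchemaParseException] {
       spark.range(1).selectExpr("id", "abs(id)")
         .write.format("avro").save(path.getCanonicalPath)
     }
     assert(e.getMessage.contains("Illegal character in: abs(id)"))
   }
   ```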







[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-07-14 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r669675979



##
File path: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
##
@@ -2158,6 +2158,46 @@ abstract class AvroSuite
       }
     }
   }
+
+  test("SPARK-33865: Hive DDL with avro should check col name") {
+    withTable("test_ddl") {
+      withView("v") {
+        spark.range(1).createTempView("v")
+        withTempDir { dir =>
+          val e1 = intercept[SchemaParseException] {
+            spark.sql(
+              s"""
+                 |CREATE TABLE test_ddl USING AVRO
+                 |LOCATION '${dir}'
+                 |AS SELECT ID, ABS(ID) FROM v""".stripMargin)
+          }.getMessage
+          assert(e1.contains("Illegal character in: abs(ID)"))
+        }
+
+        withTempDir { dir =>
+          val e2 = intercept[SchemaParseException] {
+            spark.sql(
+              s"""
+                 |CREATE TABLE test_ddl USING AVRO
+                 |LOCATION '${dir}'
+                 |AS SELECT ID, IF(ID=1,1,0) FROM v""".stripMargin)
+          }.getMessage
+          assert(e2.contains("Illegal initial character: (IF((ID = 1), 1, 0))"))

Review comment:
   I don't see a big difference compared to the test above. Can we remove 
one of them?







[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-07-14 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r669674686



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##
@@ -933,19 +934,36 @@ object DDLUtils {
       case HIVE_PROVIDER =>
         val serde = table.storage.serde
         if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
-          OrcFileFormat.checkFieldNames(colNames)
+          checkDataColNames("orc", colNames)
         } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde ||
             serde == Some("parquet.hive.serde.ParquetHiveSerDe") ||
             serde == Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")) {
-          ParquetSchemaConverter.checkFieldNames(colNames)
+          checkDataColNames("parquet", colNames)
+        } else if (serde == HiveSerDe.sourceToSerDe("avro").get.serde) {
+          checkDataColNames("avro", colNames)
         }
-      case "parquet" => ParquetSchemaConverter.checkFieldNames(colNames)
-      case "orc" => OrcFileFormat.checkFieldNames(colNames)
+      case provider if Seq("parquet", "orc", "avro").contains(provider) =>
+        checkDataColNames(provider, colNames)
       case _ =>
     }
   }
 
+  private[sql] def checkDataColNames(provider: String, colNames: Seq[String]): Unit = {
+    try {
+      DataSource.lookupDataSource(provider, SQLConf.get).getConstructor().newInstance() match {
+        case f: FileFormat => f.checkFieldNames(colNames)
+        case f: FileDataSourceV2 => f.checkFieldNames(colNames)

Review comment:
   File source v2 has a better place to put this check: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala#L84
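   A minimal, self-contained sketch of that idea (illustrative names, not Spark's actual `FileTable` API): validate field names once, where the table schema is materialized, so every v2 read/write path inherits the check.
   ```scala
   import org.apache.spark.sql.types.StructType

   trait SchemaValidatingTable {
     // Provided by the concrete table: the raw schema and a per-format name check.
     protected def rawSchema: StructType
     protected def checkFieldName(name: String): Unit

     // Every caller sees a schema whose field names have already been validated.
     lazy val schema: StructType = {
       rawSchema.fields.foreach(f => checkFieldName(f.name))
       rawSchema
     }
   }
   ```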







[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-07-14 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r669668332



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##
@@ -933,19 +934,36 @@ object DDLUtils {
       case HIVE_PROVIDER =>
         val serde = table.storage.serde
         if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
-          OrcFileFormat.checkFieldNames(colNames)
+          checkDataColNames("orc", colNames)
         } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde ||
             serde == Some("parquet.hive.serde.ParquetHiveSerDe") ||
             serde == Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")) {
-          ParquetSchemaConverter.checkFieldNames(colNames)
+          checkDataColNames("parquet", colNames)
+        } else if (serde == HiveSerDe.sourceToSerDe("avro").get.serde) {
+          checkDataColNames("avro", colNames)
         }
-      case "parquet" => ParquetSchemaConverter.checkFieldNames(colNames)
-      case "orc" => OrcFileFormat.checkFieldNames(colNames)
+      case provider if Seq("parquet", "orc", "avro").contains(provider) =>

Review comment:
   Why do we limit this to only these 3 file sources?







[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-07-14 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r669666304



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##
@@ -933,19 +934,36 @@ object DDLUtils {
       case HIVE_PROVIDER =>
         val serde = table.storage.serde
         if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
-          OrcFileFormat.checkFieldNames(colNames)
+          checkDataColNames("orc", colNames)

Review comment:
   How about inner (nested) field names? Do we need to check those too?
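   If nested names need checking too, a recursive collector along these lines could feed the same check (a sketch; `allFieldNames` is a hypothetical helper):
   ```scala
   import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

   // Collects top-level and nested struct field names, including structs
   // nested inside arrays and map values.
   def allFieldNames(schema: StructType): Seq[String] =
     schema.fields.toSeq.flatMap { f =>
       f.name +: (f.dataType match {
         case s: StructType => allFieldNames(s)
         case ArrayType(s: StructType, _) => allFieldNames(s)
         case MapType(_, s: StructType, _) => allFieldNames(s)
         case _ => Seq.empty[String]
       })
     }
   ```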







[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r572030438



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala
##
@@ -162,6 +162,8 @@ trait FileFormat {
    * By default all data types are supported.
    */
   def supportDataType(dataType: DataType): Boolean = true
+
+  def checkFieldNames(names: Seq[String]): Unit = {}

Review comment:
   Can we add some documentation for this API? We could probably require the exception type to be `AnalysisException`.
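   For reference, one possible shape of the documented API (a sketch of the requested docs, not the merged wording):
   ```scala
   /**
    * Checks whether the given field names are supported by this file format.
    *
    * The default implementation accepts all names. Implementations should
    * throw `AnalysisException` for names the underlying format cannot store,
    * so callers can rely on a single exception type across formats.
    */
   def checkFieldNames(names: Seq[String]): Unit = {}
   ```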








[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2021-02-08 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r57203



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -2876,6 +2877,33 @@ class HiveDDLSuite
     }
   }
 
+  test("SPARK-33865: Hive DDL with avro should check col name") {
+    withTable("t1") {
+      withView("v") {
+        spark.range(1).createTempView("v")
+
+        val e1 = intercept[SchemaParseException] {
Review comment:
   Shall we unify the exception type? It's weird if different formats use 
different exception types.








[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2020-12-30 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r550219183



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -2898,4 +2899,33 @@ class HiveDDLSuite
       }
     }
   }
+
+  test("SPARK-33865: Hive DDL with avro should check col name") {
+    withTable("tbl", "t1") {
+      withView("view1") {
+        spark.sql("CREATE TABLE tbl(id long)")
+        spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4")
+        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
+
+
+        val e1 = intercept[SchemaParseException] {
+          spark.sql(s"CREATE TABLE t1 STORED AS AVRO " +

Review comment:
   Can we also test CREATE TABLE without AS SELECT?
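   A sketch of that plain CREATE TABLE case (hypothetical test body; the expected message mirrors the CTAS case above and is an assumption):
   ```scala
   val e = intercept[SchemaParseException] {
     spark.sql("CREATE TABLE t2(`abs(id)` LONG) STORED AS AVRO")
   }
   assert(e.getMessage.contains("Illegal character in: abs(id)"))
   ```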








[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2020-12-30 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r550219112



##
File path: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
##
@@ -2898,4 +2899,33 @@ class HiveDDLSuite
       }
     }
   }
+
+  test("SPARK-33865: Hive DDL with avro should check col name") {
+    withTable("tbl", "t1") {
+      withView("view1") {
+        spark.sql("CREATE TABLE tbl(id long)")
+        spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4")
+        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")

Review comment:
   nit: we can prepare the data with `spark.range(1).createTempView("v")` instead of creating a table and then a view.








[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2020-12-23 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r548388043



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala
##
@@ -920,9 +921,12 @@ object DDLUtils {
           serde == Some("parquet.hive.serde.ParquetHiveSerDe") ||
           serde == Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")) {
           ParquetSchemaConverter.checkFieldNames(colNames)
+        } else if (serde == HiveSerDe.sourceToSerDe("avro").get.serde) {
+          AvroFileFormat.checkFieldNames(colNames)
         }
       case "parquet" => ParquetSchemaConverter.checkFieldNames(colNames)
       case "orc" => OrcFileFormat.checkFieldNames(colNames)
+      case "avro" => AvroFileFormat.checkFieldNames(colNames)

Review comment:
   What we can do here is:
   ```
   DataSource.lookupDataSource(provider, conf) match {
     case f: FileFormat => f.checkFieldNames... // a new API
     case _ =>
   }
   ```
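   Spelled out against names already in this PR's diff (`lookupDataSource` returns a `Class`, so it has to be instantiated before matching):
   ```scala
   DataSource.lookupDataSource(provider, SQLConf.get).getConstructor().newInstance() match {
     case f: FileFormat => f.checkFieldNames(colNames) // the proposed new API
     case _ => // non-file sources: nothing to check here
   }
   ```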








[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2020-12-22 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r547143306



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/avro/AvroFileFormat.scala
##
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.avro
+
+import org.apache.avro.SchemaParseException
+
+object AvroFileFormat {

Review comment:
   It's better to implement data sources in external modules, which is 
closer to what end-users can do. If we can't check field names for avro, then 
custom file source implementations can't do it either, and our API has a 
problem.








[GitHub] [spark] cloud-fan commented on a change in pull request #30869: [SPARK-33865][SQL] When HiveDDL, we need check avro schema too

2020-12-21 Thread GitBox


cloud-fan commented on a change in pull request #30869:
URL: https://github.com/apache/spark/pull/30869#discussion_r546579620



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/avro/AvroFileFormat.scala
##
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.avro
+
+import org.apache.avro.SchemaParseException
+
+object AvroFileFormat {

Review comment:
   We need a bit more refactoring here. The avro data source is implemented in an external module, so it's weird to put `AvroFileFormat` in sql/core. Can we add an API to `trait FileFormat` to do this check?
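   A sketch of the proposed shape (illustrative only; the trait hook and the avro override mirror this PR's direction, while the validation body is an assumption based on Avro's name grammar):
   ```scala
   import org.apache.avro.SchemaParseException

   // sql/core side: a default no-op hook on the FileFormat trait.
   trait FileFormat {
     def checkFieldNames(names: Seq[String]): Unit = {}
   }

   // External avro module: reject names Avro cannot represent.
   // Avro names must match [A-Za-z_][A-Za-z0-9_]*.
   class AvroFileFormat extends FileFormat {
     override def checkFieldNames(names: Seq[String]): Unit =
       names.foreach { name =>
         if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
           throw new SchemaParseException(s"Illegal character in: $name")
         }
       }
   }
   ```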




