[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136923762

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(setOfPath.size() == pathSizeToDeleteOnExit)
   }
 }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

I'll try another way.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136921664

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(setOfPath.size() == pathSizeToDeleteOnExit)
   }
 }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

@gatorsmile, I tried the following. We can add a check for `ParquetFileFormat`, but not for `OrcFileFormat`. Should I change the PR title and scope instead?

```scala
table.provider.get.toLowerCase match {
  case "parquet" =>
    dataSource.schema.map(_.name).foreach(
      org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.checkFieldName)
  case "orc" =>
    dataSource.schema.map(_.name).foreach(
      org.apache.spark.sql.hive.OrcRelation.checkFieldName)
}
```
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136919173

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(setOfPath.size() == pathSizeToDeleteOnExit)
   }
 }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

Do we need to add a datasource-specific check to `createDataSourceTables` for Parquet and ORC?
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136918576

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(setOfPath.size() == pathSizeToDeleteOnExit)
   }
 }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

It seems to be the same situation as with Parquet: `CREATE TABLE` passes, but `SELECT` raises an exception.

```scala
scala> sql("CREATE TABLE parquet1(`a b` int) using parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet1").show
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```
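For comparison, before this patch the corresponding ORC table fails at write time rather than at analysis time (behavior quoted from the PR description at the bottom of this thread; the schema string in the error message is reconstructed from the `a b` column):

```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
// The analyzer accepts the CTAS, but the write task aborts with:
// java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
```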
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136916250

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(setOfPath.size() == pathSizeToDeleteOnExit)
   }
 }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

Yep.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136916230

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +172,18 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).
--- End diff --

I agree with you that `column` is more accurate here. Previously, I borrowed this from `ParquetSchemaConverter`:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565-L572
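For reference, a sketch of the Parquet-side check the message was borrowed from; the actual implementation lives at the linked `ParquetSchemaConverter` location, so treat the regex and the holder object here as an approximation:

```scala
package org.apache.spark.sql.execution.datasources.parquet  // AnalysisException's constructor is restricted to Spark's sql packages

import org.apache.spark.sql.AnalysisException

object FieldNameSketch {  // hypothetical holder, for illustration only
  // Approximation of ParquetSchemaConverter.checkFieldName (see the link above):
  // reject any field name containing a character that is special in a Parquet schema.
  def checkFieldName(name: String): Unit = {
    // ,;{}()\n\t= and space are special characters in a Parquet schema
    if (name.matches(".*[ ,;{}()\n\t=].*")) {
      throw new AnalysisException(
        s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=". """ +
          "Please use alias to rename it.")
    }
  }
}
```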
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136915649

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +172,18 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).
--- End diff --

Thank you for the review. Sure.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136910368

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     assert(setOfPath.size() == pathSizeToDeleteOnExit)
   }
 }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

This is CTAS. How about `CREATE TABLE`?
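To make the two paths concrete, a minimal sketch of both statements (the space-containing name `a b` is borrowed from the examples elsewhere in this thread):

```scala
// CTAS: the invalid column name comes from the query's output schema
// (the path the test above exercises).
sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")

// Plain CREATE TABLE: the invalid column name comes from an explicit schema;
// the question is whether this path is checked as well.
sql("CREATE TABLE orc2(`a b` INT) USING ORC")
```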
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136910147

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +172,18 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).
--- End diff --

Nit: `Attribute` -> `Column`
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136878102

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    // ,;{}()\n\t= and space are special characters in ORC schema
--- End diff --

Thank you for the review, @tejasapatil! That's a good idea. Right, it's not an exhaustive list. I'll update the PR.
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19124#discussion_r136877087

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    // ,;{}()\n\t= and space are special characters in ORC schema
--- End diff --

Is this an exhaustive list? E.g., it looks like `?` is not allowed either. Given that the underlying lib (ORC) can evolve to support / not support certain chars, it's safer to rely on some method rather than coming up with a blacklist. Can you simply call `TypeInfoUtils.getTypeInfoFromTypeString` or any related method which would do this check?

```
Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but '?' is found.
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
	at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770)
	at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:194)
	at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
	at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
	...
```
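A sketch of the suggested method-based check, assuming Hive's `TypeInfoUtils` is on the classpath; the merged patch ultimately used ORC's `TypeDescription.fromString` instead (see the updated `checkFieldName` diff earlier in this thread), and the holder object here is hypothetical:

```scala
package org.apache.spark.sql.hive.orc  // AnalysisException's constructor is restricted to Spark's sql packages

import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils
import org.apache.spark.sql.AnalysisException

object FieldNameCheck {  // hypothetical holder object, not an existing Spark class
  // Sketch of the suggestion: delegate field-name validation to Hive's type-string
  // parser instead of maintaining a blacklist of special characters.
  def checkFieldName(name: String): Unit = {
    try {
      // Throws IllegalArgumentException (as in the stack trace above) when `name`
      // contains a character the struct type-string grammar cannot parse, e.g. '?'.
      TypeInfoUtils.getTypeInfoFromTypeString(s"struct<$name:int>")
    } catch {
      case _: IllegalArgumentException =>
        throw new AnalysisException(
          s"""Attribute name "$name" contains invalid character(s). Please use alias to rename it.""")
    }
  }
}
```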
[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...
GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/19124

    [SPARK-21912][SQL] Creating ORC datasource table should check invalid column names

## What changes were proposed in this pull request?

Currently, users run into job abortions while creating ORC data source tables with invalid column names. We had better prevent this by raising an **AnalysisException** with a guide to use aliases instead, as Parquet data source tables do.

**BEFORE**
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:28:21 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows.
```

**AFTER**
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```

## How was this patch tested?

Pass the Jenkins with a new test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-21912

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19124.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19124

commit 808dfe0fcd9de2f43b33f0d1d084172b5624f2a8
Author: Dongjoon Hyun
Date:   2017-09-04T20:46:15Z

    [SPARK-21912][SQL] Creating ORC datasource table should check invalid column names
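For completeness, a minimal standalone sketch of the validation idea behind this patch, assuming ORC's `org.apache.orc.TypeDescription` is on the classpath; it mirrors the `checkFieldName` shown in the review diffs above, and the boolean wrapper `isValidOrcFieldName` is a hypothetical helper for illustration:

```scala
import org.apache.orc.TypeDescription

// ORC's own schema parser decides which field names are legal; the patch turns
// a parse failure into an AnalysisException at table-creation time.
def isValidOrcFieldName(name: String): Boolean =
  try {
    TypeDescription.fromString(s"struct<$name:int>")
    true
  } catch {
    case _: IllegalArgumentException => false
  }

// isValidOrcFieldName("a b")  // false: space is a special character
// isValidOrcFieldName("ab")   // true
```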