[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136923762
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

I'll try another way.


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136921664
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

@gatorsmile, I tried the following. We can add a check for `ParquetFileFormat`, but not for `OrcFileFormat`. Should I change the PR title and scope instead?

```scala
table.provider.get.toLowerCase match {
  case "parquet" =>
    dataSource.schema.map(_.name).foreach(
      org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.checkFieldName)
  case "orc" =>
    dataSource.schema.map(_.name).foreach(
      org.apache.spark.sql.hive.OrcRelation.checkFieldName)
}
```


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136919173
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

Do we need to add a datasource-specific check in `createDataSourceTables` for Parquet and ORC?


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136918576
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

It seems to be the same situation as Parquet: `CREATE TABLE` passes, but `SELECT` raises an exception.
```scala
scala> sql("CREATE TABLE parquet1(`a b` int) using parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet1").show
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136916250
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

Yep.


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136916230
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +172,18 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).
--- End diff --

I agree with you that `column` is more accurate here. Previously, I borrowed this wording from `ParquetSchemaConverter`:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565-L572
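
For reference, the Parquet-side check there is roughly the following (a sketch of `ParquetSchemaConverter.checkFieldName`; the real method reports the error through a private `checkConversionRequirement` helper, inlined here as a plain throw, and it assumes the code lives under an `org.apache.spark.sql` package where `AnalysisException`'s constructor is visible):

```scala
import org.apache.spark.sql.AnalysisException

def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  if (name.matches(".*[ ,;{}()\n\t=].*")) {
    throw new AnalysisException(
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
         |Please use alias to rename it.
       """.stripMargin.split("\n").mkString(" ").trim)
  }
}
```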


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136915649
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +172,18 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).
--- End diff --

Thank you for review. Sure.


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136910368
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2000,4 +2000,15 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       assert(setOfPath.size() == pathSizeToDeleteOnExit)
     }
   }
+
+  test("SPARK-21912 Creating ORC datasource table should check invalid column names") {
+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
--- End diff --

This is CTAS. How about `CREATE TABLE`?
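
For example, a hypothetical sibling check inside the same `foreach` (a sketch of the post-fix behavior; the table name and asserted message fragment are illustrative):

```scala
// Sketch: cover the plain CREATE TABLE path in addition to CTAS.
withTable("orc2") {
  val m = intercept[AnalysisException] {
    sql(s"CREATE TABLE orc2(`column$name` INT) USING ORC")
  }
  assert(m.getMessage.contains("invalid character"))
}
```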


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136910147
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +172,18 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")
+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).
--- End diff --

Nit: `Attribute` -> `Column`


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136878102
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    // ,;{}()\n\t= and space are special characters in ORC schema
--- End diff --

Thank you for the review, @tejasapatil!
That's a good idea. Right, it's not an exhaustive list. I'll update the PR.


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-04 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/19124#discussion_r136877087
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
       }
     }
   }
+
+  private def checkFieldName(name: String): Unit = {
+    // ,;{}()\n\t= and space are special characters in ORC schema
--- End diff --

Is this an exhaustive list? E.g., it looks like `?` is not allowed either. Given that the underlying lib (ORC) can evolve to support or drop support for certain chars, it's safer to rely on a method rather than coming up with a blacklist. Can you simply call `TypeInfoUtils.getTypeInfoFromTypeString` or a related method which would do this check?

```
Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but '?' is found.
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770)
  at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:194)
  at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
  at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
...
...
```
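
A minimal sketch of what I mean (assuming the check stays in `sql/hive`, where Hive's serde classes are on the classpath and `AnalysisException`'s constructor is visible; the wrapping message text is illustrative):

```scala
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils

import org.apache.spark.sql.AnalysisException

private def checkFieldName(name: String): Unit = {
  try {
    // Let the library's own parser decide which names are legal, instead of
    // maintaining a character blacklist that can drift from the library.
    TypeInfoUtils.getTypeInfoFromTypeString(s"struct<$name:int>")
  } catch {
    case e: IllegalArgumentException =>
      throw new AnalysisException(
        s"""Column name "$name" is rejected by the type parser: ${e.getMessage}""")
  }
}
```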


---




[GitHub] spark pull request #19124: [SPARK-21912][SQL] Creating ORC datasource table ...

2017-09-04 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19124

[SPARK-21912][SQL] Creating ORC datasource table should check invalid column names

## What changes were proposed in this pull request?

Currently, users hit job abortions when creating ORC data source tables with invalid column names. We should prevent this by raising an **AnalysisException** with a guide to use aliases instead, like Parquet data source tables do.

**BEFORE**
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:28:21 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows.
```

**AFTER**
```scala
scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
```

## How was this patch tested?

Passes Jenkins with a new test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-21912

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19124


commit 808dfe0fcd9de2f43b33f0d1d084172b5624f2a8
Author: Dongjoon Hyun 
Date:   2017-09-04T20:46:15Z

[SPARK-21912][SQL] Creating ORC datasource table should check invalid column names




---
