[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16804


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r100213581
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -152,14 +152,39 @@ abstract class OrcSuite extends QueryTest with 
TestHiveSingleton with BeforeAndA
 assert(new OrcOptions(Map("Orc.Compress" -> "NONE")).compressionCodec 
== "NONE")
   }
 
-  test("SPARK-18220: read Hive orc table with varchar column") {
+  test("SPARK-19459/SPARK-18220: read char/varchar column written by 
Hive") {
 val hiveClient = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+val location = Utils.createTempDir().toURI
--- End diff --

shall we remove this temp dir in the finally block?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-08 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r100085484
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -162,6 +162,40 @@ abstract class OrcSuite extends QueryTest with 
TestHiveSingleton with BeforeAndA
   hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
 }
   }
+
+  test("SPARK-19459: read char/varchar column written by Hive") {
+val hiveClient = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+val location = Utils.createTempDir().toURI
+try {
+  hiveClient.runSqlHive(
+"""
+   |CREATE EXTERNAL TABLE hive_orc(
+   |  a STRING,
+   |  b CHAR(10),
+   |  c VARCHAR(10))
+   |STORED AS orc""".stripMargin)
+  // Hive throws an exception if I assign the location in the create 
table statment.
+  hiveClient.runSqlHive(
+s"ALTER TABLE hive_orc SET LOCATION '$location'")
+  hiveClient.runSqlHive(
+"INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) 
t")
+
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99938462
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -162,6 +162,40 @@ abstract class OrcSuite extends QueryTest with 
TestHiveSingleton with BeforeAndA
   hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
 }
   }
+
+  test("SPARK-19459: read char/varchar column written by Hive") {
+val hiveClient = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+val location = Utils.createTempDir().toURI
+try {
+  hiveClient.runSqlHive(
+"""
+   |CREATE EXTERNAL TABLE hive_orc(
+   |  a STRING,
+   |  b CHAR(10),
+   |  c VARCHAR(10))
+   |STORED AS orc""".stripMargin)
+  // Hive throws an exception if I assign the location in the create 
table statment.
+  hiveClient.runSqlHive(
+s"ALTER TABLE hive_orc SET LOCATION '$location'")
+  hiveClient.runSqlHive(
+"INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) 
t")
+
--- End diff --

yeah that makes sense


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99938509
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala ---
@@ -21,4 +21,12 @@ package org.apache.spark.sql
  * Contains a type system for attributes produced by relations, including 
complex types like
  * structs, arrays and maps.
  */
-package object types
+package object types {
+  /**
+   * Metadata key used to store the the raw hive type string in the 
metadata of StructField. This
--- End diff --

will do


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99937955
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
@@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.catalog._
 import org.apache.spark.sql.catalyst.expressions.{AttributeMap, 
AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
Statistics}
 import org.apache.spark.sql.execution.FileRelation
-import org.apache.spark.sql.types.StructField
+import org.apache.spark.sql.types._
--- End diff --

That just makes it easier to use `HIVE_TYPE_STRING`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99937828
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala 
---
@@ -51,6 +51,9 @@ private[hive] case class HiveSimpleUDF(
   @transient
   lazy val function = funcWrapper.createFunction[UDF]()
 
+  {
+function
+  }
--- End diff --

That is my bad.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99926422
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -162,6 +162,40 @@ abstract class OrcSuite extends QueryTest with 
TestHiveSingleton with BeforeAndA
   hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
 }
   }
+
+  test("SPARK-19459: read char/varchar column written by Hive") {
+val hiveClient = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+val location = Utils.createTempDir().toURI
+try {
+  hiveClient.runSqlHive(
+"""
+   |CREATE EXTERNAL TABLE hive_orc(
+   |  a STRING,
+   |  b CHAR(10),
+   |  c VARCHAR(10))
+   |STORED AS orc""".stripMargin)
+  // Hive throws an exception if I assign the location in the create 
table statment.
+  hiveClient.runSqlHive(
+s"ALTER TABLE hive_orc SET LOCATION '$location'")
+  hiveClient.runSqlHive(
+"INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) 
t")
+
--- End diff --

How about adding one more check?

```Scala
  checkAnswer(spark.table("hive_orc"), Row("a", "b ", "c"))
```

Then, we can remove the test case `SPARK-18220: read Hive orc table with 
varchar column`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99899779
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala ---
@@ -21,4 +21,12 @@ package org.apache.spark.sql
  * Contains a type system for attributes produced by relations, including 
complex types like
  * structs, arrays and maps.
  */
-package object types
+package object types {
+  /**
+   * Metadata key used to store the the raw hive type string in the 
metadata of StructField. This
--- End diff --

Nit: `the the` -> `the`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99897081
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala 
---
@@ -51,6 +51,9 @@ private[hive] case class HiveSimpleUDF(
   @transient
   lazy val function = funcWrapper.createFunction[UDF]()
 
+  {
+function
+  }
--- End diff --

What is the reason for this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99896659
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
@@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.catalog._
 import org.apache.spark.sql.catalyst.expressions.{AttributeMap, 
AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, 
Statistics}
 import org.apache.spark.sql.execution.FileRelation
-import org.apache.spark.sql.types.StructField
+import org.apache.spark.sql.types._
--- End diff --

Unnecessary change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-06 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99627383
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala ---
@@ -21,4 +21,10 @@ package org.apache.spark.sql
  * Contains a type system for attributes produced by relations, including 
complex types like
  * structs, arrays and maps.
  */
-package object types
+package object types {
+  /**
+   * Metadata key used to store the Hive type name. This is relevant for 
datatypes that do not
+   * have a direct Spark SQL counterpart, such as CHAR and VARCHAR.
+   */
+  val HIVE_TYPE_STRING = "HIVE_TYPE_STRING"
--- End diff --

Yeah we should.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99625665
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -162,6 +162,28 @@ abstract class OrcSuite extends QueryTest with 
TestHiveSingleton with BeforeAndA
   hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
 }
   }
+
+  test("read varchar column from orc tables created by hive") {
+try {
--- End diff --

how about
```
val hiveClient = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
 try {
   hiveClient.runSqlHive("CREATE TABLE hive_orc(a VARCHAR(10)) STORED 
AS orc LOCATION xxx")
   hiveClient.runSqlHive("INSERT INTO TABLE orc_varchar SELECT 'a' FROM 
(SELECT 1) t")
   sql("CREATE EXTERNAL TABLE spark_orc ...")
   checkAnswer...
 } finally {
sql("DROP TABLE IF EXISTS ...")
...
  }
```
then we don't need to create the orc file manually.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99623985
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala ---
@@ -21,4 +21,10 @@ package org.apache.spark.sql
  * Contains a type system for attributes produced by relations, including 
complex types like
  * structs, arrays and maps.
  */
-package object types
+package object types {
+  /**
+   * Metadata key used to store the Hive type name. This is relevant for 
datatypes that do not
+   * have a direct Spark SQL counterpart, such as CHAR and VARCHAR.
+   */
+  val HIVE_TYPE_STRING = "HIVE_TYPE_STRING"
--- End diff --

shall we remove `HiveUtils. HIVE_TYPE_STRING`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99623795
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -1457,8 +1457,31 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] 
with Logging {
*/
   override def visitColType(ctx: ColTypeContext): StructField = 
withOrigin(ctx) {
 import ctx._
-val structField = StructField(identifier.getText, 
typedVisit(dataType), nullable = true)
-if (STRING == null) structField else 
structField.withComment(string(STRING))
+
+val builder = new MetadataBuilder
+// Add comment to metadata
+if (STRING != null) {
+  builder.putString("comment", string(STRING))
+}
+// Add Hive type string to metadata.
+dataType match {
+  case p: PrimitiveDataTypeContext =>
+val dt = p.identifier.getText.toLowerCase
+(dt, p.INTEGER_VALUE().asScala.toList) match {
--- End diff --

nit:
```
p.identifier.getText.toLowerCase match {
  case "varchar" | "char" => builder.putString(HIVE_TYPE_STRING, 
dataType.getText.toLowerCase)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/16804#discussion_r99479616
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -162,6 +162,27 @@ abstract class OrcSuite extends QueryTest with 
TestHiveSingleton with BeforeAndA
   hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
 }
   }
+
+  test("read varchar column from orc tables created by hive") {
+try {
+  // This is an ORC file with a single VARCHAR(10) column that's 
created using Hive 1.2.1
--- End diff --

Hi, @hvanhovell .
Nit. It's three columns.
```
Structure for orc/orc_text_types.orc
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:char(10),_col2:varchar(10)>
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16804: [SPARK-19459][SQL] Add Hive datatype (char/varcha...

2017-02-04 Thread hvanhovell
GitHub user hvanhovell opened a pull request:

https://github.com/apache/spark/pull/16804

[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata

## What changes were proposed in this pull request?
Reading from an existing ORC table which contains `char` or `varchar` 
columns can fail with a `ClassCastException` if the table metadata has been 
created using Spark. This is caused by the fact that spark internally replaces 
`char` and `varchar` columns with a `string` column.

This PR fixes this by adding the hive type to the `StructField's` metadata 
under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the 
ORC reader, see https://github.com/apache/spark/pull/16060 for more details on 
how the metadata is used.

## How was this patch tested?
Added a regression test to `OrcSourceSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hvanhovell/spark SPARK-19459

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16804.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16804


commit c6a5bf60b9d92dde5be2d0b60af42acf92095aa4
Author: Herman van Hovell 
Date:   2017-02-04T10:58:54Z

Add Hive datatype (char/varchar) to struct field metadata. This fixes 
issues with char/varchar columns in ORC.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org