[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-15 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/15900

[SPARK-18464][SQL] support old table which doesn't store schema in metastore

## What changes were proposed in this pull request?

Before Spark 2.1, users could create an external data source table without a 
schema, and we would infer the table schema at runtime. In Spark 2.1, we decided 
to infer the schema when the table is created, so that we don't need to infer 
it again and again at runtime.

This is a good improvement, but we should still respect and support old 
tables that don't store the table schema in the metastore.
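
As a rough, hedged illustration (not code from this patch): such an old-style table 
is registered without any schema properties, and Spark is then expected to infer 
the columns from the underlying files when the table is read.

```scala
// Minimal sketch, assuming a SparkSession `spark` with Hive support and some
// Parquet files already written under /tmp/old_table_data (both are assumptions).
spark.sql(
  """
    |CREATE TABLE old_table
    |USING parquet
    |OPTIONS (path '/tmp/old_table_data')
  """.stripMargin)

// With no schema stored in the metastore, the columns have to be recovered from
// the Parquet footers when the table is read.
spark.table("old_table").printSchema()
```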

## How was this patch tested?

Regression test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark hive-catalog

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15900.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15900


commit a75cf309a4141da0271bca11769b8355bb493758
Author: Wenchen Fan 
Date:   2016-11-16T06:09:18Z

support old table which doesn't store schema in table properties







[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15900#discussion_r88289046
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala ---
@@ -1371,4 +1371,23 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   }
 }
   }
+
+  test("SPARK-18464: support old table which doesn't store schema in table properties") {
+    withTable("old") {
+      withTempPath { path =>
+        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getAbsolutePath)
+        val tableDesc = CatalogTable(
+          identifier = TableIdentifier("old", Some("default")),
+          tableType = CatalogTableType.EXTERNAL,
+          storage = CatalogStorageFormat.empty.copy(
+            properties = Map("path" -> path.getAbsolutePath)
+          ),
+          schema = new StructType(),
+          properties = Map(
+            HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
+        hiveClient.createTable(tableDesc, ignoreIfExists = false)
+        checkAnswer(spark.table("old"), Row(1, "a"))
+      }
+    }
+  }
--- End diff --

It will be good to actually create a set of compatibility tests to make 
sure a new version of Spark can access table metadata created by an older 
version (starting from Spark 1.3) without problems. Let's create a follow-up 
JIRA for this task and do it during the QA period of Spark 2.1.





[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15900#discussion_r88289294
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1023,6 +1023,11 @@ object HiveExternalCatalog {
       // After SPARK-6024, we removed this flag.
       // Although we are not using `spark.sql.sources.schema` any more, we need to still support.
       DataType.fromJson(schema.get).asInstanceOf[StructType]
+    } else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
+      // If there is no schema information in table properties, it means the schema of this table
+      // was empty when saving into metastore, which is possible in older version of Spark. We
+      // should respect it.
+      new StructType()
--- End diff --

btw, this function is only needed for data source tables, right?





[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15900#discussion_r88289527
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala ---
@@ -1371,4 +1371,23 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   }
 }
   }
+
+  test("SPARK-18464: support old table which doesn't store schema in table properties") {
+    withTable("old") {
+      withTempPath { path =>
+        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getAbsolutePath)
+        val tableDesc = CatalogTable(
+          identifier = TableIdentifier("old", Some("default")),
+          tableType = CatalogTableType.EXTERNAL,
+          storage = CatalogStorageFormat.empty.copy(
+            properties = Map("path" -> path.getAbsolutePath)
+          ),
+          schema = new StructType(),
+          properties = Map(
+            HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
+        hiveClient.createTable(tableDesc, ignoreIfExists = false)
+        checkAnswer(spark.table("old"), Row(1, "a"))
--- End diff --

Can we also test `describe table` and make sure it provides the correct 
column info?
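
A hedged sketch of such an extra check, reusing the `old` table from the quoted 
test (the assertion shape here is illustrative, not necessarily what the final 
patch adds):

```scala
// Illustrative only: relies on the test fixtures above (the `old` table, `sql`,
// `checkAnswer`) and on DESC returning (col_name, data_type, comment) rows.
checkAnswer(
  sql("DESC old").select("col_name", "data_type"),
  Row("i", "int") :: Row("j", "string") :: Nil)
```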





[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15900#discussion_r88383221
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1023,6 +1023,11 @@ object HiveExternalCatalog {
       // After SPARK-6024, we removed this flag.
       // Although we are not using `spark.sql.sources.schema` any more, we need to still support.
       DataType.fromJson(schema.get).asInstanceOf[StructType]
+    } else if (props.filterKeys(_.startsWith(DATASOURCE_SCHEMA_PREFIX)).isEmpty) {
+      // If there is no schema information in table properties, it means the schema of this table
+      // was empty when saving into metastore, which is possible in older version of Spark. We
+      // should respect it.
+      new StructType()
--- End diff --

No, since we also store the schema for Hive tables, Hive tables will also call 
this function. But a Hive table will never go into this branch, as it always has 
a schema. (The removal of runtime schema inference happened before we started 
storing the schema of Hive tables.)
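
For readers following along, a simplified, hedged sketch of the restore logic 
being discussed (the property keys mirror the `spark.sql.sources.schema*` 
constants, but the real `HiveExternalCatalog` code is shaped differently and 
handles more cases):

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Simplified sketch only, not the actual HiveExternalCatalog implementation.
object SchemaRestoreSketch {
  private val SchemaKey = "spark.sql.sources.schema"      // legacy, pre SPARK-6024
  private val SchemaPrefix = "spark.sql.sources.schema."  // cf. DATASOURCE_SCHEMA_PREFIX

  def restoreSchema(props: Map[String, String]): StructType =
    props.get(SchemaKey) match {
      case Some(json) =>
        // Legacy single-property schema.
        DataType.fromJson(json).asInstanceOf[StructType]
      case None if !props.keys.exists(_.startsWith(SchemaPrefix)) =>
        // No schema info at all: an old data source table whose schema was never
        // stored. Return an empty schema and infer it at read time.
        // (A Hive table never reaches this branch, because it always has a schema.)
        new StructType()
      case None =>
        // Schema stored split across numbered part properties; reassemble and parse.
        val numParts = props(SchemaPrefix + "numParts").toInt
        val json = (0 until numParts).map(i => props(s"${SchemaPrefix}part.$i")).mkString
        DataType.fromJson(json).asInstanceOf[StructType]
    }
}
```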





[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15900#discussion_r88384871
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala ---
@@ -1371,4 +1371,23 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   }
 }
   }
+
+  test("SPARK-18464: support old table which doesn't store schema in table properties") {
+    withTable("old") {
+      withTempPath { path =>
+        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getAbsolutePath)
+        val tableDesc = CatalogTable(
+          identifier = TableIdentifier("old", Some("default")),
+          tableType = CatalogTableType.EXTERNAL,
+          storage = CatalogStorageFormat.empty.copy(
+            properties = Map("path" -> path.getAbsolutePath)
+          ),
+          schema = new StructType(),
+          properties = Map(
+            HiveExternalCatalog.DATASOURCE_PROVIDER -> "parquet"))
+        hiveClient.createTable(tableDesc, ignoreIfExists = false)
+        checkAnswer(spark.table("old"), Row(1, "a"))
+      }
+    }
+  }
--- End diff --

created https://issues.apache.org/jira/browse/SPARK-18482





[GitHub] spark pull request #15900: [SPARK-18464][SQL] support old table which doesn'...

2016-11-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15900

