[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-03 Thread vanzin
Github user vanzin closed the pull request at:

https://github.com/apache/spark/pull/18824





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-03 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131211342
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -413,7 +414,10 @@ private[hive] class HiveClientImpl(
         unsupportedFeatures += "partitioned view"
       }
 
-      val properties = Option(h.getParameters).map(_.asScala.toMap).orNull
+      val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())
+
+      val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
+        .orElse(Some(DDLUtils.HIVE_PROVIDER))
--- End diff --

The restoring you mention is done in 
`HiveExternalCatalog.restoreTableMetadata`. Let me see if I can use that 
instead of making this change.
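
For context, a minimal standalone sketch of the provider restoration being discussed. The property key and the default mirror `HiveExternalCatalog.DATASOURCE_PROVIDER` ("spark.sql.sources.provider") and `DDLUtils.HIVE_PROVIDER` ("hive"), but this is a simplified model, not the actual `restoreTableMetadata` code:

    object ProviderRestoreSketch {
      val DatasourceProvider = "spark.sql.sources.provider"  // HiveExternalCatalog.DATASOURCE_PROVIDER
      val HiveProvider = "hive"                              // DDLUtils.HIVE_PROVIDER

      // Restore the provider from the raw table properties, defaulting to Hive.
      def restoreProvider(properties: Map[String, String]): String =
        properties.getOrElse(DatasourceProvider, HiveProvider)

      def main(args: Array[String]): Unit = {
        println(restoreProvider(Map(DatasourceProvider -> "parquet")))  // parquet (DS table)
        println(restoreProvider(Map.empty))                             // hive (serde table)
      }
    }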





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-03 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131210013
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -616,15 +616,24 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // Add table metadata such as table schema, partition columns, etc. to table properties.
     val updatedTable = withNewSchema.copy(
       properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
+
+    // If it's a data source table, make sure the original schema is left unchanged; the
+    // actual schema is recorded as a table property.
+    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
+      updatedTable.copy(schema = rawTable.schema)
--- End diff --

Hmm, I see that this will break DS tables created with `newHiveCompatibleMetastoreTable` instead of `newSparkSQLSpecificMetastoreTable`.

For the former, the only thing I can see that could be used to identify the case is the presence of serde properties in the table metadata. That could replace the `DDLUtils.isDatasourceTable(updatedTable)` check to decide whether the schema needs to be updated.

For the latter case, I see that `newSparkSQLSpecificMetastoreTable` stores the partition schema as the table's schema (which sort of explains the weird exception handling I saw). So this code is only correct if the partition schema cannot change. Where is the partition schema for a DS table defined? Is it under the control of the user (or the data source implementation)? Because if it can change, you can run into pretty much the same issue.
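
To make the two cases concrete, here is a small standalone model of the distinction described above: a Hive-compatible DS table carries real serde information and the full schema, while a Spark-SQL-specific one does not. This is an illustration only; `StoredTable` is a made-up stand-in for Spark's `CatalogTable`:

    // Hypothetical model; StoredTable stands in for Spark's CatalogTable.
    case class StoredTable(
        provider: Option[String],
        serde: Option[String],           // set for Hive-compatible DS tables
        metastoreSchema: Seq[String])    // full schema, or only partition columns

    object MetastoreFlavorSketch {
      // The check suggested above: a DS table is Hive-compatible when it was
      // stored with real serde information.
      def isHiveCompatibleDsTable(t: StoredTable): Boolean =
        t.provider.exists(_ != "hive") && t.serde.isDefined

      def main(args: Array[String]): Unit = {
        val compatible = StoredTable(Some("parquet"), Some("parquet serde"), Seq("a", "b"))
        val sparkOnly  = StoredTable(Some("json"), None, Seq("part_col"))
        println(isHiveCompatibleDsTable(compatible))  // true
        println(isHiveCompatibleDsTable(sparkOnly))   // false
      }
    }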






[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-03 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131198143
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -413,7 +414,10 @@ private[hive] class HiveClientImpl(
         unsupportedFeatures += "partitioned view"
       }
 
-      val properties = Option(h.getParameters).map(_.asScala.toMap).orNull
+      val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())
+
+      val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
+        .orElse(Some(DDLUtils.HIVE_PROVIDER))
--- End diff --

> Maybe this is redundant.

This was definitely not redundant in my testing. When I debugged this, the provider was not set in the metadata loaded from the metastore in `HiveExternalCatalog.alterTableSchema`. In fact, the test I wrote fails if I remove this code (or comment out the line that sets "provider" a few lines below).

Perhaps some other part of the code sets it in a different code path, but that would make that other code redundant, not the other way around.





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131059637
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -413,7 +414,10 @@ private[hive] class HiveClientImpl(
         unsupportedFeatures += "partitioned view"
       }
 
-      val properties = Option(h.getParameters).map(_.asScala.toMap).orNull
+      val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())
+
+      val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
+        .orElse(Some(DDLUtils.HIVE_PROVIDER))
--- End diff --

Oh, never mind. It looks like we access the key `DATASOURCE_PROVIDER` in `table.properties` for that purpose, so this should be safe. In any case, we will set `provider` for `CatalogTable` later, when restoring the table read from the metastore.

Another concern: previously we didn't restore `provider` for a view, see https://github.com/apache/spark/blob/f18b905f6cace7686ef169fda7de474079d0af23/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L682. With this change, we will set `provider` to `HIVE_PROVIDER` for views as well.
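
A standalone sketch of one way to preserve the old behavior for views: skip the "hive" default when the object is a view. This illustrates the concern rather than the PR's code; `TableType` here is a stand-in for Spark's `CatalogTableType`:

    object ViewProviderSketch {
      sealed trait TableType
      case object View extends TableType
      case object Table extends TableType

      // Only default the provider for real tables; views previously had none.
      def restoreProvider(tpe: TableType, props: Map[String, String]): Option[String] =
        tpe match {
          case View => props.get("spark.sql.sources.provider")  // no "hive" fallback
          case _    => Some(props.getOrElse("spark.sql.sources.provider", "hive"))
        }

      def main(args: Array[String]): Unit = {
        println(restoreProvider(View, Map.empty))   // None, as before the change
        println(restoreProvider(Table, Map.empty))  // Some(hive)
      }
    }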







[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131056819
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -616,15 +616,24 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // Add table metadata such as table schema, partition columns, etc. to table properties.
     val updatedTable = withNewSchema.copy(
       properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
+
+    // If it's a data source table, make sure the original schema is left unchanged; the
+    // actual schema is recorded as a table property.
+    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
+      updatedTable.copy(schema = rawTable.schema)
--- End diff --

I just checked the JIRA description. This sounds like a bug we need to resolve. It is a bit complex to fix: we need to follow what we did for `create table`. cc @xwu0226 Please help @vanzin address this issue.





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-02 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131056558
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -616,15 +616,24 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // Add table metadata such as table schema, partition columns, etc. to table properties.
     val updatedTable = withNewSchema.copy(
       properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
+
+    // If it's a data source table, make sure the original schema is left unchanged; the
+    // actual schema is recorded as a table property.
+    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
+      updatedTable.copy(schema = rawTable.schema)
--- End diff --

We do support `ALTER TABLE ADD COLUMN`, which relies on `alterTableSchema`. Data source tables can be read by Hive when possible. Thus, I think we should not keep the schema unchanged.
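
For reference, the user-facing path in question, as a minimal example. This assumes a local SparkSession with Hive support, and the table and column names are made up; `ALTER TABLE ... ADD COLUMNS` is the statement that ends up calling `alterTableSchema`:

    import org.apache.spark.sql.SparkSession

    object AddColumnExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE TABLE t (a INT) USING parquet")
        // This alters the schema of a DS table in the Hive metastore.
        spark.sql("ALTER TABLE t ADD COLUMNS (b STRING)")
        spark.sql("DESCRIBE t").show()

        spark.stop()
      }
    }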





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131048805
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -413,7 +414,10 @@ private[hive] class HiveClientImpl(
         unsupportedFeatures += "partitioned view"
       }
 
-      val properties = Option(h.getParameters).map(_.asScala.toMap).orNull
+      val properties = Option(h.getParameters).map(_.asScala.toMap).getOrElse(Map())
+
+      val provider = properties.get(HiveExternalCatalog.DATASOURCE_PROVIDER)
+        .orElse(Some(DDLUtils.HIVE_PROVIDER))
--- End diff --

Previously we didn't store a provider for Hive serde tables. Some existing logic that decides whether a table retrieved from the metastore is a data source table may be broken by this change.
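
The existing logic is essentially a check on the provider field; a simplified sketch of `DDLUtils.isDatasourceTable` (a model, not a verbatim copy) is shown below:

    object DatasourceCheckSketch {
      // A table is treated as a data source table when its provider is set
      // and is not "hive". Defaulting every serde table's provider to "hive"
      // keeps this particular check working, but only if no caller relies on
      // the provider being absent.
      def isDatasourceTable(provider: Option[String]): Boolean =
        provider.exists(_ != "hive")

      def main(args: Array[String]): Unit = {
        println(isDatasourceTable(Some("parquet")))  // true
        println(isDatasourceTable(Some("hive")))     // false: serde table after this change
        println(isDatasourceTable(None))             // false: serde table before this change
      }
    }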





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-02 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131047358
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -616,15 +616,24 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // Add table metadata such as table schema, partition columns, etc. to table properties.
     val updatedTable = withNewSchema.copy(
       properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
+
+    // If it's a data source table, make sure the original schema is left unchanged; the
+    // actual schema is recorded as a table property.
+    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
+      updatedTable.copy(schema = rawTable.schema)
+    } else {
+      updatedTable
+    }
+
     try {
-      client.alterTable(updatedTable)
+      client.alterTable(tableToStore)
     } catch {
       case NonFatal(e) =>
         val warningMessage =
           s"Could not alter schema of table ${rawTable.identifier.quotedString} in a Hive " +
             "compatible way. Updating Hive metastore in Spark SQL specific format."
         logWarning(warningMessage, e)
-        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
+        client.alterTable(updatedTable.copy(schema = tableToStore.partitionSchema))
--- End diff --

I think this part is directly related to the logic that converts the table metadata to the Spark SQL specific format: https://github.com/apache/spark/blob/f18b905f6cace7686ef169fda7de474079d0af23/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L290-L301
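
For context, a minimal sketch of how the Spark-SQL-specific format keeps the real schema: it is serialized as JSON into table properties, split into chunks. The `spark.sql.sources.schema.numParts` / `spark.sql.sources.schema.part.N` keys match Spark 2.x, but the chunk size and wiring here are simplified:

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object SchemaPropsSketch {
      // Split a schema's JSON representation across table properties, as the
      // Spark-SQL-specific metastore format does. Simplified sketch.
      def schemaToProps(schema: StructType, partLen: Int = 4000): Map[String, String] = {
        val parts = schema.json.grouped(partLen).toSeq
        Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
          parts.zipWithIndex.map { case (part, i) =>
            s"spark.sql.sources.schema.part.$i" -> part
          }
      }

      def main(args: Array[String]): Unit = {
        val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
        schemaToProps(schema).foreach(println)
      }
    }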







[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18824#discussion_r131015612
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -616,15 +616,24 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     // Add table metadata such as table schema, partition columns, etc. to table properties.
     val updatedTable = withNewSchema.copy(
       properties = withNewSchema.properties ++ tableMetaToTableProps(withNewSchema))
+
+    // If it's a data source table, make sure the original schema is left unchanged; the
+    // actual schema is recorded as a table property.
+    val tableToStore = if (DDLUtils.isDatasourceTable(updatedTable)) {
+      updatedTable.copy(schema = rawTable.schema)
+    } else {
+      updatedTable
+    }
+
     try {
-      client.alterTable(updatedTable)
+      client.alterTable(tableToStore)
     } catch {
       case NonFatal(e) =>
         val warningMessage =
           s"Could not alter schema of table ${rawTable.identifier.quotedString} in a Hive " +
             "compatible way. Updating Hive metastore in Spark SQL specific format."
         logWarning(warningMessage, e)
-        client.alterTable(updatedTable.copy(schema = updatedTable.partitionSchema))
+        client.alterTable(updatedTable.copy(schema = tableToStore.partitionSchema))
--- End diff --

This is the exception handling code I mentioned in the bug report, which seems very suspicious. I was half tempted to just remove it, but maybe someone can explain why this code makes sense.





[GitHub] spark pull request #18824: [SPARK-21617][SQL] Store correct metadata in Hive...

2017-08-02 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/18824

[SPARK-21617][SQL] Store correct metadata in Hive for altered DS table.

This change fixes two issues:
- when loading table metadata from Hive, restore the "provider" field of
  CatalogTable so DS tables can be identified.
- when altering a DS table in the Hive metastore, make sure to not alter
  the table's schema, since the DS table's schema is stored as a table
  property in those cases.

Also added a new unit test for this issue which fails without this change.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-21617

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18824.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18824


commit aae3abd673adc7ff939d842e49d566fa722403a3
Author: Marcelo Vanzin 
Date:   2017-08-02T21:47:34Z

[SPARK-21617][SQL] Store correct metadata in Hive for altered DS table.

This change fixes two issues:
- when loading table metadata from Hive, restore the "provider" field of
  CatalogTable so DS tables can be identified.
- when altering a DS table in the Hive metastore, make sure to not alter
  the table's schema, since the DS table's schema is stored as a table
  property in those cases.

Also added a new unit test for this issue which fails without this change.



