[GitHub] spark pull request #15797: [SPARK-17990][SPARK-18302][SQL] correct several p...
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r99075673

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +825,43 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       newSpecs: Seq[TablePartitionSpec]): Unit = withClient {
     client.renamePartitions(
       db, table, specs.map(lowerCasePartitionSpec), newSpecs.map(lowerCasePartitionSpec))
+
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    // Hive metastore is not case preserving and keeps partition columns with lower cased names.
+    // When Hive rename partition for managed tables, it will create the partition location with
+    // a default path generate by the new spec with lower cased partition column names. This is
+    // unexpected and we need to rename them manually and alter the partition location.
+    val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
+    if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
+      val tablePath = new Path(tableMeta.location)
+      val newParts = newSpecs.map { spec =>
+        val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
+        val wrongPath = new Path(partition.location)
+        val rightPath = ExternalCatalogUtils.generatePartitionPath(
+          spec, partitionColumnNames, tablePath)
+        try {
+          tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)
--- End diff --

good catch! let me think about how to fix it.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r98327471

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +825,43 @@ [same diff hunk as quoted above, ending at the `tablePath.getFileSystem(hadoopConf).rename(wrongPath, rightPath)` line]
--- End diff --

Found an issue here... Not all file systems behave the same way when we call `rename`. For example, on macOS, renaming `.../tbl/partcol1=5/partcol2=6` to `.../tbl/partCol1=5/partCol2=6` results in `.../tbl/partcol1=5/partCol2=6`. However, the file system used on Jenkins does not have this issue.
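The macOS behavior reported above is typical of case-insensitive but case-preserving file systems, where a rename whose target differs from its source only in character case can silently leave parent directories untouched. One hypothetical workaround (not the fix Spark ultimately chose; sketched here with `java.nio.file` instead of Hadoop's `FileSystem` API, and only for a single leaf directory) is to rename through an intermediate name, so the final step never maps two spellings to the same entry:

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper: rename `src` to `dst` via an intermediate name, so the
// operation behaves identically on case-sensitive and case-insensitive
// (case-preserving) file systems. The temp name differs from `dst` in more
// than just case, so neither step is a case-only rename.
def renameViaTemp(src: Path, dst: Path): Unit = {
  val temp = src.resolveSibling(src.getFileName.toString + ".rename-tmp")
  Files.move(src, temp)
  Files.move(temp, dst)
}

val base = Files.createTempDirectory("rename-demo")
val src  = Files.createDirectory(base.resolve("partcol1=5"))
renameViaTemp(src, base.resolve("partCol1=5"))
```

A real fix in `HiveExternalCatalog` would additionally have to handle nested partition directories (every path component with a mixed-case column name), which this sketch does not attempt.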
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/15797
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r87356512

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala ---
@@ -231,7 +231,7 @@ class InMemoryCatalog(
     assert(tableMeta.storage.locationUri.isDefined,
       "Managed table should always have table location, as we will assign a default location " +
         "to it if it doesn't have one.")
-    val dir = new Path(tableMeta.storage.locationUri.get)
+    val dir = new Path(tableMeta.location)
     try {
       val fs = dir.getFileSystem(hadoopConfig)
       fs.delete(dir, true)
--- End diff --

It's minor and not related to this PR; I'd prefer not to touch it.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r87336535

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +825,44 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       newSpecs: Seq[TablePartitionSpec]): Unit = withClient {
     client.renamePartitions(
       db, table, specs.map(lowerCasePartitionSpec), newSpecs.map(lowerCasePartitionSpec))
+
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    // Hive metastore is not case preserving and keeps partition columns with lower cased names.
+    // When Hive rename partition for managed tables, it will create the partition location with
+    // a default path generate by the new spec with lower cased partition column names. This is
+    // unexpected and we need to rename them manually and alter the partition location.
+    val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
+    if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
+      val tablePath = new Path(tableMeta.location)
+      val fs = tablePath.getFileSystem(hadoopConf)
--- End diff --

[In the other functions](https://github.com/cloud-fan/spark/blob/56f3073162f0b53e4a3de686bba31c8eb6048123/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala#L367-L371), the `getFileSystem` call is also part of the try-catch block. Do we need to make this consistent with the others?
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r87141007

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +825,44 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       newSpecs: Seq[TablePartitionSpec]): Unit = withClient {
     client.renamePartitions(
       db, table, specs.map(lowerCasePartitionSpec), newSpecs.map(lowerCasePartitionSpec))
+
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    // Hive metastore is not case preserving and keeps partition columns with lower cased names.
+    // When Hive rename partition for managed tables, it will create the partition location with
+    // a default path generate by the new spec with lower cased partition column names. This is
+    // unexpected and we need to rename them manually and alter the partition location.
+    val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
+    if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
+      val tablePath = new Path(tableMeta.location)
+      val fs = tablePath.getFileSystem(hadoopConf)
+      val newParts = newSpecs.map { spec =>
+        val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
+        val wrongPath = new Path(partition.storage.locationUri.get)
+        val rightPath = ExternalCatalogUtils.generatePartitionPath(
+          spec, partitionColumnNames, tablePath)
+        try {
+          fs.rename(wrongPath, rightPath)
+        } catch {
+          case e: IOException =>
+            throw new SparkException(s"Unable to rename partition path $wrongPath", e)
--- End diff --

Maybe we should also include `rightPath` here.
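For reference, a minimal sketch of the suggested change: including both the source and target paths in the error message so a failed rename is easier to diagnose. The helper name and wording here are hypothetical, not the final Spark text:

```scala
// Hypothetical message builder, illustrating the suggestion above: report
// both the path we tried to move and the path we tried to move it to.
def renameErrorMessage(wrongPath: String, rightPath: String): String =
  s"Unable to rename partition path from $wrongPath to $rightPath"
```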
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86946976

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala ---
@@ -591,25 +673,25 @@ abstract class ExternalCatalogSuite extends SparkFunSuite with BeforeAndAfterEac
     val catalog = newBasicCatalog()
     val db = catalog.getDatabase("db1")
     val table = CatalogTable(
-      identifier = TableIdentifier("my_table", Some("db1")),
--- End diff --

Actually this test was written by me. I picked `myTable` at first, but changed it to `my_table` because of this bug.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86936905

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala ---
@@ -171,6 +171,10 @@ case class CatalogTable(
     throw new AnalysisException(s"table $identifier did not specify database")
   }

+  def location: String = storage.locationUri.getOrElse {
--- End diff --

Like the other functions in this file, could you add a function description too?
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86936775

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala ---
@@ -591,25 +673,25 @@ [same diff hunk as quoted above, removing the `identifier = TableIdentifier("my_table", Some("db1"))` line]
--- End diff --

Could we keep the original names unchanged? Table/database names currently allow alphanumeric characters and underscores, so I assume they were named with underscores on purpose?
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86935972

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -451,51 +444,6 @@ object PartitioningUtils {
     }
   }

-  //
-  // The following string escaping code is mainly copied from Hive (o.a.h.h.common.FileUtils).
-  //
-
-  val charToEscape = {
-    val bitSet = new java.util.BitSet(128)
-
-    /**
-     * ASCII 01-1F are HTTP control characters that need to be escaped.
-     * \u000A and \u000D are \n and \r, respectively.
-     */
-    val clist = Array(
-      '\u0001', '\u0002', '\u0003', '\u0004', '\u0005', '\u0006', '\u0007', '\u0008', '\u0009',
-      '\n', '\u000B', '\u000C', '\r', '\u000E', '\u000F', '\u0010', '\u0011', '\u0012', '\u0013',
-      '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019', '\u001A', '\u001B', '\u001C',
-      '\u001D', '\u001E', '\u001F', '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\u007F',
-      '{', '[', ']', '^')
-
-    clist.foreach(bitSet.set(_))
-
-    if (Shell.WINDOWS) {
-      Array(' ', '<', '>', '|').foreach(bitSet.set(_))
-    }
-
-    bitSet
-  }
-
-  def needsEscaping(c: Char): Boolean = {
-    c >= 0 && c < charToEscape.size() && charToEscape.get(c)
-  }
-
-  def escapePathName(path: String): String = {
-    val builder = new StringBuilder()
-    path.foreach { c =>
-      if (needsEscaping(c)) {
-        builder.append('%')
-        builder.append(f"${c.asInstanceOf[Int]}%02X")
-      } else {
-        builder.append(c)
-      }
-    }
-
-    builder.toString()
-  }
-
   def unescapePathName(path: String): String = {
--- End diff --

If `escapePathName` is moved, I think we also need to move `unescapePathName` to the same file?
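Since the review suggests keeping `escapePathName` and `unescapePathName` together, here is a self-contained sketch of the pair. The escaping side follows the code quoted in the diff above (minus the Windows-only characters, which need `Shell.WINDOWS` from Hadoop); the `unescapePathName` body is cut off in the diff, so the inverse below is a plausible reconstruction, not Spark's exact implementation:

```scala
// Characters unsafe in partition directory names, per the diff above
// (originally copied from Hive's o.a.h.h.common.FileUtils).
val charToEscape: java.util.BitSet = {
  val bitSet = new java.util.BitSet(128)
  // ASCII 01-1F are control characters that need to be escaped;
  // \u000A and \u000D are \n and \r, respectively.
  val clist = Array(
    '\u0001', '\u0002', '\u0003', '\u0004', '\u0005', '\u0006', '\u0007', '\u0008', '\u0009',
    '\n', '\u000B', '\u000C', '\r', '\u000E', '\u000F', '\u0010', '\u0011', '\u0012', '\u0013',
    '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019', '\u001A', '\u001B', '\u001C',
    '\u001D', '\u001E', '\u001F', '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\u007F',
    '{', '[', ']', '^')
  clist.foreach(bitSet.set(_))
  bitSet
}

def needsEscaping(c: Char): Boolean =
  c >= 0 && c < charToEscape.size() && charToEscape.get(c)

// Percent-escape unsafe characters, e.g. "a/b" becomes "a%2Fb".
def escapePathName(path: String): String = {
  val builder = new StringBuilder()
  path.foreach { c =>
    if (needsEscaping(c)) {
      builder.append('%')
      builder.append(f"${c.asInstanceOf[Int]}%02X")
    } else {
      builder.append(c)
    }
  }
  builder.toString()
}

// Reconstructed inverse: turn each %XX sequence back into its character,
// leaving malformed sequences untouched.
def unescapePathName(path: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < path.length) {
    val c = path.charAt(i)
    if (c == '%' && i + 2 < path.length) {
      val code =
        try Integer.parseInt(path.substring(i + 1, i + 3), 16)
        catch { case _: NumberFormatException => -1 }
      if (code >= 0) { sb.append(code.toChar); i += 3 }
      else { sb.append(c); i += 1 }
    } else {
      sb.append(c); i += 1
    }
  }
  sb.toString()
}
```

The two functions are each other's inverse, which is the practical reason for moving them together: a caller that escapes with one file's version and unescapes with another's risks subtle drift in the escape set.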
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86908704

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala ---
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.catalog
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.util.Shell
+
+import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
+
+object ExternalCatalogUtils {
--- End diff --

@ericl `PartitioningUtils` is not accessible in the catalyst module, so I created this.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86906444

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +840,37 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       newSpecs: Seq[TablePartitionSpec]): Unit = withClient {
     client.renamePartitions(
       db, table, specs.map(lowerCasePartitionSpec), newSpecs.map(lowerCasePartitionSpec))
+
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    // Hive metastore is not case preserving and keeps partition columns with lower cased names.
+    // When Hive rename partition for managed tables, it will update the partition location with
+    // a default value generate by the new spec and lower cased partition column names. This is
+    // unexpected and we need to call `alterPartitions` to fix the partition location.
+    val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => col.toLowerCase != col)
+    if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
+      val tableLocation = tableMeta.storage.locationUri.get
+      val newParts = newSpecs.map { spec =>
+        val partition = client.getPartition(db, table, lowerCasePartitionSpec(spec))
+        val partitionPath = new Path(
+          tableLocation,
+          partitionColumnNames.map(col => col + "=" + spec(col)).mkString("/"))
--- End diff --

RENAME PARTITION can only change the partition spec, not the partition path. The SQL syntax is `ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;`; users can't specify a partition location.
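The default path being discussed is simply one `column=value` directory per partition column, in partition-column order, appended under the table location. A minimal string-based sketch (illustrative only: the real `ExternalCatalogUtils.generatePartitionPath` works on Hadoop `Path`s and also escapes unsafe characters in names and values):

```scala
// Minimal sketch of default partition-path generation. The column names keep
// whatever case the spec uses, which is exactly what the non-case-preserving
// Hive metastore loses when it builds this path itself.
def generatePartitionPath(
    spec: Map[String, String],
    partitionColumnNames: Seq[String],
    tablePath: String): String = {
  val relativePath = partitionColumnNames.map(col => col + "=" + spec(col)).mkString("/")
  tablePath + "/" + relativePath
}
```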
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86906239

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +840,37 @@ [same diff hunk as quoted above, through the `if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {` line]
--- End diff --

No, Hive only updates the partition location when renaming partitions for a managed table.
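For intuition, the two checks that gate this code path can be sketched as plain functions. These are assumed shapes for illustration; in Spark, `lowerCasePartitionSpec` is a private helper of `HiveExternalCatalog` and the upper-case check is inlined:

```scala
// Lower-case a partition spec's keys to match what the non-case-preserving
// Hive metastore actually stores.
def lowerCasePartitionSpec(spec: Map[String, String]): Map[String, String] =
  spec.map { case (k, v) => (k.toLowerCase, v) }

// The location fix-up is only needed when at least one partition column name
// contains an upper-case character; all-lower-case names round-trip through
// the metastore unchanged.
def hasUpperCasePartitionColumn(partitionColumnNames: Seq[String]): Boolean =
  partitionColumnNames.exists(col => col.toLowerCase != col)
```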
Github user ericl commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86866556

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala ---
@@ -355,25 +355,30 @@ class InMemoryCatalog(
       }
     }
-    val tableDir = new Path(catalog(db).db.locationUri, table)
-    val partitionColumnNames = getTable(db, table).partitionColumnNames
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    val tableLocation = new Path(tableMeta.storage.locationUri.get)
     // TODO: we should follow hive to roll back if one partition path failed to create.
     parts.foreach { p =>
-      // If location is set, the partition is using an external partition location and we don't
-      // need to handle its directory.
-      if (p.storage.locationUri.isEmpty) {
-        val partitionPath = partitionColumnNames.flatMap { col =>
-          p.spec.get(col).map(col + "=" + _)
-        }.mkString("/")
-        try {
-          val fs = tableDir.getFileSystem(hadoopConfig)
-          fs.mkdirs(new Path(tableDir, partitionPath))
-        } catch {
-          case e: IOException =>
-            throw new SparkException(s"Unable to create partition path $partitionPath", e)
+      val partitionLocation = p.storage.locationUri.getOrElse {
+        val partitionPath = partitionColumnNames.map(col => col + "=" + p.spec(col)).mkString("/")
--- End diff --

Do we need to escape these, as in `PartitioningUtils.escapePathName`?
Github user ericl commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86872012

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +840,37 @@ [same diff hunk as quoted above, ending at the `partitionColumnNames.map(col => col + "=" + spec(col)).mkString("/"))` line]
--- End diff --

What if it had a custom location?
Github user ericl commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86868525

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala ---
@@ -423,33 +424,35 @@ class InMemoryCatalog(
     requirePartitionsExist(db, table, specs)
     requirePartitionsNotExist(db, table, newSpecs)
-    val tableDir = new Path(catalog(db).db.locationUri, table)
-    val partitionColumnNames = getTable(db, table).partitionColumnNames
+    val tableMeta = getTable(db, table)
+    val partitionColumnNames = tableMeta.partitionColumnNames
+    val tableLocation = new Path(tableMeta.storage.locationUri.get)
+    val shouldUpdatePartitionLocation = getTable(db, table).tableType == CatalogTableType.MANAGED
+    val existingParts = catalog(db).tables(table).partitions
     // TODO: we should follow hive to roll back if one partition path failed to rename.
     specs.zip(newSpecs).foreach { case (oldSpec, newSpec) =>
-      val newPart = getPartition(db, table, oldSpec).copy(spec = newSpec)
-      val existingParts = catalog(db).tables(table).partitions
-
-      // If location is set, the partition is using an external partition location and we don't
-      // need to handle its directory.
-      if (newPart.storage.locationUri.isEmpty) {
-        val oldPath = partitionColumnNames.flatMap { col =>
-          oldSpec.get(col).map(col + "=" + _)
-        }.mkString("/")
-        val newPath = partitionColumnNames.flatMap { col =>
-          newSpec.get(col).map(col + "=" + _)
-        }.mkString("/")
+      val oldPartition = getPartition(db, table, oldSpec)
+      val newPartition = if (shouldUpdatePartitionLocation) {
+        val oldPartPath = new Path(oldPartition.storage.locationUri.get)
+        val newPartPath = new Path(
+          tableLocation,
+          partitionColumnNames.map(col => col + "=" + newSpec(col)).mkString("/"))
--- End diff --

Should we pull this into PartitioningUtils?
Github user ericl commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86871542

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -810,13 +840,37 @@ [same diff hunk as quoted above, through the `if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {` line]
--- End diff --

Would this apply to external tables as well?
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/15797#discussion_r86806933

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -226,106 +208,139 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       } else {
         tableDefinition.storage.locationUri
       }
-      // Ideally we should also put `locationUri` in table properties like provider, schema, etc.
-      // However, in older version of Spark we already store table location in storage properties
-      // with key "path". Here we keep this behaviour for backward compatibility.
-      val storagePropsWithLocation = tableDefinition.storage.properties ++
-        tableLocation.map("path" -> _)
-
-      // converts the table metadata to Spark SQL specific format, i.e. set data schema, names and
-      // bucket specification to empty. Note that partition columns are retained, so that we can
-      // call partition-related Hive API later.
-      def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
-        tableDefinition.copy(
-          // Hive only allows directory paths as location URIs while Spark SQL data source tables
-          // also allow file paths. For non-hive-compatible format, we should not set location URI
-          // to avoid hive metastore to throw exception.
-          storage = tableDefinition.storage.copy(
-            locationUri = None,
-            properties = storagePropsWithLocation),
-          schema = tableDefinition.partitionSchema,
-          bucketSpec = None,
-          properties = tableDefinition.properties ++ tableProperties)
+
+      if (tableDefinition.provider.get == DDLUtils.HIVE_PROVIDER) {
+        val tableWithDataSourceProps = tableDefinition.copy(
+          // We can't leave `locationUri` empty and count on Hive metastore to set a default table
+          // location, because Hive metastore is not case preserving and the table name is always
+          // lower cased when appear in the default table path, which is not expected.
+          storage = tableDefinition.storage.copy(locationUri = tableLocation),
+          // Here we follow data source tables and put table metadata like provider, schema, etc. in
+          // table properties, so that we can work around the Hive metastore issue about not case
+          // preserving and make Hive serde table support mixed-case column names.
+          properties = tableDefinition.properties ++ tableMetaToTableProps(tableDefinition))
+        client.createTable(tableWithDataSourceProps, ignoreIfExists)
+      } else {
+        createDataSourceTable(
+          tableDefinition.withNewStorage(locationUri = tableLocation),
+          ignoreIfExists)
       }
+    }
+  }

-      // converts the table metadata to Hive compatible format, i.e. set the serde information.
-      def newHiveCompatibleMetastoreTable(serde: HiveSerDe): CatalogTable = {
-        val location = if (tableDefinition.tableType == EXTERNAL) {
-          // When we hit this branch, we are saving an external data source table with hive
-          // compatible format, which means the data source is file-based and must have a `path`.
-          require(tableDefinition.storage.locationUri.isDefined,
-            "External file-based data source table must have a `path` entry in storage properties.")
-          Some(new Path(tableDefinition.storage.locationUri.get).toUri.toString)
-        } else {
-          None
-        }
+  private def createDataSourceTable(
--- End diff --

This method just copies the code in the above if branch: https://github.com/apache/spark/pull/15797/files#diff-159191585e10542f013cb3a714f26075R199

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org