spark git commit: [SPARK-15431][SQL][HOTFIX] ignore 'list' command testcase from CliSuite for now
Repository: spark Updated Branches: refs/heads/branch-2.0 b3845fede -> b430aa98c [SPARK-15431][SQL][HOTFIX] ignore 'list' command testcase from CliSuite for now ## What changes were proposed in this pull request? The test cases for `list` command added in `CliSuite` by PR #13212 can not run in some jenkins jobs after being merged. However, some jenkins jobs can pass: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/ Others failed on this test case. But the failures on those jobs are at slightly different checkpoints among different jobs too. So it seems that CliSuite's output capture is flaky for list commands to check for expected output. There are test cases already in `HiveQuerySuite` and `SparkContextSuite` to cover the cases. So I am ignoring 2 test cases added by PR #13212 . Author: Xin WuCloses #13276 from xwu0226/SPARK-15431-clisuite. (cherry picked from commit 6f95c6c030db0057de213733c2bd3453463bc6f2) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b430aa98 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b430aa98 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b430aa98 Branch: refs/heads/branch-2.0 Commit: b430aa98caa16978cd53dd354423cac45410c284 Parents: b3845fe Author: Xin Wu Authored: Fri May 27 08:54:14 2016 -0700 Committer: Yin Huai Committed: Fri May 27 08:54:54 2016 -0700 -- .../scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b430aa98/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 2bf0221..656fe97 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala @@ -239,7 +239,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { "" -> "This is a test for Spark-11624") } - test("list jars") { + ignore("list jars") { val jarFile = Thread.currentThread().getContextClassLoader.getResource("TestUDTF.jar") runCliWithin(2.minute)( s"ADD JAR $jarFile" -> "", @@ -248,7 +248,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { ) } - test("list files") { + ignore("list files") { val dataFilePath = Thread.currentThread().getContextClassLoader .getResource("data/files/small_kv.txt") runCliWithin(2.minute)( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
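For reference, a minimal ScalaTest sketch of the test-to-ignore switch used in this hotfix (the suite and helper below are illustrative, not the actual CliSuite code): replacing test(...) with ignore(...) keeps the body compiling and visible in reports as ignored, but skips execution, which is how the flaky cases are parked without failing builds.

    import org.scalatest.FunSuite

    class ExampleCliSuite extends FunSuite {
      test("runs normally") {
        assert(1 + 1 == 2)
      }

      // Reported as ignored instead of executed; the body is still type-checked.
      ignore("list jars (temporarily disabled as flaky)") {
        assert(fakeCliOutput().contains("TestUDTF.jar"))
      }

      // Hypothetical stand-in for the CLI output capture that proved flaky.
      private def fakeCliOutput(): String = "TestUDTF.jar"
    }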
[GitHub] spark pull request: [SPARK-15431][SQL][HOTFIX] ignore 'list' comma...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13276#issuecomment-222183974 Merging to master and branch 2.0.
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r64926605 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala --- @@ -238,4 +238,23 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { runCliWithin(2.minute, Seq("-e", "!echo \"This is a test for Spark-11624\";"))( "" -> "This is a test for Spark-11624") } + + test("list jars") { +val jarFile = Thread.currentThread().getContextClassLoader.getResource("TestUDTF.jar") +runCliWithin(2.minute)( + s"ADD JAR $jarFile" -> "", + s"LIST JARS" -> "TestUDTF.jar", + s"List JAR $jarFile" -> "TestUDTF.jar" +) + } + + test("list files") { +val dataFilePath = Thread.currentThread().getContextClassLoader + .getResource("data/files/small_kv.txt") +runCliWithin(2.minute)( + s"ADD FILE $dataFilePath" -> "", + s"LIST FILES" -> "small_kv.txt", + s"LIST FILE $dataFilePath" -> "small_kv.txt" +) + } --- End diff -- Seems it fails all maven builds. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-15431][SQL][HOTFIX] ignore 'list' command testcase from CliSuite for now
Repository: spark Updated Branches: refs/heads/master d5911d117 -> 6f95c6c03 [SPARK-15431][SQL][HOTFIX] ignore 'list' command testcase from CliSuite for now ## What changes were proposed in this pull request? The test cases for `list` command added in `CliSuite` by PR #13212 can not run in some jenkins jobs after being merged. However, some jenkins jobs can pass: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/ Others failed on this test case. But the failures on those jobs are at slightly different checkpoints among different jobs too. So it seems that CliSuite's output capture is flaky for list commands to check for expected output. There are test cases already in `HiveQuerySuite` and `SparkContextSuite` to cover the cases. So I am ignoring 2 test cases added by PR #13212 . Author: Xin WuCloses #13276 from xwu0226/SPARK-15431-clisuite. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6f95c6c0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6f95c6c0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6f95c6c0 Branch: refs/heads/master Commit: 6f95c6c030db0057de213733c2bd3453463bc6f2 Parents: d5911d1 Author: Xin Wu Authored: Fri May 27 08:54:14 2016 -0700 Committer: Yin Huai Committed: Fri May 27 08:54:14 2016 -0700 -- .../scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6f95c6c0/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 2bf0221..656fe97 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++ b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala @@ -239,7 +239,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { "" -> "This is a test for Spark-11624") } - test("list jars") { + ignore("list jars") { val jarFile = Thread.currentThread().getContextClassLoader.getResource("TestUDTF.jar") runCliWithin(2.minute)( s"ADD JAR $jarFile" -> "", @@ -248,7 +248,7 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { ) } - test("list files") { + ignore("list files") { val dataFilePath = Thread.currentThread().getContextClassLoader .getResource("data/files/small_kv.txt") runCliWithin(2.minute)( - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15431][SQL][HOTFIX] ignore 'list' comma...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13276#issuecomment-222183379 It causes master builds to fail. I am merging it. Is the bug in the implementation or in our test code?
[GitHub] spark pull request: [SPARK-15269][SQL] Removes unexpected empty ta...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13270#discussion_r64922718 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -212,11 +212,46 @@ class SessionCatalog( * If no such database is specified, create it in the current database. */ def createTable(tableDefinition: CatalogTable, ignoreIfExists: Boolean): Unit = { -val db = formatDatabaseName(tableDefinition.identifier.database.getOrElse(getCurrentDatabase)) -val table = formatTableName(tableDefinition.identifier.table) +val tableId = tableDefinition.identifier +val db = formatDatabaseName(tableId.database.getOrElse(getCurrentDatabase)) +val table = formatTableName(tableId.table) val newTableDefinition = tableDefinition.copy(identifier = TableIdentifier(table, Some(db))) requireDbExists(db) -externalCatalog.createTable(db, newTableDefinition, ignoreIfExists) + +if ( + // If this is an external data source table... + tableDefinition.properties.contains("spark.sql.sources.provider") && + newTableDefinition.tableType == CatalogTableType.EXTERNAL && + // ... that is not persisted as Hive compatible format (external tables in Hive compatible + // format always set `locationUri` to the actual data location and should NOT be hacked as + // following.) + tableDefinition.storage.locationUri.isEmpty +) { + // !! HACK ALERT !! + // + // Due to a restriction of Hive metastore, here we have to set `locationUri` to a temporary + // directory that doesn't exist yet but can definitely be successfully created, and then + // delete it right after creating the external data source table. This location will be + // persisted to Hive metastore as standard Hive table location URI, but Spark SQL doesn't + // really use it. Also, since we only do this workaround for external tables, deleting the + // directory after the fact doesn't do any harm. + // + // Please refer to https://issues.apache.org/jira/browse/SPARK-15269 for more details. + + val tempPath = +new Path(defaultTablePath(tableId).stripSuffix(Path.SEPARATOR) + "-__PLACEHOLDER__") + + try { +externalCatalog.createTable( + db, + newTableDefinition.withNewStorage(locationUri = Some(tempPath.toString)), + ignoreIfExists) + } finally { +FileSystem.get(tempPath.toUri, hadoopConf).delete(tempPath, true) + } +} else { + externalCatalog.createTable(db, newTableDefinition, ignoreIfExists) +} --- End diff -- I think Hive-specific details/hacks should not be exposed in `SessionCatalog`. Let's move it into `HiveExternalCatalog`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15565] [SQL] Add the File Scheme to the...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13348#issuecomment-222065198 I created the jira because of https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508. Since `System.getProperty("user.dir")` is pointing to a local dir, it seems to make sense to explicitly set the scheme. It may introduce confusion if we add the default fs scheme to the path generated by `System.getProperty("user.dir")`.
[GitHub] spark pull request: [SPARK-15565] [SQL] Add the File Scheme to the...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13348#discussion_r64858995 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -55,7 +56,7 @@ object SQLConf { val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir") .doc("The default location for managed databases and tables.") .stringConf -.createWithDefault("${system:user.dir}/spark-warehouse") +.createWithDefault("${system:user.dir}" + File.separator + "spark-warehouse") --- End diff -- Why not just change the default value to `file:/${system:user.dir}/spark-warehouse`? I do not really understand why we have to change `def warehousePath`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15565] [SQL] Add the File Scheme to the...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13348#discussion_r64858379 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -665,7 +666,8 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging { def variableSubstituteDepth: Int = getConf(VARIABLE_SUBSTITUTE_DEPTH) def warehousePath: String = { -getConf(WAREHOUSE_PATH).replace("${system:user.dir}", System.getProperty("user.dir")) +getConf(WAREHOUSE_PATH).replace("${system:user.dir}", + new File(System.getProperty("user.dir")).toURI.toURL.toString).replace("//", "/") --- End diff -- When do we need `.replace("//", "/")`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
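As an illustrative aside on the two review questions above (this is not the PR's code): java.io.File.toURI already normalizes separators and prepends the file scheme, so a scheme-qualified default can be built without either the manual "//" -> "/" replacement or File.separator concatenation.

    import java.io.File

    object WarehouseDefaultExample {
      def main(args: Array[String]): Unit = {
        val default = new File(System.getProperty("user.dir"), "spark-warehouse").toURI.toString
        // e.g. "file:/home/user/project/spark-warehouse" on Linux,
        //      "file:/C:/Users/user/project/spark-warehouse" on Windows
        println(default)
      }
    }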
[GitHub] spark pull request: [SPARK-15594][SQL] ALTER TABLE SERDEPROPERTIES...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13343#discussion_r64856240 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -871,6 +879,58 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach { } } + private def testSetSerdePartition(isDatasourceTable: Boolean): Unit = { +val catalog = spark.sessionState.catalog +val tableIdent = TableIdentifier("tab1", Some("dbx")) +val spec = Map("a" -> "1", "b" -> "2") +createDatabase(catalog, "dbx") +createTable(catalog, tableIdent) +createTablePartition(catalog, spec, tableIdent) +createTablePartition(catalog, Map("a" -> "1", "b" -> "3"), tableIdent) +createTablePartition(catalog, Map("a" -> "2", "b" -> "2"), tableIdent) +createTablePartition(catalog, Map("a" -> "2", "b" -> "3"), tableIdent) +if (isDatasourceTable) { + convertToDatasourceTable(catalog, tableIdent) +} +assert(catalog.getPartition(tableIdent, spec).storage.serde.isEmpty) +assert(catalog.getPartition(tableIdent, spec).storage.serdeProperties.isEmpty) +// set table serde and/or properties (should fail on datasource tables) +if (isDatasourceTable) { + val e1 = intercept[AnalysisException] { +sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'whatever'") + } + val e2 = intercept[AnalysisException] { +sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'org.apache.madoop' " + + "WITH SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')") + } + assert(e1.getMessage.contains("datasource")) + assert(e2.getMessage.contains("datasource")) +} else { + sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'org.apache.jadoop'") + assert(catalog.getPartition(tableIdent, spec).storage.serde == Some("org.apache.jadoop")) + assert(catalog.getPartition(tableIdent, spec).storage.serdeProperties.isEmpty) + sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'org.apache.madoop' " + +"WITH SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')") + assert(catalog.getPartition(tableIdent, spec).storage.serde == Some("org.apache.madoop")) + assert(catalog.getPartition(tableIdent, spec).storage.serdeProperties == +Map("k" -> "v", "kay" -> "vee")) +} +// set serde properties only +sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) " + + "SET SERDEPROPERTIES ('k' = 'vvv', 'kay' = 'vee')") --- End diff -- For data source tables, seems we do not store any partition specs. I am wondering why metastore does not complain? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15565] [SQL] Add the File Scheme to the...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13348#discussion_r64853158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -55,7 +56,7 @@ object SQLConf { val WAREHOUSE_PATH = SQLConfigBuilder("spark.sql.warehouse.dir") .doc("The default location for managed databases and tables.") .stringConf -.createWithDefault("${system:user.dir}/spark-warehouse") +.createWithDefault("${system:user.dir}" + File.separator + "spark-warehouse") --- End diff -- Although the path points to a local dir, this path is actually a hadoop filesystem path. So, I am wondering if we need to use `File.separator` at here? If we are running on Windows, does a hadoop filesystem path uses `\` as the separator? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-15583][SQL] Disallow altering datasource properties
Repository: spark Updated Branches: refs/heads/branch-2.0 702755f92 -> 8e26b74fc [SPARK-15583][SQL] Disallow altering datasource properties ## What changes were proposed in this pull request? Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to (1) Create a Hive table setting these properties (2) Alter these properties in an existing table Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables. ## How was this patch tested? DDLSuite Author: Andrew OrCloses #13341 from andrewor14/alter-table-props. (cherry picked from commit 3fca635b4ed322208debcd89a539e42cdde6bbd4) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8e26b74f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8e26b74f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8e26b74f Branch: refs/heads/branch-2.0 Commit: 8e26b74fccc8e7b52db1011f6d6e295c6ba0c5aa Parents: 702755f Author: Andrew Or Authored: Thu May 26 20:11:09 2016 -0700 Committer: Yin Huai Committed: Thu May 26 20:11:19 2016 -0700 -- .../command/createDataSourceTables.scala| 17 +++ .../spark/sql/execution/command/ddl.scala | 37 +++-- .../spark/sql/execution/command/tables.scala| 2 + .../spark/sql/execution/command/DDLSuite.scala | 148 --- .../spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 5 files changed, 139 insertions(+), 67 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8e26b74f/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala index 6ca66a2..deedb68 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala @@ -255,6 +255,23 @@ case class CreateDataSourceTableAsSelectCommand( object CreateDataSourceTableUtils extends Logging { + + // TODO: Actually replace usages with these variables (SPARK-15584) + + val DATASOURCE_PREFIX = "spark.sql.sources." + val DATASOURCE_PROVIDER = DATASOURCE_PREFIX + "provider" + val DATASOURCE_WRITEJOBUUID = DATASOURCE_PREFIX + "writeJobUUID" + val DATASOURCE_OUTPUTPATH = DATASOURCE_PREFIX + "output.path" + val DATASOURCE_SCHEMA_PREFIX = DATASOURCE_PREFIX + "schema." + val DATASOURCE_SCHEMA_NUMPARTS = DATASOURCE_SCHEMA_PREFIX + "numParts" + val DATASOURCE_SCHEMA_NUMPARTCOLS = DATASOURCE_SCHEMA_PREFIX + "numPartCols" + val DATASOURCE_SCHEMA_NUMBUCKETS = DATASOURCE_SCHEMA_PREFIX + "numBuckets" + val DATASOURCE_SCHEMA_NUMBUCKETCOLS = DATASOURCE_SCHEMA_PREFIX + "numBucketCols" + val DATASOURCE_SCHEMA_PART_PREFIX = DATASOURCE_SCHEMA_PREFIX + "part." + val DATASOURCE_SCHEMA_PARTCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "partCol." + val DATASOURCE_SCHEMA_BUCKETCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "bucketCol." + val DATASOURCE_SCHEMA_SORTCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "sortCol." + /** * Checks if the given name conforms the Hive standard ("[a-zA-z_0-9]+"), * i.e. if this name only contains characters, numbers, and _. 
http://git-wip-us.apache.org/repos/asf/spark/blob/8e26b74f/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala index 7ce7bb9..15eba3b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala @@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogTable} import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, CatalogTableType, SessionCatalog} import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} +import org.apache.spark.sql.execution.command.CreateDataSourceTableUtils.DATASOURCE_PREFIX import org.apache.spark.sql.execution.datasources.BucketSpec
spark git commit: [SPARK-15583][SQL] Disallow altering datasource properties
Repository: spark Updated Branches: refs/heads/master 6ab973ec5 -> 3fca635b4 [SPARK-15583][SQL] Disallow altering datasource properties ## What changes were proposed in this pull request? Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to (1) Create a Hive table setting these properties (2) Alter these properties in an existing table Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables. ## How was this patch tested? DDLSuite Author: Andrew OrCloses #13341 from andrewor14/alter-table-props. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3fca635b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3fca635b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3fca635b Branch: refs/heads/master Commit: 3fca635b4ed322208debcd89a539e42cdde6bbd4 Parents: 6ab973e Author: Andrew Or Authored: Thu May 26 20:11:09 2016 -0700 Committer: Yin Huai Committed: Thu May 26 20:11:09 2016 -0700 -- .../command/createDataSourceTables.scala| 17 +++ .../spark/sql/execution/command/ddl.scala | 37 +++-- .../spark/sql/execution/command/tables.scala| 2 + .../spark/sql/execution/command/DDLSuite.scala | 148 --- .../spark/sql/hive/execution/HiveDDLSuite.scala | 2 +- 5 files changed, 139 insertions(+), 67 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3fca635b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala index 6ca66a2..deedb68 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala @@ -255,6 +255,23 @@ case class CreateDataSourceTableAsSelectCommand( object CreateDataSourceTableUtils extends Logging { + + // TODO: Actually replace usages with these variables (SPARK-15584) + + val DATASOURCE_PREFIX = "spark.sql.sources." + val DATASOURCE_PROVIDER = DATASOURCE_PREFIX + "provider" + val DATASOURCE_WRITEJOBUUID = DATASOURCE_PREFIX + "writeJobUUID" + val DATASOURCE_OUTPUTPATH = DATASOURCE_PREFIX + "output.path" + val DATASOURCE_SCHEMA_PREFIX = DATASOURCE_PREFIX + "schema." + val DATASOURCE_SCHEMA_NUMPARTS = DATASOURCE_SCHEMA_PREFIX + "numParts" + val DATASOURCE_SCHEMA_NUMPARTCOLS = DATASOURCE_SCHEMA_PREFIX + "numPartCols" + val DATASOURCE_SCHEMA_NUMBUCKETS = DATASOURCE_SCHEMA_PREFIX + "numBuckets" + val DATASOURCE_SCHEMA_NUMBUCKETCOLS = DATASOURCE_SCHEMA_PREFIX + "numBucketCols" + val DATASOURCE_SCHEMA_PART_PREFIX = DATASOURCE_SCHEMA_PREFIX + "part." + val DATASOURCE_SCHEMA_PARTCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "partCol." + val DATASOURCE_SCHEMA_BUCKETCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "bucketCol." + val DATASOURCE_SCHEMA_SORTCOL_PREFIX = DATASOURCE_SCHEMA_PREFIX + "sortCol." + /** * Checks if the given name conforms the Hive standard ("[a-zA-z_0-9]+"), * i.e. if this name only contains characters, numbers, and _. 
http://git-wip-us.apache.org/repos/asf/spark/blob/3fca635b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala index 7ce7bb9..15eba3b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala @@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, CatalogTable} import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, CatalogTableType, SessionCatalog} import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} +import org.apache.spark.sql.execution.command.CreateDataSourceTableUtils.DATASOURCE_PREFIX import org.apache.spark.sql.execution.datasources.BucketSpec import org.apache.spark.sql.types._ @@ -228,15 +229,13 @@ case class AlterTableSetPropertiesCommand( extends
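A rough, simplified sketch of the kind of guard this change introduces (the object, method, and exception below are illustrative, not the exact Spark code): user-supplied table or SerDe properties under the reserved spark.sql.sources. namespace are rejected up front.

    object ReservedProperties {
      val DATASOURCE_PREFIX = "spark.sql.sources."

      // Throws if any user-supplied property falls in the reserved datasource namespace.
      def verify(properties: Map[String, String]): Unit = {
        val reserved = properties.keys.filter(_.startsWith(DATASOURCE_PREFIX))
        if (reserved.nonEmpty) {
          throw new IllegalArgumentException(
            s"Cannot set or alter reserved properties: ${reserved.mkString(", ")}")
        }
      }
    }

    // Example: ReservedProperties.verify(Map("spark.sql.sources.provider" -> "parquet")) throws.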
[GitHub] spark pull request: [SPARK-15583][SQL] Disallow altering datasourc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13341#issuecomment-222048281 Thanks. Merging to master and branch 2.0.
[GitHub] spark pull request: [SPARK-8428][SPARK-13850] Fix integer overflow...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13336#issuecomment-222045290 sorry. It has been fixed.
[GitHub] spark pull request: [SPARK-8428][SPARK-13850] Fix integer overflow...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13336#issuecomment-222045184 Seems it breaks the 1.6 build?
[GitHub] spark pull request: [SPARK-15594][SQL] ALTER TABLE SERDEPROPERTIES...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13343#discussion_r64849728 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -871,6 +879,58 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach { } } + private def testSetSerdePartition(isDatasourceTable: Boolean): Unit = { +val catalog = spark.sessionState.catalog +val tableIdent = TableIdentifier("tab1", Some("dbx")) +val spec = Map("a" -> "1", "b" -> "2") +createDatabase(catalog, "dbx") +createTable(catalog, tableIdent) +createTablePartition(catalog, spec, tableIdent) +createTablePartition(catalog, Map("a" -> "1", "b" -> "3"), tableIdent) +createTablePartition(catalog, Map("a" -> "2", "b" -> "2"), tableIdent) +createTablePartition(catalog, Map("a" -> "2", "b" -> "3"), tableIdent) +if (isDatasourceTable) { + convertToDatasourceTable(catalog, tableIdent) +} +assert(catalog.getPartition(tableIdent, spec).storage.serde.isEmpty) +assert(catalog.getPartition(tableIdent, spec).storage.serdeProperties.isEmpty) +// set table serde and/or properties (should fail on datasource tables) +if (isDatasourceTable) { + val e1 = intercept[AnalysisException] { +sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'whatever'") + } + val e2 = intercept[AnalysisException] { +sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'org.apache.madoop' " + + "WITH SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')") + } + assert(e1.getMessage.contains("datasource")) + assert(e2.getMessage.contains("datasource")) +} else { + sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'org.apache.jadoop'") + assert(catalog.getPartition(tableIdent, spec).storage.serde == Some("org.apache.jadoop")) + assert(catalog.getPartition(tableIdent, spec).storage.serdeProperties.isEmpty) + sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) SET SERDE 'org.apache.madoop' " + +"WITH SERDEPROPERTIES ('k' = 'v', 'kay' = 'vee')") + assert(catalog.getPartition(tableIdent, spec).storage.serde == Some("org.apache.madoop")) + assert(catalog.getPartition(tableIdent, spec).storage.serdeProperties == +Map("k" -> "v", "kay" -> "vee")) +} +// set serde properties only +sql("ALTER TABLE dbx.tab1 PARTITION (a=1, b=2) " + + "SET SERDEPROPERTIES ('k' = 'vvv', 'kay' = 'vee')") --- End diff -- Do data source tables allow this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15583][SQL] Disallow altering datasourc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13341#issuecomment-222044580 lgtm
spark git commit: [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate
Repository: spark Updated Branches: refs/heads/branch-2.0 4939c2a12 -> d3cd579d7 [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate ## What changes were proposed in this pull request? This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext. ## How was this patch tested? Existing tests. Author: Yin Huai <yh...@databricks.com> Closes #13310 from yhuai/SPARK-15532. (cherry picked from commit 3ac2363d757cc9cebc627974f17ecda3a263efdf) Signed-off-by: Yin Huai <yh...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d3cd579d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d3cd579d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d3cd579d Branch: refs/heads/branch-2.0 Commit: d3cd579d7cdc1d2642186041dfbe296b266e4069 Parents: 4939c2a Author: Yin Huai <yh...@databricks.com> Authored: Thu May 26 16:53:31 2016 -0700 Committer: Yin Huai <yh...@databricks.com> Committed: Thu May 26 16:54:03 2016 -0700 -- .../spark/mllib/classification/LogisticRegression.scala | 2 +- project/MimaExcludes.scala| 2 ++ .../src/main/scala/org/apache/spark/sql/SQLContext.scala | 10 ++ .../main/scala/org/apache/spark/sql/SparkSession.scala| 4 ++-- .../scala/org/apache/spark/sql/hive/HiveContext.scala | 10 -- .../scala/org/apache/spark/sql/hive/test/TestHive.scala | 9 - 6 files changed, 15 insertions(+), 22 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d3cd579d/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala index adbcdd3..4bba2ea 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala @@ -437,7 +437,7 @@ class LogisticRegressionWithLBFGS lr.setMaxIter(optimizer.getNumIterations()) lr.setTol(optimizer.getConvergenceTol()) // Convert our input into a DataFrame -val sqlContext = new SQLContext(input.context) +val sqlContext = SQLContext.getOrCreate(input.context) import sqlContext.implicits._ val df = input.map(_.asML).toDF() // Determine if we should cache the DF http://git-wip-us.apache.org/repos/asf/spark/blob/d3cd579d/project/MimaExcludes.scala -- diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala index 08c575a..73debe9 100644 --- a/project/MimaExcludes.scala +++ b/project/MimaExcludes.scala @@ -54,6 +54,8 @@ object MimaExcludes { ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.rdd.RDD.coalesce"), ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.rdd.PartitionCoalescer$LocationIterator"), ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.rdd.PartitionCoalescer"), +// SPARK-15532 Remove isRootContext flag from SQLContext. 
+ ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.SQLContext.isRootContext"), // SPARK-12600 Remove SQL deprecated methods ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.SQLContext$QueryExecution"), ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.SQLContext$SparkPlanner"), http://git-wip-us.apache.org/repos/asf/spark/blob/d3cd579d/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index b17fb8a..66d9aa2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -57,9 +57,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager * @groupname Ungrouped Support functions for language integrated queries * @since 1.0.0 */ -class SQLContext private[sql]( -val sparkSession: SparkSession, -val isRootContext: Boolean) +class SQLContext private[sql](val sparkSession: SparkSession) extends Logging with Serializable { self => @@ -69,13 +67,9 @@ class SQLContext private
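For context, an illustrative usage sketch of the entry points involved (the app name and setup are made up): SparkSession.builder().getOrCreate() is the recommended Spark 2.0 entry point, and the deprecated public SQLContext constructor now simply wraps a SparkSession.

    import org.apache.spark.sql.{SparkSession, SQLContext}

    object EntryPointExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("entry-point-example")
          .getOrCreate()

        // Still compiles for legacy code paths, but is deprecated and delegates to a SparkSession.
        val legacy: SQLContext = new SQLContext(spark.sparkContext)

        spark.stop()
      }
    }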
[GitHub] spark pull request: [SPARK-15532] [SQL] SQLContext/HiveContext's p...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13310#issuecomment-222026576 Merging to master and branch 2.0.
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13310#discussion_r64823932 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -69,13 +67,30 @@ class SQLContext private[sql]( // Note: Since Spark 2.0 this class has become a wrapper of SparkSession, where the // real functionality resides. This class remains mainly for backward compatibility. - private[sql] def this(sparkSession: SparkSession) = { -this(sparkSession, true) - } - @deprecated("Use SparkSession.builder instead", "2.0.0") def this(sc: SparkContext) = { -this(new SparkSession(sc)) +this { + val session = new SparkSession(sc) + // If spark.sql.allowMultipleContexts is true, we will throw an exception if a user + // wants to create a new root SQLContext (a SQLContext that is not created by newSession). + val allowMultipleContexts = sc.conf.getBoolean( +SQLConf.ALLOW_MULTIPLE_CONTEXTS.key, +SQLConf.ALLOW_MULTIPLE_CONTEXTS.defaultValue.get) + + // Assert no root SQLContext/SparkSession is running when allowMultipleContexts is false. + { +val defaultSessionExists = SparkSession.getDefaultSession.isDefined +if (!allowMultipleContexts && defaultSessionExists) { + val errMsg = sc.conf.get( +SQLConf.ALLOW_MULTIPLE_CONTEXTS_ERROR_MESSAGE.key, +SQLConf.ALLOW_MULTIPLE_CONTEXTS_ERROR_MESSAGE.defaultValue.get) + throw new SparkException(errMsg) +} else if (!defaultSessionExists) { + SparkSession.setDefaultSession(session) +} + } + session +} --- End diff -- We need to do the same thing for HiveContext. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
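An illustrative sketch of the user-facing behavior being reviewed in this revision of the PR (the conf name comes from the diff above; whether the second construction throws depends on the proposed check being merged):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object MultipleContextsExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[*]")
          .setAppName("multiple-contexts-example")
          .set("spark.sql.allowMultipleContexts", "false")
        val sc = new SparkContext(conf)

        val first = new SQLContext(sc)     // becomes the default session
        // Under the proposed check, a second root context on the same SparkContext would throw:
        // val second = new SQLContext(sc)

        sc.stop()
      }
    }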
[GitHub] spark pull request: [SPARK-15538][SPARK-15539][SQL] Truncate table...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13315#issuecomment-221997409 lgtm
[GitHub] spark pull request: [SPARK-13445][SQL]Improves error message and a...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1#issuecomment-221968733 lgtm
[GitHub] spark pull request: [SPARK-14543] [SQL] Improve InsertIntoTable co...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12313#issuecomment-221966605 @rdblue Thank you for this PR. Those improvements sound good. I chatted with others. Here are two questions. 1. `sqlContext.table("source").write.byName.insertInto("destination")` will change the behavior of insertInto and make it different from SQL's insertInto. It looks like it is better to add another command instead of changing the behavior of insertInto in this way. 2. When we use resolution by name, it seems we will complain if the data has fewer fields. What will happen if the data has more fields? How about we first improve the position-based resolution part and then discuss the name-based part separately?
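To make the position-versus-name distinction concrete, an illustrative sketch using only existing APIs (table names are placeholders; the proposed .byName call is deliberately not used here): insertInto matches columns by position, so reordering the source by the destination's column names is the usual way to get name-based behavior today.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.col

    object InsertByNameExample {
      def run(sqlContext: SQLContext): Unit = {
        val destColumns = sqlContext.table("destination").columns
        sqlContext.table("source")
          .select(destColumns.map(col): _*)   // align column order with the destination schema
          .write
          .insertInto("destination")          // the positional insert now lines up by name
      }
    }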
[GitHub] spark pull request: [SPARK-15538][SPARK-15539][SQL] Truncate table...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13315#discussion_r64794004 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -289,37 +289,53 @@ case class TruncateTableCommand( val catalog = sparkSession.sessionState.catalog if (!catalog.tableExists(tableName)) { throw new AnalysisException(s"Table '$tableName' in TRUNCATE TABLE does not exist.") -} else if (catalog.isTemporaryTable(tableName)) { +} +if (catalog.isTemporaryTable(tableName)) { throw new AnalysisException( s"Operation not allowed: TRUNCATE TABLE on temporary tables: '$tableName'") +} +val table = catalog.getTableMetadata(tableName) +if (table.tableType == CatalogTableType.EXTERNAL) { + throw new AnalysisException( +s"Operation not allowed: TRUNCATE TABLE on external tables: '$tableName'") +} +if (table.tableType == CatalogTableType.VIEW) { + throw new AnalysisException( +s"Operation not allowed: TRUNCATE TABLE on views: '$tableName'") +} +if (DDLUtils.isDatasourceTable(table) && partitionSpec.isDefined) { + throw new AnalysisException( +s"Operation not allowed: TRUNCATE TABLE ... PARTITION is not supported " + +s"for tables created using the data sources API: '$tableName'") +} +val locations = if (partitionSpec.isDefined) { + catalog.listPartitions(tableName, partitionSpec).map(_.storage.locationUri) } else { - val locations = if (partitionSpec.isDefined) { -catalog.listPartitions(tableName, partitionSpec).map(_.storage.locationUri) + if (table.partitionColumnNames.nonEmpty) { +catalog.listPartitions(tableName).map(_.storage.locationUri) } else { -val table = catalog.getTableMetadata(tableName) -if (table.partitionColumnNames.nonEmpty) { - catalog.listPartitions(tableName).map(_.storage.locationUri) -} else { - Seq(table.storage.locationUri) -} +Seq(table.storage.locationUri) } - val hadoopConf = sparkSession.sessionState.newHadoopConf() - locations.foreach { location => -if (location.isDefined) { - val path = new Path(location.get) - try { -val fs = path.getFileSystem(hadoopConf) -fs.delete(path, true) -fs.mkdirs(path) - } catch { -case NonFatal(e) => - throw new AnalysisException( -s"Failed to truncate table '$tableName' when removing data of the path: $path " + - s"because of ${e.toString}") - } +} +val hadoopConf = sparkSession.sessionState.newHadoopConf() +locations.foreach { location => + if (location.isDefined) { +val path = new Path(location.get) +try { + val fs = path.getFileSystem(hadoopConf) + fs.delete(path, true) + fs.mkdirs(path) +} catch { + case NonFatal(e) => +throw new AnalysisException( + s"Failed to truncate table '$tableName' when removing data of the path: $path " + +s"because of ${e.toString}") } } } +// After deleting the data, invalidate the table to make sure we don't keep around a stale +// file relation in the metastore cache. +sparkSession.sessionState.invalidateTable(tableName.unquotedString) --- End diff -- Seems we also need to drop the cached data in the InMemoryColumnarStore? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15538][SPARK-15539][SQL] Truncate table...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13315#discussion_r64793761 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -289,37 +289,53 @@ case class TruncateTableCommand( val catalog = sparkSession.sessionState.catalog if (!catalog.tableExists(tableName)) { throw new AnalysisException(s"Table '$tableName' in TRUNCATE TABLE does not exist.") -} else if (catalog.isTemporaryTable(tableName)) { +} +if (catalog.isTemporaryTable(tableName)) { throw new AnalysisException( s"Operation not allowed: TRUNCATE TABLE on temporary tables: '$tableName'") +} +val table = catalog.getTableMetadata(tableName) +if (table.tableType == CatalogTableType.EXTERNAL) { + throw new AnalysisException( +s"Operation not allowed: TRUNCATE TABLE on external tables: '$tableName'") +} +if (table.tableType == CatalogTableType.VIEW) { + throw new AnalysisException( +s"Operation not allowed: TRUNCATE TABLE on views: '$tableName'") +} +if (DDLUtils.isDatasourceTable(table) && partitionSpec.isDefined) { + throw new AnalysisException( +s"Operation not allowed: TRUNCATE TABLE ... PARTITION is not supported " + +s"for tables created using the data sources API: '$tableName'") +} +val locations = if (partitionSpec.isDefined) { + catalog.listPartitions(tableName, partitionSpec).map(_.storage.locationUri) } else { - val locations = if (partitionSpec.isDefined) { -catalog.listPartitions(tableName, partitionSpec).map(_.storage.locationUri) + if (table.partitionColumnNames.nonEmpty) { +catalog.listPartitions(tableName).map(_.storage.locationUri) --- End diff -- This is for hive table, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15538][SPARK-15539][SQL] Truncate table...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13315#issuecomment-221949749 {{sql("truncate table emp16") }} should delete all partitions, right?
[GitHub] spark pull request: [SPARK-9876][SQL]: Update Parquet to 1.8.1.
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13280#issuecomment-221935404 @rdblue Is there any perf evaluation of this new version that we can refer to?
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13310#discussion_r64783746 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -70,6 +70,27 @@ object SQLConf { .intConf .createWithDefault(10) + val ALLOW_MULTIPLE_CONTEXTS = SQLConfigBuilder("spark.sql.allowMultipleContexts") +.doc("When set to true, creating multiple SQLContexts/HiveContexts is allowed. " + + "When set to false, only one SQLContext/HiveContext is allowed to be created " + + "through the constructor (new SQLContexts/HiveContexts created through newSession " + + "method is allowed). Please note that this conf needs to be set in Spark Conf. Once " + + "a SQLContext/HiveContext has been created, changing the value of this conf will not " + + "have effect.") +.booleanConf +.createWithDefault(false) --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15534][SPARK-15535][SQL] Truncate table...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13302#discussion_r64688076 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -288,9 +288,10 @@ case class TruncateTableCommand( override def run(sparkSession: SparkSession): Seq[Row] = { val catalog = sparkSession.sessionState.catalog if (!catalog.tableExists(tableName)) { - logError(s"table '$tableName' in TRUNCATE TABLE does not exist.") + throw new AnalysisException(s"Table '$tableName' in TRUNCATE TABLE does not exist.") --- End diff -- Sorry. Just realized that I had a typo: `For drop table with IF EXISTS keyword` should be `For drop table without IF EXISTS keyword`. It seems that without the IF EXISTS keyword, Hive (I tried 1.2.1) does not throw an exception if the table does not exist.
[GitHub] spark pull request: [SQL] Prevent illegal NULL propagation when fi...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13290#discussion_r64687394 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1448,6 +1450,37 @@ class Analyzer( } /** + * Fixes nullability of Attributes in a resolved LogicalPlan by using the nullability of + * corresponding Attributes of its children output Attributes. This step is needed because + * users can use a resolved AttributeReference in the Dataset API and outer joins + * can change the nullability of an AttribtueReference. Without the fix, a nullable column's + * nullable field can be actually set as non-nullable, which cause illegal optimization + * (e.g., NULL propagation) and wrong answers. + * See SPARK-13484 and SPARK-13801 for the concrete queries of this case. + */ + object FixNullability extends Rule[LogicalPlan] { + +def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { + case q: LogicalPlan if q.resolved => +val childrenOutput = q.children.flatMap(c => c.output).groupBy(_.exprId).flatMap { + case (exprId, attributes) => +// If there are multiple Attributes having the same ExpirId, we need to resolve +// the conflict of nullable field. +val nullable = attributes.map(_.nullable).reduce(_ || _) --- End diff -- I feel it is not very possible. Let me think about it more. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
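A tiny worked illustration of the reduce(_ || _) step quoted above (the values are made up): when the same exprId is seen from several children with different nullability, for example after an outer join makes one side's column nullable, the merged attribute must stay nullable if any copy is.

    // Two occurrences of the same exprId as seen from different children:
    val nullableFlags = Seq(false, true)
    val merged = nullableFlags.reduce(_ || _)   // true: the attribute is kept nullable
    assert(merged)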
[GitHub] spark pull request: [SQL] Prevent illegal NULL propagation when fi...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13290#discussion_r64686215 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1448,6 +1450,37 @@ class Analyzer( } /** + * Fixes nullability of Attributes in a resolved LogicalPlan by using the nullability of + * corresponding Attributes of its children output Attributes. This step is needed because + * users can use a resolved AttributeReference in the Dataset API and outer joins + * can change the nullability of an AttribtueReference. Without the fix, a nullable column's + * nullable field can be actually set as non-nullable, which cause illegal optimization + * (e.g., NULL propagation) and wrong answers. + * See SPARK-13484 and SPARK-13801 for the concrete queries of this case. + */ + object FixNullability extends Rule[LogicalPlan] { + +def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { + case q: LogicalPlan if q.resolved => +val childrenOutput = q.children.flatMap(c => c.output).groupBy(_.exprId).flatMap { --- End diff -- Yea. This version tries to fix the nullability for the entire query plan tree because I think it is possible to hit this problem in other cases when using Dataset API. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15539][SQL] DROP TABLE throw exception ...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13307#issuecomment-221753653 Seems we should convert those disabled tests to our tests? Or we should put those disabled tests to HiveQuerySuite or HiveWindowFunctionQuerySuite? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13310#issuecomment-221753174 oh, some ml tests failed. Let me take a look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13310#discussion_r64675651 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala --- @@ -103,7 +103,11 @@ class SparkSession private( * A wrapped version of this session in the form of a [[SQLContext]], for backward compatibility. */ @transient - private[sql] val sqlContext: SQLContext = new SQLContext(this) + private[sql] val sqlContext: SQLContext = existingSharedState match { +// If we have an existing SharedState, we should not create a root SQLContext at here. +case Some(sharedState) => new SQLContext(this, false) +case None => new SQLContext(this, true) --- End diff -- I am removing isRootContext flag. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13310#discussion_r64675631 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/MultiSQLContextsSuite.scala --- @@ -0,0 +1,96 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark._ +import org.apache.spark.sql.internal.SQLConf + +class MultiSQLContextsSuite extends SparkFunSuite with BeforeAndAfterAll { --- End diff -- sure --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13310#discussion_r64675634 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -70,6 +70,27 @@ object SQLConf { .intConf .createWithDefault(10) + val ALLOW_MULTIPLE_CONTEXTS = SQLConfigBuilder("spark.sql.allowMultipleContexts") +.doc("When set to true, creating multiple SQLContexts/HiveContexts is allowed. " + + "When set to false, only one SQLContext/HiveContext is allowed to be created " + + "through the constructor (new SQLContexts/HiveContexts created through newSession " + + "method is allowed). Please note that this conf needs to be set in Spark Conf. Once " + + "a SQLContext/HiveContext has been created, changing the value of this conf will not " + + "have effect.") +.booleanConf +.createWithDefault(false) + + val ALLOW_MULTIPLE_CONTEXTS_ERROR_MESSAGE = +SQLConfigBuilder("spark.sql.allowMultipleContexts.errorMessage") + .doc("The error message used by the exception when spark.sql.allowMultipleContexts is " + +"set to false.") + .stringConf --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13310#discussion_r64671030 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -70,6 +70,27 @@ object SQLConf { .intConf .createWithDefault(10) + val ALLOW_MULTIPLE_CONTEXTS = SQLConfigBuilder("spark.sql.allowMultipleContexts") +.doc("When set to true, creating multiple SQLContexts/HiveContexts is allowed. " + + "When set to false, only one SQLContext/HiveContext is allowed to be created " + + "through the constructor (new SQLContexts/HiveContexts created through newSession " + + "method is allowed). Please note that this conf needs to be set in Spark Conf. Once " + + "a SQLContext/HiveContext has been created, changing the value of this conf will not " + + "have effect.") +.booleanConf +.createWithDefault(false) + + val ALLOW_MULTIPLE_CONTEXTS_ERROR_MESSAGE = +SQLConfigBuilder("spark.sql.allowMultipleContexts.errorMessage") --- End diff -- This name is not great. I am open to any suggestion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/13310 [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE_CONTEXTS back and make the error message configurable ## What changes were proposed in this pull request? This PR adds `spark.sql.allowMultipleContexts` back. ## How was this patch tested? MultiSQLContextsSuite You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-15532 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13310.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13310 commit 313e59f0c74e2344a58b81b41162f8f854ed180b Author: Yin Huai <yh...@databricks.com> Date: 2016-05-25T23:32:03Z [SPARK-15532] [SQL] Add SQLConf.ALLOW_MULTIPLE_CONTEXTS back and make the error message configurable --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
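A hedged usage sketch of the restored conf: the name `spark.sql.allowMultipleContexts` comes from this PR, while the master URL, app name, and the rest are illustrative. Because it is read from the Spark conf, it has to be set before the first SQLContext/HiveContext is created; flipping it afterwards has no effect, per the conf's documentation quoted earlier.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setMaster("local[2]")                          // illustrative settings
      .setAppName("allow-multiple-contexts-demo")
      .set("spark.sql.allowMultipleContexts", "true")

    val sc = new SparkContext(conf)
    val ctx1 = new SQLContext(sc)
    val ctx2 = new SQLContext(sc)  // only legal because the flag above is true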
[GitHub] spark pull request: [SPARK-15534][SPARK-15535][SQL] Truncate table...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13302#discussion_r64667194 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -288,9 +288,10 @@ case class TruncateTableCommand( override def run(sparkSession: SparkSession): Seq[Row] = { val catalog = sparkSession.sessionState.catalog if (!catalog.tableExists(tableName)) { - logError(s"table '$tableName' in TRUNCATE TABLE does not exist.") + throw new AnalysisException(s"Table '$tableName' in TRUNCATE TABLE does not exist.") --- End diff -- For drop table with IF EXISTS keyword, it is unfortunate that Hive does not throw an exception when the table does not exist and changing it breaks lots of tests in HiveCompatibilitySuite. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15534][SPARK-15535][SQL] Truncate table...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13302#issuecomment-221700058 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13484][SQL] Prevent illegal NULL propag...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/11371#discussion_r64521117 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1443,6 +1445,32 @@ class Analyzer( } /** + * Corrects attribute references in an expression tree of some operators (e.g., filters and + * projects) if these operators have a join as a child and the references point to columns on the + * input relation of the join. This is because some joins change the nullability of input columns + * and this could cause illegal optimization (e.g., NULL propagation) and wrong answers. + * See SPARK-13484 and SPARK-13801 for the concrete queries of this case. + */ + object SolveIllegalReferences extends Rule[LogicalPlan] { + +private def replaceReferences(e: Expression, attrMap: AttributeMap[Attribute]) = e.transform { + case a: AttributeReference => attrMap.get(a).getOrElse(a) +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case q: LogicalPlan => +q.transform { + case f @ Filter(filterCondition, ExtractJoinOutputAttributes(join, joinOutputMap)) => +f.copy(condition = replaceReferences(filterCondition, joinOutputMap)) --- End diff -- https://github.com/apache/spark/pull/13290/files --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL] Prevent illegal NULL propagation when fi...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/13290 [SQL] Prevent illegal NULL propagation when filtering outer-join results ## What changes were proposed in this pull request? This is another approach for addressing SPARK-13484 (the first approach is https://github.com/apache/spark/pull/11371). You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-13484 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13290.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13290 commit 2dcecc8a934d6224d7680f5d25b7a98c53779d79 Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-02-25T15:56:29Z Avoid illegal NULL propagation commit 1ebfa80e8d52d959ffc98d191019b9e4242dd4a6 Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-02-26T07:48:11Z Add comments commit ab5d4f1174839abbd7f33cf587f75ae75cc76b68 Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-02-27T15:16:54Z Add a new rule to solve illegal references commit 8850cb3046a302afd012e4508bf8bfbd9f59051d Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-03-02T06:19:59Z Use foreach not map commit b636b34bbe09f4152567b15b1fae30f76f7ffe55 Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-04-15T02:17:01Z Solve illegal references in Projects commit dfd952870d0b157dc8571ddee053a9fe91d372ed Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-04-15T04:28:38Z Add tests in DataFrameJoinSuite commit 78ed4ef7a5d6c4565efd468cc48d6bfd1321f08b Author: Takeshi YAMAMURO <linguin@gmail.com> Date: 2016-04-15T08:44:34Z Fix test codes in ResolveNaturalJoinSuite commit 3cec4cce8126faae615990c56f829368f306b55e Author: Yin Huai <yh...@databricks.com> Date: 2016-05-25T06:16:30Z Try to have a rule to fix nullability --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13484][SQL] Prevent illegal NULL propag...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/11371#discussion_r64506496 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1443,6 +1445,32 @@ class Analyzer( } /** + * Corrects attribute references in an expression tree of some operators (e.g., filters and + * projects) if these operators have a join as a child and the references point to columns on the + * input relation of the join. This is because some joins change the nullability of input columns + * and this could cause illegal optimization (e.g., NULL propagation) and wrong answers. + * See SPARK-13484 and SPARK-13801 for the concrete queries of this case. + */ + object SolveIllegalReferences extends Rule[LogicalPlan] { + +private def replaceReferences(e: Expression, attrMap: AttributeMap[Attribute]) = e.transform { + case a: AttributeReference => attrMap.get(a).getOrElse(a) +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case q: LogicalPlan => +q.transform { + case f @ Filter(filterCondition, ExtractJoinOutputAttributes(join, joinOutputMap)) => +f.copy(condition = replaceReferences(filterCondition, joinOutputMap)) --- End diff -- How about we use a `q.transformUp` to fix the nullability in a bottom-up way? For every node, we create an `AttributeMap` using the output of its child. Then, we use `transformExpressions` to fix the nullability if necessary. Let me try it out and ping you when I have a version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
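A rough sketch of the bottom-up idea described in this comment -- a hypothetical rule body, not the code that was eventually merged: walk the plan with transformUp, build an AttributeMap from each node's children output, and rewrite AttributeReferences against it. For brevity it skips the merge of conflicting nullability across duplicate ExprIds, which the FixNullability rule in PR #13290 handles.

    import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Sketch: fix stale nullability by re-resolving references against children output.
    object FixNullabilitySketch extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
        case q: LogicalPlan if q.resolved =>
          val childAttrs = AttributeMap(q.children.flatMap(_.output).map(a => (a, a)))
          q.transformExpressions {
            case a: AttributeReference => childAttrs.getOrElse(a, a)
          }
      }
    }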
[GitHub] spark pull request: [SPARK-15498][TESTS] fix slow tests
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13273#issuecomment-221324600 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r64325475 --- Diff: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala --- @@ -238,4 +238,23 @@ class CliSuite extends SparkFunSuite with BeforeAndAfterAll with Logging { runCliWithin(2.minute, Seq("-e", "!echo \"This is a test for Spark-11624\";"))( "" -> "This is a test for Spark-11624") } + + test("list jars") { +val jarFile = Thread.currentThread().getContextClassLoader.getResource("TestUDTF.jar") +runCliWithin(2.minute)( + s"ADD JAR $jarFile" -> "", + s"LIST JARS" -> "TestUDTF.jar", + s"List JAR $jarFile" -> "TestUDTF.jar" +) + } + + test("list files") { +val dataFilePath = Thread.currentThread().getContextClassLoader + .getResource("data/files/small_kv.txt") +runCliWithin(2.minute)( + s"ADD FILE $dataFilePath" -> "", + s"LIST FILES" -> "small_kv.txt", + s"LIST FILE $dataFilePath" -> "small_kv.txt" +) + } --- End diff -- Seems it is failing? Can you take a look? https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/1102/testReport/junit/org.apache.spark.sql.hive.thriftserver/CliSuite/list_files/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13236#issuecomment-221133646 @rdblue Thanks. That is a good point :) I am closing this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai closed the pull request at: https://github.com/apache/spark/pull/13236 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13236#issuecomment-221080895 We support different versions of Hive metastore (for example, Spark depends on Hive 1.2.1 but we can talk to a metastore using Hive 0.12). Also, the Hadoop used by this client can be different from the Hadoop used by Spark. We have an IsolatedClientLoader for class isolation purposes. So, it is possible that we have to reload Hadoop classes (those classes are from a different version of Hadoop than the Hadoop used by Spark). This is the reason that we need an option that allows us to load Hadoop classes instead of always sharing Hadoop classes. Because of this, we cannot pass Configuration to HiveClientImpl. Here, the Hadoop Configuration is only a container for the user's Hadoop settings. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13236#issuecomment-221069266 I can create a Hadoop Configuration inside HiveClientImpl and use it to create the HiveConf. The main issue is that we cannot pass a Hadoop Configuration in a HiveClientImpl because there may be two versions of `Configuration` loaded by two classloaders. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
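A hedged sketch of the workaround described here -- a hypothetical helper, not the actual HiveClientImpl code: only plain strings cross the IsolatedClientLoader boundary, and the Configuration/HiveConf instances are created on the isolated side, so whichever Hadoop version that classloader resolves is the one used.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hive.conf.HiveConf

    // Rebuild the user's Hadoop settings from key/value pairs on the isolated side.
    def newHiveConf(settings: Map[String, String]): HiveConf = {
      val hadoopConf = new Configuration()
      settings.foreach { case (k, v) => hadoopConf.set(k, v) }
      new HiveConf(hadoopConf, classOf[HiveConf])
    }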
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13236#issuecomment-221050299 @rdblue Thank you for looking at this! The reason that I added the flag to disable sharing Hadoop classes is that the Hadoop versions used by Spark and by the metastore client may not be binary compatible (e.g., Hadoop 2 used by Spark and Hadoop 1 used by the metastore client). HiveClientImpl is a wrapper around Hive's metastore client. As long as we can propagate user-set configurations to its internal HiveConf (to make it talk to the metastore correctly), it is fine. Regarding the classloader, we actually do not use the classloader originally associated with the Hadoop Configuration and we always explicitly set the classloader associated with the HiveConf created inside HiveClientImpl. Regarding Configuration.addDefaultResource, before 2.0, we did not pass Hadoop Configuration to HiveClientImpl (it was called ClientWrapper in Spark 1.6). Since we were not relying on Configuration.addDefaultResource, using a Map should not change anything. btw, I am fine with changing the default value of the flag to true (sharing Hadoop classes) if you think it represents common use cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13236#discussion_r64157490 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -73,7 +74,7 @@ import org.apache.spark.util.{CircularBuffer, Utils} private[hive] class HiveClientImpl( override val version: HiveVersion, sparkConf: SparkConf, -hadoopConf: Configuration, +hadoopConf: Map[String, String], --- End diff -- Also, the reason that I am adding a flag to disable Hadoop class sharing is that the versions used by Spark and the HiveClientImpl can be different. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13236#discussion_r64157471 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -73,7 +74,7 @@ import org.apache.spark.util.{CircularBuffer, Utils} private[hive] class HiveClientImpl( override val version: HiveVersion, sparkConf: SparkConf, -hadoopConf: Configuration, +hadoopConf: Map[String, String], --- End diff -- @rdblue FYI. I am changing the type from `Configuration` to `Map[String, String]` because it is possible that Hadoop classes are not shared between Spark side and the `IsolatedClientLoader` side. When Hadoop classes are not shared, if we pass a `Configuration` object to `HiveClientImpl`, the class of this `Configuration` object (this class is loaded by Spark's classloader) is different from the `Configuration` class loaded by `IsolatedClientLoader`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
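To make the classloader point concrete, a small illustrative sketch (the jar path is hypothetical and this is not code from the PR): the same fully qualified class name loaded by two different classloaders yields two distinct Class objects, so an instance created on the Spark side is not type-compatible with the isolated copy.

    import java.net.{URL, URLClassLoader}

    // Hypothetical jar path; a null parent means this loader does not share
    // Spark's copy of the Hadoop classes (roughly what IsolatedClientLoader
    // does when Hadoop classes are not shared).
    val isolatedLoader = new URLClassLoader(
      Array(new URL("file:/path/to/hadoop-common.jar")), null)

    val sparkSide = Class.forName("org.apache.hadoop.conf.Configuration",
      true, getClass.getClassLoader)
    val isolated = Class.forName("org.apache.hadoop.conf.Configuration",
      true, isolatedLoader)

    // Same name, two distinct classes: a Configuration built on the Spark side
    // is not an instance of the isolated class, which is why only a
    // Map[String, String] is handed across the boundary.
    assert(sparkSide != isolated)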
spark git commit: [SPARK-15280] Input/Output] Refactored OrcOutputWriter and moved serialization to a new class.
Repository: spark Updated Branches: refs/heads/master 201a51f36 -> c18fa464f [SPARK-15280] Input/Output] Refactored OrcOutputWriter and moved serialization to a new class. ## What changes were proposed in this pull request? Refactoring: Separated ORC serialization logic from OrcOutputWriter and moved to a new class called OrcSerializer. ## How was this patch tested? Manual tests & existing tests. Author: Ergin SeyfeCloses #13066 from seyfe/orc_serializer. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c18fa464 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c18fa464 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c18fa464 Branch: refs/heads/master Commit: c18fa464f404ed2612f8c4d355cb0544b355975b Parents: 201a51f Author: Ergin Seyfe Authored: Sat May 21 16:08:31 2016 -0700 Committer: Yin Huai Committed: Sat May 21 16:08:31 2016 -0700 -- .../apache/spark/sql/hive/orc/OrcRelation.scala | 84 +++- 1 file changed, 45 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c18fa464/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala index 6e55137..38f50c1 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala @@ -149,39 +149,70 @@ private[sql] class DefaultSource } } -private[orc] class OrcOutputWriter( -path: String, -bucketId: Option[Int], -dataSchema: StructType, -context: TaskAttemptContext) - extends OutputWriter with HiveInspectors { +private[orc] class OrcSerializer(dataSchema: StructType, conf: Configuration) + extends HiveInspectors { + + def serialize(row: InternalRow): Writable = { +wrapOrcStruct(cachedOrcStruct, structOI, row) +serializer.serialize(cachedOrcStruct, structOI) + } - private val serializer = { + private[this] val serializer = { val table = new Properties() table.setProperty("columns", dataSchema.fieldNames.mkString(",")) table.setProperty("columns.types", dataSchema.map(_.dataType.catalogString).mkString(":")) val serde = new OrcSerde -val configuration = context.getConfiguration -serde.initialize(configuration, table) +serde.initialize(conf, table) serde } - // Object inspector converted from the schema of the relation to be written. - private val structOI = { + // Object inspector converted from the schema of the relation to be serialized. 
+ private[this] val structOI = { val typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(dataSchema.catalogString) OrcStruct.createObjectInspector(typeInfo.asInstanceOf[StructTypeInfo]) .asInstanceOf[SettableStructObjectInspector] } + private[this] val cachedOrcStruct = structOI.create().asInstanceOf[OrcStruct] + + private[this] def wrapOrcStruct( + struct: OrcStruct, + oi: SettableStructObjectInspector, + row: InternalRow): Unit = { +val fieldRefs = oi.getAllStructFieldRefs +var i = 0 +while (i < fieldRefs.size) { + + oi.setStructFieldData( +struct, +fieldRefs.get(i), +wrap( + row.get(i, dataSchema(i).dataType), + fieldRefs.get(i).getFieldObjectInspector, + dataSchema(i).dataType)) + i += 1 +} + } +} + +private[orc] class OrcOutputWriter( +path: String, +bucketId: Option[Int], +dataSchema: StructType, +context: TaskAttemptContext) + extends OutputWriter { + + private[this] val conf = context.getConfiguration + + private[this] val serializer = new OrcSerializer(dataSchema, conf) + // `OrcRecordWriter.close()` creates an empty file if no rows are written at all. We use this // flag to decide whether `OrcRecordWriter.close()` needs to be called. private var recordWriterInstantiated = false private lazy val recordWriter: RecordWriter[NullWritable, Writable] = { recordWriterInstantiated = true - -val conf = context.getConfiguration val uniqueWriteJobId = conf.get("spark.sql.sources.writeJobUUID") val taskAttemptId = context.getTaskAttemptID val partition = taskAttemptId.getTaskID.getId @@ -206,33 +237,8 @@ private[orc] class OrcOutputWriter( override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal") - private def wrapOrcStruct( - struct: OrcStruct, - oi: SettableStructObjectInspector, - row: InternalRow): Unit = { -val fieldRefs =
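A hedged usage sketch of the class extracted in the commit above -- the constructor and serialize signature are taken from the diff, while the schema and row are illustrative. Since OrcSerializer is package-private, this would only compile inside org.apache.spark.sql.hive.orc.

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    import org.apache.spark.unsafe.types.UTF8String

    val dataSchema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType)))

    // One serializer per writer; serialize() reuses a cached OrcStruct internally
    // and hands it to OrcSerde to produce a Writable.
    val serializer = new OrcSerializer(dataSchema, new Configuration())
    val writable = serializer.serialize(InternalRow(1, UTF8String.fromString("spark")))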
spark git commit: [SPARK-15280] Input/Output] Refactored OrcOutputWriter and moved serialization to a new class.
Repository: spark Updated Branches: refs/heads/branch-2.0 4148a9c2c -> 6871deb93 [SPARK-15280] Input/Output] Refactored OrcOutputWriter and moved serialization to a new class. ## What changes were proposed in this pull request? Refactoring: Separated ORC serialization logic from OrcOutputWriter and moved to a new class called OrcSerializer. ## How was this patch tested? Manual tests & existing tests. Author: Ergin SeyfeCloses #13066 from seyfe/orc_serializer. (cherry picked from commit c18fa464f404ed2612f8c4d355cb0544b355975b) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6871deb9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6871deb9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6871deb9 Branch: refs/heads/branch-2.0 Commit: 6871deb937fd6d6185b1d2a7a2ea36535ce303ea Parents: 4148a9c Author: Ergin Seyfe Authored: Sat May 21 16:08:31 2016 -0700 Committer: Yin Huai Committed: Sat May 21 16:08:51 2016 -0700 -- .../apache/spark/sql/hive/orc/OrcRelation.scala | 84 +++- 1 file changed, 45 insertions(+), 39 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6871deb9/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala index 6e55137..38f50c1 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala @@ -149,39 +149,70 @@ private[sql] class DefaultSource } } -private[orc] class OrcOutputWriter( -path: String, -bucketId: Option[Int], -dataSchema: StructType, -context: TaskAttemptContext) - extends OutputWriter with HiveInspectors { +private[orc] class OrcSerializer(dataSchema: StructType, conf: Configuration) + extends HiveInspectors { + + def serialize(row: InternalRow): Writable = { +wrapOrcStruct(cachedOrcStruct, structOI, row) +serializer.serialize(cachedOrcStruct, structOI) + } - private val serializer = { + private[this] val serializer = { val table = new Properties() table.setProperty("columns", dataSchema.fieldNames.mkString(",")) table.setProperty("columns.types", dataSchema.map(_.dataType.catalogString).mkString(":")) val serde = new OrcSerde -val configuration = context.getConfiguration -serde.initialize(configuration, table) +serde.initialize(conf, table) serde } - // Object inspector converted from the schema of the relation to be written. - private val structOI = { + // Object inspector converted from the schema of the relation to be serialized. 
+ private[this] val structOI = { val typeInfo = TypeInfoUtils.getTypeInfoFromTypeString(dataSchema.catalogString) OrcStruct.createObjectInspector(typeInfo.asInstanceOf[StructTypeInfo]) .asInstanceOf[SettableStructObjectInspector] } + private[this] val cachedOrcStruct = structOI.create().asInstanceOf[OrcStruct] + + private[this] def wrapOrcStruct( + struct: OrcStruct, + oi: SettableStructObjectInspector, + row: InternalRow): Unit = { +val fieldRefs = oi.getAllStructFieldRefs +var i = 0 +while (i < fieldRefs.size) { + + oi.setStructFieldData( +struct, +fieldRefs.get(i), +wrap( + row.get(i, dataSchema(i).dataType), + fieldRefs.get(i).getFieldObjectInspector, + dataSchema(i).dataType)) + i += 1 +} + } +} + +private[orc] class OrcOutputWriter( +path: String, +bucketId: Option[Int], +dataSchema: StructType, +context: TaskAttemptContext) + extends OutputWriter { + + private[this] val conf = context.getConfiguration + + private[this] val serializer = new OrcSerializer(dataSchema, conf) + // `OrcRecordWriter.close()` creates an empty file if no rows are written at all. We use this // flag to decide whether `OrcRecordWriter.close()` needs to be called. private var recordWriterInstantiated = false private lazy val recordWriter: RecordWriter[NullWritable, Writable] = { recordWriterInstantiated = true - -val conf = context.getConfiguration val uniqueWriteJobId = conf.get("spark.sql.sources.writeJobUUID") val taskAttemptId = context.getTaskAttemptID val partition = taskAttemptId.getTaskID.getId @@ -206,33 +237,8 @@ private[orc] class OrcOutputWriter( override def write(row: Row): Unit = throw new UnsupportedOperationException("call writeInternal") - private def wrapOrcStruct( -
[GitHub] spark pull request: [SPARK-15280] [Input/Output] Refactored OrcOut...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13066#issuecomment-220805144 Merging to master and branch 2.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15280] [Input/Output] Refactored OrcOut...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13066#issuecomment-220795260 LGTM. I will merge this to master and branch 2.0 once the description is updated. Also regarding "How was this patch tested?", we can say that manual tests and existing tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15280] [Input/Output] Refactored OrcOut...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13066#issuecomment-220795180 @seyfe Can you also update the description (since we are doing a refactoring instead of adding a new public class)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15280] [Input/Output] Refactored OrcOut...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13066#issuecomment-220750857 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13236#issuecomment-220750748 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13236#discussion_r64123606 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala --- @@ -129,6 +129,14 @@ private[spark] object HiveUtils extends Logging { .toSequence .createWithDefault(Nil) + val HIVE_METASTORE_SHARE_HADOOPCLASSES = +SQLConfigBuilder("spark.sql.hive.metastore.shareHadoopClasses") + .doc("When set to true, the classloader used to load Hive metastore client will " + +"not explicitly load Hadoop classes. Instead, Hadoop classes loaded by the classloader " + +"that loads Spark will be used.") + .booleanConf + .createWithDefault(false) --- End diff -- @marmbrus I am setting this conf to false by default, which changes the behavior (our master currently always shares Hadoop classes). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15455] For IsolatedClientLoader, we nee...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/13236 [SPARK-15455] For IsolatedClientLoader, we need to provide a conf to disable sharing Hadoop classes ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark notsharehadoop Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13236.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13236 commit 06168a65816e7a0b01f031c58b0cffbf7306114d Author: Yin Huai <yh...@databricks.com> Date: 2016-05-20T22:12:06Z Add a conf to control if we want to share Hadoop classes for IsolatedClientLoader. commit 4639c8583f8ff7b45c0b4ceb48df2aefa0288691 Author: Yin Huai <yh...@databricks.com> Date: 2016-05-20T22:54:13Z Do not pass hadoop Configuration to HiveClientImpl --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15279][SQL] Catch conflicting SerDe whe...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13068#issuecomment-220728799 Thanks! LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15438] [SQL] improve explain of whole s...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13204#issuecomment-220706346 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14130] [SQL] Throw exceptions for ALTER...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/12714#discussion_r64079773 --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 --- @@ -179,6 +173,11 @@ unsupportedHiveNativeCommands | kw1=ALTER kw2=TABLE tableIdentifier kw3=TOUCH | kw1=ALTER kw2=TABLE tableIdentifier partitionSpec? kw3=COMPACT | kw1=ALTER kw2=TABLE tableIdentifier partitionSpec? kw3=CONCATENATE +| kw1=START kw2=TRANSACTION +| kw1=COMMIT +| kw1=ROLLBACK +| kw1=DFS --- End diff -- Can you explain the reason that we also need to change CLI driver? I guess I am missing some context. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15280] [Input/Output] Refactored OrcOut...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13066#discussion_r64079348 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -167,39 +166,69 @@ private[sql] class DefaultSource } } -private[orc] class OrcOutputWriter( -path: String, -bucketId: Option[Int], -dataSchema: StructType, -context: TaskAttemptContext) - extends OutputWriter with HiveInspectors { +class OrcSerializer(dataSchema: StructType, conf: Configuration) extends HiveInspectors { --- End diff -- Can we add `private[orc]` modifier? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14998][SQL]fix ArrayIndexOutOfBoundsExc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12772#issuecomment-220525633 Here, `A\tB\tC\t\t` represents 5 fields and `A\tB\tC\t` represents 4 fields, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
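As a stand-alone illustration of the field counts in question (my example, not code from the patch): Java/Scala String.split drops trailing empty strings unless a negative limit is passed, which is one common way a shorter-than-expected field array, and hence an ArrayIndexOutOfBoundsException against a fixed-width schema, can arise.

    val fiveFields = "A\tB\tC\t\t"
    val fourFields = "A\tB\tC\t"

    println(fiveFields.split("\t").length)      // 3 -- trailing empty fields dropped
    println(fiveFields.split("\t", -1).length)  // 5 -- all fields kept
    println(fourFields.split("\t", -1).length)  // 4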
[GitHub] spark pull request: [SPARK-14998][SQL]fix ArrayIndexOutOfBoundsExc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12772#issuecomment-220525398 can you add a regression test? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: explain of whole stage codegen
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13204#issuecomment-220525308 I like option 3. We can tell what operators are in a single WholeStageCodeGen operator and we can also tell what the input operators of a WholeStageCodeGen are. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14998][SQL]fix ArrayIndexOutOfBoundsExc...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12772#issuecomment-220499145 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15424][SQL] Revert SPARK-14807 Create a...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13207#issuecomment-220498914 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14261][SQL] Memory leak in Spark Thrift...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12932#issuecomment-220496773 @dosoft Thank you for the PR. What is the memory footprint of a leaked driver? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14261][SQL] Memory leak in Spark Thrift...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12932#issuecomment-220496577 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15421][SQL] Validate DDL property value...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13205#issuecomment-220476667 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15337] [SPARK-15338] [SQL] Enable Run-t...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13128#issuecomment-220411852 Spark confs are immutable. For SQL, the metastore-related confs are also immutable, considering the long-running metastore client that we have. I do think it is reasonable to explicitly mark `spark.sql.warehouse.dir` as immutable. We can make it clear in our docs (for example, have a list of immutable SQL confs in the Spark SQL programming guide). If we want to throw exceptions, it will be good to consider both Spark confs and immutable SQL confs together. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-11206] Support SQL UI on the history se...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/10061#discussion_r63822163 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -96,6 +100,7 @@ private[spark] object JsonProtocol { executorMetricsUpdateToJson(metricsUpdate) case blockUpdated: SparkListenerBlockUpdated => throw new MatchError(blockUpdated) // TODO(ekl) implement this + case _ => parse(mapper.writeValueAsString(event)) --- End diff -- > Events are a public API, and they should be carefully crafted, since changing them affects user applications (including event logs). If there is unnecessary information in the event, then it's a bug in the event definition, not here. Yea. I totally agree. However, my concern is that having this line here will make it harder for developers to spot issues during development. Since the serialization works automatically, a self-review of what will be serialized and which methods will be called during serialization is no longer a mandatory step, which makes auditing much harder. Although handling every event explicitly means more work for the developer, when we review a pull request we can clearly see what will be serialized and how each event is serialized. What do you think? btw, if I am missing any context, please let me know :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
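To illustrate the trade-off being discussed, a simplified sketch with a hypothetical event class (not the JsonProtocol code): with an explicit branch the reviewer sees exactly which fields end up in the JSON, while the generic Jackson fallback serializes whatever the event's public accessors happen to expose.

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule
    import org.json4s.JValue
    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods.parse

    case class MyListenerEvent(reason: String, internalDetail: String) // hypothetical event

    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

    // Explicit, auditable: only the named fields are written.
    def explicitToJson(e: MyListenerEvent): JValue =
      ("Event" -> e.getClass.getName) ~ ("Reason" -> e.reason)

    // Generic fallback: the shape is whatever Jackson derives from the class.
    def fallbackToJson(e: AnyRef): JValue = parse(mapper.writeValueAsString(e))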
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13156#discussion_r63820108 --- Diff: sql/hivecontext-compatibility/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala --- @@ -58,4 +58,16 @@ class HiveContext private[hive]( sparkSession.sharedState.asInstanceOf[HiveSharedState] } + /** + * Invalidate and refresh all the cached metadata of the given table. For performance reasons, + * Spark SQL or the external data source library it uses might cache certain metadata about a + * table, such as the location of blocks. When those change outside of Spark SQL, users should + * call this function to invalidate the cache. + * + * @since 1.3.0 + */ + def refreshTable(tableName: String): Unit = { --- End diff -- This class exists for compatibility purposes. Let's leave it as is. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15337] [SPARK-15338] [SQL] Enable Run-t...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13128#issuecomment-220203789 @gatorsmile For 2.0, all confs should go into spark conf or sql conf. What is the reason to change this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15320] [SQL] Spark-SQL Cli Ignores Para...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13111#issuecomment-220203561 @gatorsmile For 2.0, we will use `spark.sql.warehouse.dir` instead of `hive.metastore.warehouse.dir` (support for the latter has been dropped).
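For illustration, a hedged sketch of the 2.0-style setting mentioned above (the path is an example value, and `enableHiveSupport()` assumes Hive classes are on the classpath).

```scala
import org.apache.spark.sql.SparkSession

// In 2.0 the warehouse location is taken from spark.sql.warehouse.dir rather
// than hive.metastore.warehouse.dir.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()
```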
spark git commit: [SPARK-15192][SQL] null check for SparkSession.createDataFrame
Repository: spark Updated Branches: refs/heads/branch-2.0 36acf8856 -> f5784459e [SPARK-15192][SQL] null check for SparkSession.createDataFrame ## What changes were proposed in this pull request? This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema. ## How was this patch tested? new tests in `DatasetSuite` Author: Wenchen FanCloses #13008 from cloud-fan/row-encoder. (cherry picked from commit ebfe3a1f2c77e6869c3c36ba67afb7fabe6a94f5) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5784459 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5784459 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5784459 Branch: refs/heads/branch-2.0 Commit: f5784459e84e84d6641725bf0d0fb31db742456c Parents: 36acf88 Author: Wenchen Fan Authored: Wed May 18 18:06:38 2016 -0700 Committer: Yin Huai Committed: Wed May 18 18:06:52 2016 -0700 -- .../scala/org/apache/spark/mllib/fpm/FPGrowth.scala| 2 +- .../apache/spark/sql/catalyst/ScalaReflection.scala| 4 ++-- .../spark/sql/catalyst/encoders/RowEncoder.scala | 10 +++--- .../sql/catalyst/expressions/BoundAttribute.scala | 2 +- .../sql/catalyst/expressions/objects/objects.scala | 4 +++- .../main/scala/org/apache/spark/sql/SparkSession.scala | 4 ++-- .../scala/org/apache/spark/sql/api/r/SQLUtils.scala| 5 - .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 13 +++-- .../scala/org/apache/spark/sql/test/SQLTestUtils.scala | 6 +- 9 files changed, 28 insertions(+), 22 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5784459/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala index 9166faa..28e4966 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala @@ -116,7 +116,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] { StructField("freq", LongType)) val schema = StructType(fields) val rowDataRDD = model.freqItemsets.map { x => -Row(x.items, x.freq) +Row(x.items.toSeq, x.freq) } sqlContext.createDataFrame(rowDataRDD, schema).write.parquet(Loader.dataPath(path)) } http://git-wip-us.apache.org/repos/asf/spark/blob/f5784459/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index cb9a62d..c0fa220 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -113,8 +113,8 @@ object ScalaReflection extends ScalaReflection { * Returns true if the value of this data type is same between internal and external. 
*/ def isNativeType(dt: DataType): Boolean = dt match { -case BooleanType | ByteType | ShortType | IntegerType | LongType | - FloatType | DoubleType | BinaryType => true +case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType | BinaryType | CalendarIntervalType => true case _ => false } http://git-wip-us.apache.org/repos/asf/spark/blob/f5784459/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index a5f39aa..71b39c5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -70,8 +70,7 @@ object RowEncoder { private def serializerFor( inputObject: Expression, inputType: DataType): Expression = inputType match { -case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | - FloatType | DoubleType | BinaryType | CalendarIntervalType => inputObject +case dt if ScalaReflection.isNativeType(dt) => inputObject case p: PythonUserDefinedType =>
spark git commit: [SPARK-15192][SQL] null check for SparkSession.createDataFrame
Repository: spark Updated Branches: refs/heads/master 32be51fba -> ebfe3a1f2 [SPARK-15192][SQL] null check for SparkSession.createDataFrame ## What changes were proposed in this pull request? This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema. ## How was this patch tested? new tests in `DatasetSuite` Author: Wenchen FanCloses #13008 from cloud-fan/row-encoder. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebfe3a1f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebfe3a1f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebfe3a1f Branch: refs/heads/master Commit: ebfe3a1f2c77e6869c3c36ba67afb7fabe6a94f5 Parents: 32be51f Author: Wenchen Fan Authored: Wed May 18 18:06:38 2016 -0700 Committer: Yin Huai Committed: Wed May 18 18:06:38 2016 -0700 -- .../scala/org/apache/spark/mllib/fpm/FPGrowth.scala| 2 +- .../apache/spark/sql/catalyst/ScalaReflection.scala| 4 ++-- .../spark/sql/catalyst/encoders/RowEncoder.scala | 10 +++--- .../sql/catalyst/expressions/BoundAttribute.scala | 2 +- .../sql/catalyst/expressions/objects/objects.scala | 4 +++- .../main/scala/org/apache/spark/sql/SparkSession.scala | 4 ++-- .../scala/org/apache/spark/sql/api/r/SQLUtils.scala| 5 - .../test/scala/org/apache/spark/sql/DatasetSuite.scala | 13 +++-- .../scala/org/apache/spark/sql/test/SQLTestUtils.scala | 6 +- 9 files changed, 28 insertions(+), 22 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ebfe3a1f/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala index 9166faa..28e4966 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala @@ -116,7 +116,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] { StructField("freq", LongType)) val schema = StructType(fields) val rowDataRDD = model.freqItemsets.map { x => -Row(x.items, x.freq) +Row(x.items.toSeq, x.freq) } sqlContext.createDataFrame(rowDataRDD, schema).write.parquet(Loader.dataPath(path)) } http://git-wip-us.apache.org/repos/asf/spark/blob/ebfe3a1f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index cb9a62d..c0fa220 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -113,8 +113,8 @@ object ScalaReflection extends ScalaReflection { * Returns true if the value of this data type is same between internal and external. 
*/ def isNativeType(dt: DataType): Boolean = dt match { -case BooleanType | ByteType | ShortType | IntegerType | LongType | - FloatType | DoubleType | BinaryType => true +case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType | BinaryType | CalendarIntervalType => true case _ => false } http://git-wip-us.apache.org/repos/asf/spark/blob/ebfe3a1f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala index a5f39aa..71b39c5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala @@ -70,8 +70,7 @@ object RowEncoder { private def serializerFor( inputObject: Expression, inputType: DataType): Expression = inputType match { -case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | - FloatType | DoubleType | BinaryType | CalendarIntervalType => inputObject +case dt if ScalaReflection.isNativeType(dt) => inputObject case p: PythonUserDefinedType => serializerFor(inputObject, p.sqlType) @@ -151,7 +150,7 @@ object RowEncoder { case StructType(fields) => val
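For illustration, a hedged sketch of the behavior the commit above describes (the exact exception type and message may differ): a null in a field declared non-nullable should now be rejected when the row is encoded, instead of being silently accepted.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("null-check-sketch").getOrCreate()

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

// The id field is null even though the schema declares it non-nullable.
val rows = spark.sparkContext.parallelize(Seq(Row(null, "a")))

// Expected to fail when the rows are encoded (triggered here by collect())
// rather than producing a DataFrame that violates its own schema.
spark.createDataFrame(rows, schema).collect()
```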
[GitHub] spark pull request: [SPARK-15192][SQL] null check for SparkSession...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13008#issuecomment-220201533 Thanks! Merging to master and branch 2.0.
[GitHub] spark pull request: [SPARK-11206] Support SQL UI on the history se...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/10061#discussion_r63808009 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -96,6 +100,7 @@ private[spark] object JsonProtocol { executorMetricsUpdateToJson(metricsUpdate) case blockUpdated: SparkListenerBlockUpdated => throw new MatchError(blockUpdated) // TODO(ekl) implement this + case _ => parse(mapper.writeValueAsString(event)) --- End diff -- It is very possible that we silently pull in unnecessary information. If we have new event types, we should handle those explicitly instead of relying on this line. I am proposing to revert this line.
[GitHub] spark pull request: [SPARK-11206] Support SQL UI on the history se...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/10061#discussion_r63804220 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -96,6 +100,7 @@ private[spark] object JsonProtocol { executorMetricsUpdateToJson(metricsUpdate) case blockUpdated: SparkListenerBlockUpdated => throw new MatchError(blockUpdated) // TODO(ekl) implement this + case _ => parse(mapper.writeValueAsString(event)) --- End diff -- What is the reason to add this line?
[GitHub] spark pull request: [SPARK-14346][SQL] Lists unsupported Hive feat...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13173#discussion_r63732226 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -633,16 +633,16 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman } private def showCreateHiveTable(metadata: CatalogTable): String = { -def reportUnsupportedError(): Unit = { - throw new UnsupportedOperationException( +def reportUnsupportedError(features: Seq[String]): Unit = { + throw new AnalysisException( s"Failed to execute SHOW CREATE TABLE against table ${metadata.identifier.quotedString}, " + - "because it contains table structure(s) (e.g. skewed columns) that Spark SQL doesn't " + - "support yet." + "which is created by Hive and uses the following feature(s) that are not yet supported " + + "by Spark SQL:\n" + features.map(" - " + _).mkString("\n") --- End diff -- It sounds like we are missing some functionality. Can we update the error message?
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13156#discussion_r63729506 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala --- @@ -622,7 +622,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv .mode(SaveMode.Append) .saveAsTable("arrayInParquet") - sessionState.refreshTable("arrayInParquet") + sparkSession.catalog.refreshTable("arrayInParquet") --- End diff -- Actually, invalidateTable and refreshTable do have different meanings. The current implementation of `HiveMetastoreCatalog.refreshTable` is `HiveMetastoreCatalog.invalidateTable` (and then we retrieve the new metadata lazily). But that does not mean `refreshTable` and `invalidateTable` have the same semantics. Whether we should remove either `invalidateTable` or `refreshTable` should be discussed in a different thread.
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13156#discussion_r63727680 --- Diff: sql/hivecontext-compatibility/src/test/scala/org/apache/spark/sql/hive/HiveContextCompatibilitySuite.scala --- @@ -99,4 +105,41 @@ class HiveContextCompatibilitySuite extends SparkFunSuite with BeforeAndAfterEac assert(databases3.toSeq == Seq("default")) } + test("check change after refresh") { --- End diff -- Looks like the `RefreshTable` command is actually doing more work. I think we need to make `RefreshTable` and `sparkSession.catalog.refreshTable` have the same behavior. Can you make that change?
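For reference, a hedged sketch (assuming an existing SparkSession `spark` and a hypothetical table name) of the two code paths being compared above; the point of the comment is that both should end up doing the same work.

```scala
// SQL command path (the RefreshTable command discussed above):
spark.sql("REFRESH TABLE my_parquet_table")

// Catalog API path:
spark.catalog.refreshTable("my_parquet_table")
```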
[GitHub] spark pull request: [SPARK-15192][SQL] null check for SparkSession...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13008#discussion_r63726478 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala --- @@ -116,7 +116,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] { StructField("freq", LongType)) val schema = StructType(fields) val rowDataRDD = model.freqItemsets.map { x => -Row(x.items, x.freq) +Row(x.items.toSeq, x.freq) --- End diff -- Do we need to call `toSeq` here?
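A toy sketch of what the question is about (types simplified, not MLlib's real FP-growth classes); the review question is whether the row-encoding path accepts an Array directly for an ArrayType column, or whether it needs to be converted to a Seq first.

```scala
// Simplified stand-in for an FP-growth frequent itemset.
case class Itemset(items: Array[String], freq: Long)

val x = Itemset(Array("a", "b"), 3L)

val rowWithArray = org.apache.spark.sql.Row(x.items, x.freq)        // Array stored in the Row
val rowWithSeq   = org.apache.spark.sql.Row(x.items.toSeq, x.freq)  // Seq stored in the Row
```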
spark git commit: [SPARK-14346] Fix scala-2.10 build
Repository: spark Updated Branches: refs/heads/branch-2.0 2dddec40d -> 1db37417c [SPARK-14346] Fix scala-2.10 build ## What changes were proposed in this pull request? Scala 2.10 build was broken by #13079. I am reverting the change of that line. Author: Yin Huai <yh...@databricks.com> Closes #13157 from yhuai/SPARK-14346-fix-scala2.10. (cherry picked from commit 2a5db9c140b9d60a5ec91018be19bec7b80850ee) Signed-off-by: Yin Huai <yh...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1db37417 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1db37417 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1db37417 Branch: refs/heads/branch-2.0 Commit: 1db37417c25429c0001c19d2f10f4a314fe4585c Parents: 2dddec4 Author: Yin Huai <yh...@databricks.com> Authored: Tue May 17 18:02:31 2016 -0700 Committer: Yin Huai <yh...@databricks.com> Committed: Tue May 17 18:02:45 2016 -0700 -- .../scala/org/apache/spark/sql/catalyst/catalog/interface.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1db37417/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index d4f5cbb..3fdd411 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -208,6 +208,6 @@ case class SimpleCatalogRelation( } require( -metadata.identifier.database.contains(databaseName), +metadata.identifier.database == Some(databaseName), "provided database does not match the one specified in the table definition") } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-14346] Fix scala-2.10 build
Repository: spark Updated Branches: refs/heads/master 25b315e6c -> 2a5db9c14 [SPARK-14346] Fix scala-2.10 build ## What changes were proposed in this pull request? Scala 2.10 build was broken by #13079. I am reverting the change of that line. Author: Yin Huai <yh...@databricks.com> Closes #13157 from yhuai/SPARK-14346-fix-scala2.10. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a5db9c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a5db9c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a5db9c1 Branch: refs/heads/master Commit: 2a5db9c140b9d60a5ec91018be19bec7b80850ee Parents: 25b315e Author: Yin Huai <yh...@databricks.com> Authored: Tue May 17 18:02:31 2016 -0700 Committer: Yin Huai <yh...@databricks.com> Committed: Tue May 17 18:02:31 2016 -0700 -- .../scala/org/apache/spark/sql/catalyst/catalog/interface.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2a5db9c1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index d4f5cbb..3fdd411 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -208,6 +208,6 @@ case class SimpleCatalogRelation( } require( -metadata.identifier.database.contains(databaseName), +metadata.identifier.database == Some(databaseName), "provided database does not match the one specified in the table definition") } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
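For reference, a sketch of the equivalence behind this revert: `Option.contains` was only added in Scala 2.11, so builds that still target 2.10 need the `== Some(...)` form; for the case at hand the two checks agree.

```scala
val database: Option[String] = Some("default")
val databaseName = "default"

val viaContains = database.contains(databaseName)  // compiles on Scala 2.11+, not 2.10
val viaSomeEq   = database == Some(databaseName)   // compiles on both 2.10 and 2.11
assert(viaContains == viaSomeEq)
```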
[GitHub] spark pull request: [SPARK-14346] Fix scala-2.10 build
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13157#issuecomment-219897800 I am merging this.
[GitHub] spark pull request: [SPARK-14346] Fix scala-2.10 build
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13157#issuecomment-219885469 Tested locally. The fix is good.
[GitHub] spark pull request: [SPARK-14346] Fix scala-2.10 build
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/13157 [SPARK-14346] Fix scala-2.10 build ## What changes were proposed in this pull request? Scala 2.10 build was broken by https://github.com/apache/spark/pull/13079/files. I am reverting the change of that line. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-14346-fix-scala2.10 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13157.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13157 commit 5b64ca3f71aaf796d94537bc08670893afb38c88 Author: Yin Huai <yh...@databricks.com> Date: 2016-05-17T23:18:34Z [SPARK-14346] Fix scala-2.10 build
[GitHub] spark pull request: [SPARK-14346][SQL] Native SHOW CREATE TABLE fo...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13079#discussion_r63620803 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -200,6 +207,7 @@ case class SimpleCatalogRelation( } } - require(metadata.identifier.database == Some(databaseName), + require( +metadata.identifier.database.contains(databaseName), --- End diff -- `contains` is not in Scala 2.10. Let me fix the build.
spark git commit: [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views
Repository: spark Updated Branches: refs/heads/branch-2.0 7b62b7c11 -> 2dddec40d [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views ## What changes were proposed in this pull request? This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc.. ## How was this patch tested? New test cases are added in `ShowCreateTableSuite` to do round-trip tests. Author: Cheng LianCloses #13079 from liancheng/spark-14346-show-create-table-for-hive-tables. (cherry picked from commit b674e67c22bf663334e537e35787c00533adbb04) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2dddec40 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2dddec40 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2dddec40 Branch: refs/heads/branch-2.0 Commit: 2dddec40d6562d1d16bb242bf7dc730431ee1e3e Parents: 7b62b7c Author: Cheng Lian Authored: Tue May 17 15:56:44 2016 -0700 Committer: Yin Huai Committed: Tue May 17 15:56:57 2016 -0700 -- .../spark/sql/catalyst/catalog/interface.scala | 12 +- .../spark/sql/execution/command/tables.scala| 184 +- .../spark/sql/hive/client/HiveClientImpl.scala | 10 +- .../spark/sql/hive/ShowCreateTableSuite.scala | 185 ++- 4 files changed, 333 insertions(+), 58 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2dddec40/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index d215655..d4f5cbb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -79,6 +79,12 @@ case class CatalogTablePartition( * * Note that Hive's metastore also tracks skewed columns. We should consider adding that in the * future once we have a better understanding of how we want to handle skewed columns. + * + * @param hasUnsupportedFeatures is used to indicate whether all table metadata entries retrieved + *from the concrete underlying external catalog (e.g. Hive metastore) are supported by + *Spark SQL. For example, if the underlying Hive table has skewed columns, this information + *can't be mapped to [[CatalogTable]] since Spark SQL doesn't handle skewed columns for now. + *In this case `hasUnsupportedFeatures` is set to true. By default, it is false. 
*/ case class CatalogTable( identifier: TableIdentifier, @@ -95,7 +101,8 @@ case class CatalogTable( properties: Map[String, String] = Map.empty, viewOriginalText: Option[String] = None, viewText: Option[String] = None, -comment: Option[String] = None) { +comment: Option[String] = None, +hasUnsupportedFeatures: Boolean = false) { // Verify that the provided columns are part of the schema private val colNames = schema.map(_.name).toSet @@ -200,6 +207,7 @@ case class SimpleCatalogRelation( } } - require(metadata.identifier.database == Some(databaseName), + require( +metadata.identifier.database.contains(databaseName), "provided database does not match the one specified in the table definition") } http://git-wip-us.apache.org/repos/asf/spark/blob/2dddec40/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index bb4f1ff..1fc02d1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -626,40 +626,149 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman val stmt = if (DDLUtils.isDatasourceTable(tableMetadata)) { showCreateDataSourceTable(tableMetadata) } else { - throw new UnsupportedOperationException( -
spark git commit: [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views
Repository: spark Updated Branches: refs/heads/master 8e8bc9f95 -> b674e67c2 [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views ## What changes were proposed in this pull request? This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc.. ## How was this patch tested? New test cases are added in `ShowCreateTableSuite` to do round-trip tests. Author: Cheng LianCloses #13079 from liancheng/spark-14346-show-create-table-for-hive-tables. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b674e67c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b674e67c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b674e67c Branch: refs/heads/master Commit: b674e67c22bf663334e537e35787c00533adbb04 Parents: 8e8bc9f Author: Cheng Lian Authored: Tue May 17 15:56:44 2016 -0700 Committer: Yin Huai Committed: Tue May 17 15:56:44 2016 -0700 -- .../spark/sql/catalyst/catalog/interface.scala | 12 +- .../spark/sql/execution/command/tables.scala| 184 +- .../spark/sql/hive/client/HiveClientImpl.scala | 10 +- .../spark/sql/hive/ShowCreateTableSuite.scala | 185 ++- 4 files changed, 333 insertions(+), 58 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b674e67c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index d215655..d4f5cbb 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -79,6 +79,12 @@ case class CatalogTablePartition( * * Note that Hive's metastore also tracks skewed columns. We should consider adding that in the * future once we have a better understanding of how we want to handle skewed columns. + * + * @param hasUnsupportedFeatures is used to indicate whether all table metadata entries retrieved + *from the concrete underlying external catalog (e.g. Hive metastore) are supported by + *Spark SQL. For example, if the underlying Hive table has skewed columns, this information + *can't be mapped to [[CatalogTable]] since Spark SQL doesn't handle skewed columns for now. + *In this case `hasUnsupportedFeatures` is set to true. By default, it is false. 
*/ case class CatalogTable( identifier: TableIdentifier, @@ -95,7 +101,8 @@ case class CatalogTable( properties: Map[String, String] = Map.empty, viewOriginalText: Option[String] = None, viewText: Option[String] = None, -comment: Option[String] = None) { +comment: Option[String] = None, +hasUnsupportedFeatures: Boolean = false) { // Verify that the provided columns are part of the schema private val colNames = schema.map(_.name).toSet @@ -200,6 +207,7 @@ case class SimpleCatalogRelation( } } - require(metadata.identifier.database == Some(databaseName), + require( +metadata.identifier.database.contains(databaseName), "provided database does not match the one specified in the table definition") } http://git-wip-us.apache.org/repos/asf/spark/blob/b674e67c/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index bb4f1ff..1fc02d1 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -626,40 +626,149 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman val stmt = if (DDLUtils.isDatasourceTable(tableMetadata)) { showCreateDataSourceTable(tableMetadata) } else { - throw new UnsupportedOperationException( -"SHOW CREATE TABLE only supports Spark SQL data source tables.") + showCreateHiveTable(tableMetadata) }
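For illustration, a hedged usage sketch of the feature added by the commit above (assuming a SparkSession `spark` built with Hive support, and a hypothetical table name); tables that rely on features Spark SQL cannot yet represent (for example, skewed columns) raise an error instead of producing an incomplete statement.

```scala
// Create a simple Hive table and ask Spark to reconstruct its DDL.
spark.sql("CREATE TABLE demo_hive_tbl (id INT, name STRING) STORED AS PARQUET")
spark.sql("SHOW CREATE TABLE demo_hive_tbl").show(false)   // prints the reconstructed statement
```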
[GitHub] spark pull request: [SPARK-14346][SQL] Native SHOW CREATE TABLE fo...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13079#issuecomment-219878464 LGTM. I am merging this to master and 2.0. Let's change the error message and exception class in a separate PR.
[GitHub] spark pull request: [SPARK-14346][SQL] Native SHOW CREATE TABLE fo...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13079#discussion_r63618404 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -626,40 +626,142 @@ case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableComman val stmt = if (DDLUtils.isDatasourceTable(tableMetadata)) { showCreateDataSourceTable(tableMetadata) } else { - throw new UnsupportedOperationException( -"SHOW CREATE TABLE only supports Spark SQL data source tables.") + showCreateHiveTable(tableMetadata) } Seq(Row(stmt)) } + private def showCreateHiveTable(metadata: CatalogTable): String = { +def reportUnsupportedError(): Unit = { + throw new UnsupportedOperationException( --- End diff -- Let's be consistent with other places and change this to AnalysisException.
[GitHub] spark pull request: [SPARK-15317][Core]Don't store accumulators fo...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/13153#issuecomment-219877343 Maybe also put the size comparison in the description?
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13156#discussion_r63613203 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MultiDatabaseSuite.scala --- @@ -202,7 +202,8 @@ class MultiDatabaseSuite extends QueryTest with SQLTestUtils with TestHiveSingle activateDatabase(db) { sql( -s"""CREATE EXTERNAL TABLE t (id BIGINT) +s""" + |CREATE EXTERNAL TABLE t (id BIGINT) --- End diff -- Seems these formatting changes are not necessary?