[GitHub] spark issue #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create Alte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14874 **[Test build #64703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64703/consoleFull)** for PR 14874 at commit [`8480945`](https://github.com/apache/spark/commit/8480945f2cc30972f33f1c55100c4263b83a3497). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bu...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14531#discussion_r76932912

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---

```
@@ -58,18 +63,32 @@ case class CreateTableLikeCommand(
       throw new AnalysisException(
         s"Source table in CREATE TABLE LIKE does not exist: '$sourceTable'")
     }
-    if (catalog.isTemporaryTable(sourceTable)) {
-      throw new AnalysisException(
-        s"Source table in CREATE TABLE LIKE cannot be temporary: '$sourceTable'")
-    }
-    val tableToCreate = catalog.getTableMetadata(sourceTable).copy(
-      identifier = targetTable,
-      tableType = CatalogTableType.MANAGED,
-      createTime = System.currentTimeMillis,
-      lastAccessTime = -1).withNewStorage(locationUri = None)
+    val sourceTableDesc = catalog.getTableMetadata(sourceTable)

-    catalog.createTable(tableToCreate, ifNotExists)
+    val newSerdeProp =
+      if (DDLUtils.isDatasourceTable(sourceTableDesc)) {
+        val newPath = catalog.defaultTablePath(targetTable)
+        sourceTableDesc.storage.properties.filterKeys(_.toLowerCase != "path") ++
+          Map("path" -> newPath)
+      } else {
+        sourceTableDesc.storage.properties
+      }
+    val newStorage = sourceTableDesc.storage.copy(
+      locationUri = None,
+      properties = newSerdeProp)
+
+    val newTableDesc =
+      CatalogTable(
+        identifier = targetTable,
+        tableType = CatalogTableType.MANAGED,
+        storage = newStorage,
+        schema = sourceTableDesc.schema,
+        provider = sourceTableDesc.provider,
```

--- End diff --

uh... You are right! So many things happened in the past 3 weeks. : ) Let me fix it now.
[GitHub] spark issue #14823: [SPARK-17257][SQL] the physical plan of CREATE TABLE or ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14823 I missed this ping. Will review it tomorrow.
[GitHub] spark issue #14859: [SPARK-17200][PROJECT INFRA][BUILD][SparkR] Automate bui...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14859 **[Test build #64702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64702/consoleFull)** for PR 14859 at commit [`b1a5076`](https://github.com/apache/spark/commit/b1a50764dcc71981fdc96e5a4b8d2e208f7692ec).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76932253

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---

```
@@ -140,7 +145,12 @@ private[hive] case class MetastoreRelation(
           sparkSession.sessionState.conf.defaultSizeInBytes
         })
       }
-    )
+    if (catalogTable.catalogStats.isDefined) {
```

--- End diff --

Actually, the `catalogStats` here is already obtained from Hive's numbers in `constructStatsFromHive` below.
[GitHub] spark issue #14889: [SPARK-17326][SPARKR] Fix tests with HiveContext in Spar...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14889 Thanks @HyukjinKwon - This is a great catch. LGTM pending tests.
[GitHub] spark pull request #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] crea...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14874#discussion_r76932176

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---

```
@@ -274,6 +276,75 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }

+  test("should not allow ALTER VIEW AS when the view does not exist") {
+    intercept[NoSuchTableException](
+      sql("ALTER VIEW testView AS SELECT 1, 2")
+    )
+
+    intercept[NoSuchTableException](
+      sql("ALTER VIEW default.testView AS SELECT 1, 2")
+    )
+  }
+
+  test("ALTER VIEW AS should try to alter temp view first if view name has no database part") {
+    withTempView("test_view") {
+      withView("test_view") {
```

--- End diff --

The same here. We just need to change the ordering:
```
withView("test_view") {
  withTempView("test_view") {
```
[GitHub] spark pull request #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] crea...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14874#discussion_r76932113

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---

```
@@ -274,6 +276,75 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }

+  test("should not allow ALTER VIEW AS when the view does not exist") {
+    intercept[NoSuchTableException](
+      sql("ALTER VIEW testView AS SELECT 1, 2")
+    )
+
+    intercept[NoSuchTableException](
+      sql("ALTER VIEW default.testView AS SELECT 1, 2")
+    )
+  }
+
+  test("ALTER VIEW AS should try to alter temp view first if view name has no database part") {
+    withTempView("test_view") {
+      withView("test_view") {
+        sql("CREATE VIEW test_view AS SELECT 1 AS a, 2 AS b")
+        sql("CREATE TEMP VIEW test_view AS SELECT 1 AS a, 2 AS b")
+
+        sql("ALTER VIEW test_view AS SELECT 3 AS i, 4 AS j")
+
+        // The temporary view should be updated.
+        checkAnswer(spark.table("test_view"), Row(3, 4))
+
+        // The permanent view should stay same.
+        checkAnswer(spark.table("default.test_view"), Row(1, 2))
+      }
+    }
+  }
+
+  test("ALTER VIEW AS should alter permanent view if view name has database part") {
+    withTempView("test_view") {
+      withView("test_view") {
```

--- End diff --

Based on my understanding, this will drop the temporary view (because of the resolution preference of `DROP VIEW`), and then `withTempView` is unable to find any temporary view.
[GitHub] spark issue #14859: [SPARK-17200][PROJECT INFRA][BUILD][SparkR] Automate bui...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14859 HiveContext tests with SparkR are already being skipped due to https://github.com/apache/spark/pull/14889. I manually fixed this and tested it here: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123
[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13873 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64698/ Test PASSed.
[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13873 Merged build finished. Test PASSed.
[GitHub] spark pull request #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] crea...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14874#discussion_r76931872

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala ---

```
@@ -274,6 +276,75 @@ class SQLViewSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }

+  test("should not allow ALTER VIEW AS when the view does not exist") {
+    intercept[NoSuchTableException](
+      sql("ALTER VIEW testView AS SELECT 1, 2")
+    )
+
+    intercept[NoSuchTableException](
+      sql("ALTER VIEW default.testView AS SELECT 1, 2")
+    )
+  }
+
+  test("ALTER VIEW AS should try to alter temp view first if view name has no database part") {
+    withTempView("test_view") {
+      withView("test_view") {
+        sql("CREATE VIEW test_view AS SELECT 1 AS a, 2 AS b")
+        sql("CREATE TEMP VIEW test_view AS SELECT 1 AS a, 2 AS b")
+
+        sql("ALTER VIEW test_view AS SELECT 3 AS i, 4 AS j")
+
+        // The temporary view should be updated.
+        checkAnswer(spark.table("test_view"), Row(3, 4))
+
+        // The permanent view should stay same.
+        checkAnswer(spark.table("default.test_view"), Row(1, 2))
+      }
+    }
+  }
+
+  test("ALTER VIEW AS should alter permanent view if view name has database part") {
+    withTempView("test_view") {
+      withView("test_view") {
+        sql("CREATE VIEW test_view AS SELECT 1 AS a, 2 AS b")
+        sql("CREATE TEMP VIEW test_view AS SELECT 1 AS a, 2 AS b")
+
+        sql("ALTER VIEW default.test_view AS SELECT 3 AS i, 4 AS j")
+
+        // The temporary view should stay same.
+        checkAnswer(spark.table("test_view"), Row(1, 2))
+
+        // The permanent view should be updated.
+        checkAnswer(spark.table("default.test_view"), Row(3, 4))
+      }
+    }
+  }
+
+  test("ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.") {
+    withTempView("test_view") {
```

--- End diff --

`test_view` is not a temporary view, right?
[GitHub] spark issue #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create Alte...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14874 LGTM except one minor comment.
[GitHub] spark issue #14889: [SPARK-17326][SPARKR] Fix tests with HiveContext in Spar...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14889 **[Test build #64700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64700/consoleFull)** for PR 14889 at commit [`2cdcf4f`](https://github.com/apache/spark/commit/2cdcf4f17fd6023d35852f524e2826cc685814dd).
[GitHub] spark issue #14862: [SPARK-17295][SQL] Create TestHiveSessionState use refle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14862 **[Test build #64701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64701/consoleFull)** for PR 14862 at commit [`714e3a9`](https://github.com/apache/spark/commit/714e3a99c8af857f6ec275bba97160d5bd5d998c).
[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13873 **[Test build #64698 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64698/consoleFull)** for PR 13873 at commit [`b22867b`](https://github.com/apache/spark/commit/b22867b365dc679b71f8b7df8ce3516382f9f119).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76931629

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---

```
@@ -235,14 +235,18 @@ class SessionCatalog(
    * Note: If the underlying implementation does not support altering a certain field,
    * this becomes a no-op.
    */
-  def alterTable(tableDefinition: CatalogTable): Unit = {
+  def alterTable(tableDefinition: CatalogTable, fromAnalyze: Boolean = false): Unit = {
```

--- End diff --

I'll fix this, thank you.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76931067

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---

```
@@ -140,7 +145,12 @@ private[hive] case class MetastoreRelation(
           sparkSession.sessionState.conf.defaultSizeInBytes
         })
       }
-    )
+    if (catalogTable.catalogStats.isDefined) {
```

--- End diff --

When `catalogStats` is defined, why do we still use Hive's numbers?
[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14883 After reading the code, it sounds like adding a session-scoped ADD JAR is not simple if we want to pass it to every worker node after the PR: https://github.com/apache/spark/pull/8909 CC @davies
[GitHub] spark issue #14889: [SPARK-17326][SPARKR] Fix tests with HiveContext in Spar...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14889 cc @rxin, @felixcheung and @shivaram
[GitHub] spark pull request #14889: [SPARK-17326][SPARKR] Fix tests with HiveContext ...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/14889

[SPARK-17326][SPARKR] Fix tests with HiveContext in SparkR not to be skipped always

## What changes were proposed in this pull request?

Currently, `HiveContext` in SparkR is not being tested and is always skipped. This is because the initiation of `TestHiveContext` fails while trying to load non-existing data paths (test tables). This was introduced by https://github.com/apache/spark/pull/14005.

This enables the tests with SparkR.

## How was this patch tested?

Manually,

**Before** (on Mac OS)

```
...
Skipped
1. create DataFrame from RDD (@test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (@test_sparkSQL.R#1041) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (@test_sparkSQL.R#1748) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (@test_sparkSQL.R#2480) - Hive is not build with SparkSQL, skipped
...
```

**After** (on Mac OS)

```
...
Skipped
1. sparkJars tag in SparkContext (@test_Windows.R#21) - This test is only for Windows, skipped
...
```

Please refer to the tests below (on Windows):

- Before: https://ci.appveyor.com/project/HyukjinKwon/spark/build/45-test123
- After: https://ci.appveyor.com/project/HyukjinKwon/spark/build/46-test123

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17326

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14889.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14889

commit 2cdcf4f17fd6023d35852f524e2826cc685814dd
Author: hyukjinkwon
Date: 2016-08-31T06:32:29Z

    Tests with HiveContext in SparkR being skipped always
[GitHub] spark pull request #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14531#discussion_r76930695

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---

```
@@ -58,18 +63,32 @@ case class CreateTableLikeCommand(
       throw new AnalysisException(
         s"Source table in CREATE TABLE LIKE does not exist: '$sourceTable'")
     }
-    if (catalog.isTemporaryTable(sourceTable)) {
-      throw new AnalysisException(
-        s"Source table in CREATE TABLE LIKE cannot be temporary: '$sourceTable'")
-    }
-    val tableToCreate = catalog.getTableMetadata(sourceTable).copy(
-      identifier = targetTable,
-      tableType = CatalogTableType.MANAGED,
-      createTime = System.currentTimeMillis,
-      lastAccessTime = -1).withNewStorage(locationUri = None)
+    val sourceTableDesc = catalog.getTableMetadata(sourceTable)

-    catalog.createTable(tableToCreate, ifNotExists)
+    val newSerdeProp =
+      if (DDLUtils.isDatasourceTable(sourceTableDesc)) {
+        val newPath = catalog.defaultTablePath(targetTable)
+        sourceTableDesc.storage.properties.filterKeys(_.toLowerCase != "path") ++
+          Map("path" -> newPath)
+      } else {
+        sourceTableDesc.storage.properties
+      }
+    val newStorage = sourceTableDesc.storage.copy(
+      locationUri = None,
+      properties = newSerdeProp)
+
+    val newTableDesc =
+      CatalogTable(
+        identifier = targetTable,
+        tableType = CatalogTableType.MANAGED,
+        storage = newStorage,
+        schema = sourceTableDesc.schema,
+        provider = sourceTableDesc.provider,
```

--- End diff --

permanent view too
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76930682

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---

```
@@ -140,7 +145,12 @@ private[hive] case class MetastoreRelation(
           sparkSession.sessionState.conf.defaultSizeInBytes
         })
       }
-    )
+    if (catalogTable.catalogStats.isDefined) {
```

--- End diff --

We can skip the above computation of `sizeInBytes` if `catalogStats` is defined.
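The suggestion above can be sketched as follows (a hypothetical illustration only, not the PR's actual code; `effectiveSizeInBytes` and its parameters are made-up names, and the by-name `estimatedSize` parameter stands in for the expensive size computation in `MetastoreRelation`):

```scala
// Hypothetical sketch: prefer statistics recorded in the catalog (e.g. by
// ANALYZE TABLE) and evaluate the estimated size only when they are absent.
// The by-name parameter means the fallback computation is skipped entirely
// whenever catalog statistics exist.
def effectiveSizeInBytes(catalogStats: Option[BigInt], estimatedSize: => BigInt): BigInt =
  catalogStats.getOrElse(estimatedSize)
```

Because the fallback is lazy, `effectiveSizeInBytes(Some(BigInt(100)), expensiveScan())` never runs `expensiveScan()`.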
[GitHub] spark pull request #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bu...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14531#discussion_r76930187

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---

```
@@ -58,18 +63,32 @@ case class CreateTableLikeCommand(
       throw new AnalysisException(
         s"Source table in CREATE TABLE LIKE does not exist: '$sourceTable'")
     }
-    if (catalog.isTemporaryTable(sourceTable)) {
-      throw new AnalysisException(
-        s"Source table in CREATE TABLE LIKE cannot be temporary: '$sourceTable'")
-    }
-    val tableToCreate = catalog.getTableMetadata(sourceTable).copy(
-      identifier = targetTable,
-      tableType = CatalogTableType.MANAGED,
-      createTime = System.currentTimeMillis,
-      lastAccessTime = -1).withNewStorage(locationUri = None)
+    val sourceTableDesc = catalog.getTableMetadata(sourceTable)

-    catalog.createTable(tableToCreate, ifNotExists)
+    val newSerdeProp =
+      if (DDLUtils.isDatasourceTable(sourceTableDesc)) {
+        val newPath = catalog.defaultTablePath(targetTable)
+        sourceTableDesc.storage.properties.filterKeys(_.toLowerCase != "path") ++
+          Map("path" -> newPath)
+      } else {
+        sourceTableDesc.storage.properties
+      }
+    val newStorage = sourceTableDesc.storage.copy(
+      locationUri = None,
+      properties = newSerdeProp)
+
+    val newTableDesc =
+      CatalogTable(
+        identifier = targetTable,
+        tableType = CatalogTableType.MANAGED,
+        storage = newStorage,
+        schema = sourceTableDesc.schema,
+        provider = sourceTableDesc.provider,
```

--- End diff --

If the source table is a temp view, the provider is `None`; we should set a default provider here.
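The defaulting suggested above could look roughly like this (a hypothetical sketch; `providerForCopy` is a made-up helper, and the default provider name would in practice come from the session's data source configuration rather than a literal):

```scala
// Hypothetical sketch: a temporary-view source carries no provider, so fall
// back to a session-default data source name when copying the table metadata.
def providerForCopy(sourceProvider: Option[String], defaultProvider: String): Option[String] =
  sourceProvider.orElse(Some(defaultProvider))
```

With this, `providerForCopy(None, "parquet")` yields `Some("parquet")`, while an existing provider such as `Some("orc")` is preserved.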
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64697/ Test PASSed.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Merged build finished. Test PASSed.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14710 **[Test build #64697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64697/consoleFull)** for PR 14710 at commit [`5a2f30f`](https://github.com/apache/spark/commit/5a2f30f7a31bd8edba1932cabcaf71332837b92d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14883 @cloud-fan @viirya After reading Hive's code, the JAR's scope should be session-based. See the code: https://github.com/apache/hive/blob/0438701395161325a429b4fd8211213276aa0fef/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L1188-L1200 Let me think about how to fix it.
[GitHub] spark issue #14783: SPARK-16785 R dapply doesn't return array or raw columns
Github user clarkfitzg commented on the issue: https://github.com/apache/spark/pull/14783 Yes, this is only for a bug fix. @shivaram mentioned in a previous email exchange it would be good to see some performance benchmarks as well.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76929362 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -401,6 +401,13 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat } } + override def alterTableStats(tableDefinition: CatalogTable): Unit = withClient { --- End diff -- I see. OK, I'll remove this.
[GitHub] spark issue #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in C...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14531 **[Test build #64699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64699/consoleFull)** for PR 14531 at commit [`eabf31f`](https://github.com/apache/spark/commit/eabf31fdc1b9491bca0f051808e7db0c1b6e12d3).
[GitHub] spark issue #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in C...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/14531 Let me do it now.
[GitHub] spark issue #14873: [SPARK-17308]Improved the spark core code by replacing a...
Github user shiv4nsh commented on the issue: https://github.com/apache/spark/pull/14873 @srowen: Are we good to merge this, or does this PR require some additional changes?
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64696/
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14868 Merged build finished. Test PASSed.
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14868 **[Test build #64696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64696/consoleFull)** for PR 14868 at commit [`3f08c02`](https://github.com/apache/spark/commit/3f08c027add03c59251583420c76582a085b3573). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14888: [SPARK-17324] [SQL] Remove Direct Usage of HiveClient in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14888 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64694/
[GitHub] spark issue #14888: [SPARK-17324] [SQL] Remove Direct Usage of HiveClient in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14888 Merged build finished. Test PASSed.
[GitHub] spark issue #14888: [SPARK-17324] [SQL] Remove Direct Usage of HiveClient in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14888 **[Test build #64694 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64694/consoleFull)** for PR 14888 at commit [`d03e65d`](https://github.com/apache/spark/commit/d03e65d0f9b119ed767da124da360cfcf9e966b8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14868 Merged build finished. Test PASSed.
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64695/
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14868 **[Test build #64695 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64695/consoleFull)** for PR 14868 at commit [`bc70a00`](https://github.com/apache/spark/commit/bc70a0023bb24175c06c03cb7acad7f9a6d34e36). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76926260 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +169,81 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + private def checkMetastoreRelationStats( + tableName: String, + totalSize: Long, + rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation => + rel.statistics + assert(rel.statistics.sizeInBytes === totalSize) + assert(rel.statistics.rowCount === rowCount) +} +assert(relations.size === 1) + } + + private def checkLogicalRelationStats(tableName: String, rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: LogicalRelation => + assert(rel.statistics.sizeInBytes === rel.relation.sizeInBytes) --- End diff -- If this is the reason, you need to leave a TODO task in the code. Otherwise, we might forget it.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76926133 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -401,6 +401,13 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat } } + override def alterTableStats(tableDefinition: CatalogTable): Unit = withClient { --- End diff -- Yeah, agree with @cloud-fan. If we want to set it using `alter table`, we should use a dedicated command (just like what Hive does):

```sql
ALTER TABLE UPDATE STATISTICS SET
```

Let us remove `alterTableStats` and minimize the code changes. We can discuss how to do it properly when we start this JIRA: https://issues.apache.org/jira/browse/SPARK-17282
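Until a dedicated statistics command exists, the behavior the reviewers suggest (rejecting user-set statistics keys in `ALTER TABLE ... SET TBLPROPERTIES`) could be sketched as below. The property key names are assumptions for illustration, not necessarily Spark's actual reserved keys.

```scala
// Sketch: reject reserved statistics keys passed via table properties.
// Key names are illustrative assumptions, not Spark's actual reserved keys.
object StatsPropertyGuard {
  val reservedStatsKeys: Set[String] =
    Set("spark.sql.statistics.totalSize", "spark.sql.statistics.numRows")

  // Throws if the user tries to set a reserved statistics property directly.
  def validate(props: Map[String, String]): Unit = {
    val bad = props.keySet.intersect(reservedStatsKeys)
    if (bad.nonEmpty) {
      throw new IllegalArgumentException(
        s"Cannot set reserved statistics properties via ALTER TABLE: ${bad.mkString(", ")}")
    }
  }
}

StatsPropertyGuard.validate(Map("comment" -> "ok")) // passes
```

The validation would run inside the `ALTER TABLE` command before the catalog call, keeping `alterTableStats` out of the external catalog API as the review proposes.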
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13252 Merged build finished. Test PASSed.
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13252 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64693/
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13252 **[Test build #64693 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64693/consoleFull)** for PR 13252 at commit [`031c9da`](https://github.com/apache/spark/commit/031c9dacba77c6197626d02ceb0e1081b18e187b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76924362 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -401,6 +401,13 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat } } + override def alterTableStats(tableDefinition: CatalogTable): Unit = withClient { --- End diff -- I'm not sure we want to support the second way to set properties. If users set them with ALTER TABLE, we should throw an exception. cc @yhuai @gatorsmile what do you think?
[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14883 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources Looks like it is.
[GitHub] spark pull request #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionS...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14883#discussion_r76923998 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -171,6 +171,7 @@ private[sql] class SessionState(sparkSession: SparkSession) { } def addJar(path: String): Unit = { --- End diff -- hmm, I think the addition of resources should be session-scoped.
[GitHub] spark pull request #13704: [SPARK-15985][SQL] Eliminate redundant cast from ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13704
[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13873 **[Test build #64698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64698/consoleFull)** for PR 13873 at commit [`b22867b`](https://github.com/apache/spark/commit/b22867b365dc679b71f8b7df8ce3516382f9f119).
[GitHub] spark pull request #10225: [SPARK-12196][Core] Store/retrieve blocks in diff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10225#discussion_r76923680 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -50,35 +50,98 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea private val shutdownHook = addShutdownHook() + private abstract class FileAllocationStrategy { +def apply(filename: String): File + +protected def getFile(filename: String, storageDirs: Array[File]): File = { + require(storageDirs.nonEmpty, "could not find file when the directories are empty") + + // Figure out which local directory it hashes to, and which subdirectory in that + val hash = Utils.nonNegativeHash(filename) + val dirId = localDirs.indexOf(storageDirs(hash % storageDirs.length)) + val subDirId = (hash / storageDirs.length) % subDirsPerLocalDir + + // Create the subdirectory if it doesn't already exist + val subDir = subDirs(dirId).synchronized { +val old = subDirs(dirId)(subDirId) +if (old != null) { + old +} else { + val newDir = new File(localDirs(dirId), "%02x".format(subDirId)) + if (!newDir.exists() && !newDir.mkdir()) { +throw new IOException(s"Failed to create local dir in $newDir.") + } + subDirs(dirId)(subDirId) = newDir + newDir +} + } + + new File(subDir, filename) +} + } + /** Looks up a file by hashing it into one of our local subdirectories. */ // This method should be kept in sync with // org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getFile(). 
- def getFile(filename: String): File = { -// Figure out which local directory it hashes to, and which subdirectory in that -val hash = Utils.nonNegativeHash(filename) -val dirId = hash % localDirs.length -val subDirId = (hash / localDirs.length) % subDirsPerLocalDir - -// Create the subdirectory if it doesn't already exist -val subDir = subDirs(dirId).synchronized { - val old = subDirs(dirId)(subDirId) - if (old != null) { -old - } else { -val newDir = new File(localDirs(dirId), "%02x".format(subDirId)) -if (!newDir.exists() && !newDir.mkdir()) { - throw new IOException(s"Failed to create local dir in $newDir.") -} -subDirs(dirId)(subDirId) = newDir -newDir + private object hashAllocator extends FileAllocationStrategy { +def apply(filename: String): File = getFile(filename, localDirs) + } + + /** Looks up a file by hierarchy way in different speed storage devices. */ + private val hierarchyStore = conf.getOption("spark.storage.hierarchyStore") + private class HierarchyAllocator extends FileAllocationStrategy { +case class LayerInfo(key: String, threshold: Long, dirs: Array[File]) +val hsSpecs: Array[(String, Long)] = + // e.g.: hierarchyStore = "ssd 200GB, hdd 100GB" + hierarchyStore.get.trim.split(",").map { +s => val x = s.trim.split(" +") + (x(0).toLowerCase, Utils.byteStringAsBytes(x(1))) } +val hsLayers: Array[LayerInfo] = hsSpecs.map( + s => LayerInfo(s._1, s._2, localDirs.filter(_.getPath.toLowerCase.containsSlice(s._1))) +) +val lastLayerDirs = localDirs.filter(dir => !hsLayers.exists(_.dirs.contains(dir))) +val allLayers: Array[LayerInfo] = hsLayers :+ + LayerInfo("Last Storage", 10.toLong, lastLayerDirs) +val finalLayers: Array[LayerInfo] = allLayers.filter(_.dirs.nonEmpty) +logInfo("Hierarchy store info:") +for (layer <- finalLayers) { + logInfo("Layer: %s, Threshold: %s".format(layer.key, Utils.bytesToString(layer.threshold))) + layer.dirs.foreach { dir => logInfo("\t%s".format(dir.getCanonicalPath)) } } -new File(subDir, filename) +def apply(filename: 
String): File = { + var availableFile: File = null + for (layer <- finalLayers) { --- End diff -- Once you get `availableFile`, you can stop this loop early to prevent creating useless subdirs.
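The early-exit viirya suggests — stop scanning layers as soon as one yields a usable file, so later layers never create unused subdirectories — can be sketched as follows. `Layer` and `tryAllocate` are illustrative stand-ins for the PR's `finalLayers` loop, not the actual types in `DiskBlockManager`.

```scala
// Illustrative model of hierarchical storage layers (names are assumptions).
case class Layer(key: String, freeBytes: Long, threshold: Long)

// Pretend allocation: a layer can host the file only if it has headroom.
def tryAllocate(layer: Layer, filename: String): Option[String] =
  if (layer.freeBytes > layer.threshold) Some(s"${layer.key}/$filename") else None

// Lazy scan: the iterator stops at the first layer that succeeds,
// so no work is done for layers after the winner.
def allocate(layers: Seq[Layer], filename: String): Option[String] =
  layers.iterator.map(tryAllocate(_, filename)).collectFirst { case Some(f) => f }

val layers = Seq(Layer("ssd", 0L, 10L), Layer("hdd", 100L, 10L))
assert(allocate(layers, "block1") == Some("hdd/block1"))
```

In the PR's imperative loop the equivalent fix is a `return availableFile` (or a guard condition) once the first candidate is found.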
[GitHub] spark issue #13704: [SPARK-15985][SQL] Eliminate redundant cast from an arra...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13704 thanks, merging to master!
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76923549 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -21,25 +21,55 @@ import scala.util.control.NonFatal import org.apache.hadoop.fs.{FileSystem, Path} -import org.apache.spark.sql.{AnalysisException, Row, SparkSession} +import org.apache.spark.sql.{AnalysisException, Dataset, Row, SparkSession} import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.Statistics +import org.apache.spark.sql.execution.datasources.LogicalRelation /** * Analyzes the given table in the current database to generate statistics, which will be * used in query optimizations. - * - * Right now, it only supports Hive tables and it only updates the size of a Hive table - * in the Hive metastore. */ -case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { +case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand { override def run(sparkSession: SparkSession): Seq[Row] = { val sessionState = sparkSession.sessionState val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName) val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent)) +def updateTableStats( +catalogTable: CatalogTable, +oldTotalSize: Long, +oldRowCount: Long, +newTotalSize: Long): Unit = { + + var newStats: Option[Statistics] = None + if (newTotalSize > 0 && newTotalSize != oldTotalSize) { +newStats = Some(Statistics(sizeInBytes = newTotalSize)) + } + if (!noscan) { +val newRowCount = Dataset.ofRows(sparkSession, relation).count() +if (newRowCount >= 0 && newRowCount != oldRowCount) { + newStats = if (newStats.isDefined) { +newStats.map(_.copy(rowCount = Some(BigInt(newRowCount + } else { 
+Some(Statistics(sizeInBytes = oldTotalSize, rowCount = Some(BigInt(newRowCount + } +} + } + // Update the metastore if the above statistics of the table are different from those + // recorded in the metastore. + if (newStats.isDefined) { +sessionState.catalog.alterTable( + catalogTable.copy(catalogStats = newStats), fromAnalyze = true) + +// Refresh the cache of the table in the catalog. --- End diff -- This comment is confusing. We have two caches: one is the data cache, the other is the logical plan cache for data source tables.
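The conditional-update flow under review (recompute statistics, compare against the recorded values, and write back to the metastore only on change) can be condensed into a small sketch. `Stats` and its fields are simplified stand-ins, not Spark's actual `Statistics` class.

```scala
// Simplified stand-in for the statistics recorded in the metastore.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

// Return new statistics only when they differ from the recorded ones,
// mirroring the "update only if changed" logic in the ANALYZE command.
def newStatsIfChanged(recorded: Stats,
                      newTotalSize: Long,
                      newRowCount: Option[Long]): Option[Stats] = {
  val sizeChanged = newTotalSize > 0 && BigInt(newTotalSize) != recorded.sizeInBytes
  val rowsChanged = newRowCount.exists(r => !recorded.rowCount.contains(BigInt(r)))
  if (sizeChanged || rowsChanged) {
    val size = if (newTotalSize > 0) BigInt(newTotalSize) else recorded.sizeInBytes
    Some(Stats(size, newRowCount.map(BigInt(_)).orElse(recorded.rowCount)))
  } else None
}

// Unchanged size and no row count: nothing to write back.
assert(newStatsIfChanged(Stats(100), 100, None).isEmpty)
```

Only when this returns `Some(...)` would the command call `alterTable` and refresh the cached plan, which is the step gatorsmile's comment asks to document more precisely.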
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76923392 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -21,25 +21,55 @@ import scala.util.control.NonFatal import org.apache.hadoop.fs.{FileSystem, Path} -import org.apache.spark.sql.{AnalysisException, Row, SparkSession} +import org.apache.spark.sql.{AnalysisException, Dataset, Row, SparkSession} import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} +import org.apache.spark.sql.catalyst.plans.logical.Statistics +import org.apache.spark.sql.execution.datasources.LogicalRelation /** * Analyzes the given table in the current database to generate statistics, which will be * used in query optimizations. - * - * Right now, it only supports Hive tables and it only updates the size of a Hive table - * in the Hive metastore. */ -case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { +case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand { override def run(sparkSession: SparkSession): Seq[Row] = { val sessionState = sparkSession.sessionState val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName) val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent)) +def updateTableStats( --- End diff -- uh, this function interrupts the whole flow. Maybe you can move it out of this `run` function?
[GitHub] spark issue #14531: [SPARK-16943] [SPARK-16942] [SQL] Fix multiple bugs in C...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14531 can you resolve the conflicts? thanks!
[GitHub] spark issue #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionState to...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14883 @gatorsmile is the `ADD JAR` command in Hive session-scoped? Our current implementation may be wrong...
[GitHub] spark pull request #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionS...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14883#discussion_r76922568 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala --- @@ -171,6 +171,7 @@ private[sql] class SessionState(sparkSession: SparkSession) { } def addJar(path: String): Unit = { --- End diff -- so the `addJar` is not session-scoped by definition? I think `sparkSession.sparkContext.addJar(path)` is also cross-session
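The scoping concern can be made concrete with a small Python sketch (hypothetical class names, not Spark's code): when every session forwards `addJar` to one shared context, the jar becomes visible to all sessions, so the operation is cross-session by construction, whatever the per-session API suggests.

```python
# Illustrative model of the session-vs-shared scoping question.

class SharedContext:
    """Stands in for the one SparkContext shared by every session."""
    def __init__(self):
        self.jars = []          # shared across every session
    def add_jar(self, path):
        self.jars.append(path)

class Session:
    """Stands in for a per-user session that delegates to the context."""
    def __init__(self, ctx):
        self.ctx = ctx
        self.local_jars = []    # what a truly session-scoped design would use
    def add_jar(self, path):
        self.ctx.add_jar(path)  # delegating here makes the jar cross-session
        self.local_jars.append(path)

ctx = SharedContext()
s1, s2 = Session(ctx), Session(ctx)
s1.add_jar("udf.jar")
# s2 never called add_jar, yet the jar is visible through the shared context.
```

A genuinely session-scoped design would have to keep the resource list on the session and resolve jars per session instead of delegating to the shared context.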
[GitHub] spark pull request #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionS...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14883#discussion_r76922145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala --- @@ -509,4 +509,10 @@ class InMemoryCatalog( StringUtils.filterPattern(catalog(db).functions.keysIterator.toSeq, pattern) } + // -- + // Resources + // -- + + override def addJar(path: String): Unit = { /* no-op */ } --- End diff -- Yeah, I think throwing an exception is better.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14750 Merged build finished. Test PASSed.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64691/ Test PASSed.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76922102 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +169,81 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + private def checkMetastoreRelationStats( + tableName: String, + totalSize: Long, + rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation => + rel.statistics + assert(rel.statistics.sizeInBytes === totalSize) + assert(rel.statistics.rowCount === rowCount) +} +assert(relations.size === 1) + } + + private def checkLogicalRelationStats(tableName: String, rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: LogicalRelation => + assert(rel.statistics.sizeInBytes === rel.relation.sizeInBytes) --- End diff -- btw, spark 2.0 has some bugs on Windows during tests, mainly about paths.
[GitHub] spark issue #14750: [SPARK-17183][SQL] put hive serde table schema to table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14750 **[Test build #64691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64691/consoleFull)** for PR 14750 at commit [`52db0ed`](https://github.com/apache/spark/commit/52db0ed8dd7d9bb3f201b648a999068597942d26). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #10225: [SPARK-12196][Core] Store/retrieve blocks in diff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/10225#discussion_r76922041 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala --- @@ -50,35 +50,98 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea private val shutdownHook = addShutdownHook() + private abstract class FileAllocationStrategy { +def apply(filename: String): File + +protected def getFile(filename: String, storageDirs: Array[File]): File = { + require(storageDirs.nonEmpty, "could not find file when the directories are empty") + + // Figure out which local directory it hashes to, and which subdirectory in that + val hash = Utils.nonNegativeHash(filename) + val dirId = localDirs.indexOf(storageDirs(hash % storageDirs.length)) + val subDirId = (hash / storageDirs.length) % subDirsPerLocalDir + + // Create the subdirectory if it doesn't already exist + val subDir = subDirs(dirId).synchronized { +val old = subDirs(dirId)(subDirId) +if (old != null) { + old +} else { + val newDir = new File(localDirs(dirId), "%02x".format(subDirId)) + if (!newDir.exists() && !newDir.mkdir()) { +throw new IOException(s"Failed to create local dir in $newDir.") + } + subDirs(dirId)(subDirId) = newDir + newDir +} + } + + new File(subDir, filename) +} + } + /** Looks up a file by hashing it into one of our local subdirectories. */ // This method should be kept in sync with // org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getFile(). 
- def getFile(filename: String): File = { -// Figure out which local directory it hashes to, and which subdirectory in that -val hash = Utils.nonNegativeHash(filename) -val dirId = hash % localDirs.length -val subDirId = (hash / localDirs.length) % subDirsPerLocalDir - -// Create the subdirectory if it doesn't already exist -val subDir = subDirs(dirId).synchronized { - val old = subDirs(dirId)(subDirId) - if (old != null) { -old - } else { -val newDir = new File(localDirs(dirId), "%02x".format(subDirId)) -if (!newDir.exists() && !newDir.mkdir()) { - throw new IOException(s"Failed to create local dir in $newDir.") -} -subDirs(dirId)(subDirId) = newDir -newDir + private object hashAllocator extends FileAllocationStrategy { +def apply(filename: String): File = getFile(filename, localDirs) + } + + /** Looks up a file by hierarchy way in different speed storage devices. */ + private val hierarchyStore = conf.getOption("spark.storage.hierarchyStore") + private class HierarchyAllocator extends FileAllocationStrategy { +case class LayerInfo(key: String, threshold: Long, dirs: Array[File]) +val hsSpecs: Array[(String, Long)] = + // e.g.: hierarchyStore = "ssd 200GB, hdd 100GB" + hierarchyStore.get.trim.split(",").map { +s => val x = s.trim.split(" +") + (x(0).toLowerCase, Utils.byteStringAsBytes(x(1))) --- End diff -- It would be better to add error handling here to reject a malformed format.
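The error handling being requested could look like the following Python sketch. It parses the `"ssd 200GB, hdd 100GB"` format shown in the diff; the function name and behavior are illustrative, not the actual patch. The point is to validate each entry up front and fail with a clear message, instead of letting a malformed entry surface later as an index or number-format error deep inside block allocation.

```python
import re

# Fail-fast parser for a hierarchy-store spec such as "ssd 200GB, hdd 100GB".
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def parse_hierarchy_spec(spec):
    """Return [(layer_name, threshold_bytes), ...] or raise ValueError."""
    layers = []
    for entry in spec.split(","):
        entry = entry.strip()
        # Expect: <name> <integer><unit>, e.g. "ssd 200GB"
        m = re.fullmatch(r"(\w+)\s+(\d+)\s*([a-zA-Z]+)", entry)
        if m is None:
            raise ValueError(f"malformed hierarchy-store entry: {entry!r}")
        name, num, unit = m.group(1).lower(), int(m.group(2)), m.group(3).lower()
        if unit not in _UNITS:
            raise ValueError(f"unknown size unit {unit!r} in entry {entry!r}")
        layers.append((name, num * _UNITS[unit]))
    return layers
```

A bad entry such as `"ssd twohundred GB"` then produces one readable `ValueError` naming the offending entry, rather than an opaque failure at file-allocation time.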
[GitHub] spark issue #14783: SPARK-16785 R dapply doesn't return array or raw columns
Github user sun-rui commented on the issue: https://github.com/apache/spark/pull/14783 @clarkfitzg, your patch is a bug fix, not a performance improvement, right? If so, since there is no performance regression according to your benchmark, let's focus on the functionality. We can address performance issues in other JIRA issues.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76921850 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -88,15 +116,21 @@ case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { } }.getOrElse(0L) -// Update the Hive metastore if the total size of the table is different than the size -// recorded in the Hive metastore. -// This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats(). -if (newTotalSize > 0 && newTotalSize != oldTotalSize) { - sessionState.catalog.alterTable( -catalogTable.copy( - properties = relation.catalogTable.properties + -(AnalyzeTableCommand.TOTAL_SIZE_FIELD -> newTotalSize.toString))) -} +updateTableStats( + catalogTable, + oldTotalSize = catalogTable.catalogStats.map(_.sizeInBytes.toLong).getOrElse(0L), + oldRowCount = catalogTable.catalogStats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L), + newTotalSize = newTotalSize) + + // data source tables have been converted into LogicalRelations + case logicalRel: LogicalRelation if logicalRel.metastoreTableIdentifier.isDefined => +val tableIdentifier = logicalRel.metastoreTableIdentifier.get +val catalogTable = sessionState.catalog.getTableMetadata(tableIdentifier) +updateTableStats( + catalogTable, + oldTotalSize = logicalRel.statistics.sizeInBytes.toLong, + oldRowCount = logicalRel.statistics.rowCount.map(_.toLong).getOrElse(-1L), + newTotalSize = logicalRel.relation.sizeInBytes) case otherRelation => throw new AnalysisException(s"ANALYZE TABLE is only supported for Hive tables, " + --- End diff -- This message is out of date.
[GitHub] spark issue #14744: [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr s...
Github user sun-rui commented on the issue: https://github.com/apache/spark/pull/14744 LGTM
[GitHub] spark issue #14887: [SPARK-17321][YARN] YARN shuffle service should use good...
Github user SaintBacchus commented on the issue: https://github.com/apache/spark/pull/14887 If there are bad disks in local-dirs, `NodeManager` will not pass these bad disks to the Spark executor, so it's not necessary to check them.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76921588 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +169,81 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + private def checkMetastoreRelationStats( + tableName: String, + totalSize: Long, + rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation => + rel.statistics + assert(rel.statistics.sizeInBytes === totalSize) + assert(rel.statistics.rowCount === rowCount) +} +assert(relations.size === 1) + } + + private def checkLogicalRelationStats(tableName: String, rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: LogicalRelation => + assert(rel.statistics.sizeInBytes === rel.relation.sizeInBytes) --- End diff -- It seems that the Parquet file size differs between Windows and Linux. I set an expected value initially; it worked on Windows but failed in Spark CI.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14710 **[Test build #64697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64697/consoleFull)** for PR 14710 at commit [`5a2f30f`](https://github.com/apache/spark/commit/5a2f30f7a31bd8edba1932cabcaf71332837b92d).
[GitHub] spark pull request #14883: [SPARK-17319] [SQL] Move addJar from HiveSessionS...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14883#discussion_r76921445 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala --- @@ -509,4 +509,10 @@ class InMemoryCatalog( StringUtils.filterPattern(catalog(db).functions.keysIterator.toSeq, pattern) } + // -- + // Resources + // -- + + override def addJar(path: String): Unit = { /* no-op */ } --- End diff -- Will a no-op here make the user think that the jar is loaded?
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/14710 retest this please
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76921301 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala --- @@ -33,7 +33,8 @@ import org.apache.spark.util.Utils case class LogicalRelation( relation: BaseRelation, expectedOutputAttributes: Option[Seq[Attribute]] = None, -metastoreTableIdentifier: Option[TableIdentifier] = None) +metastoreTableIdentifier: Option[TableIdentifier] = None, +inheritedStats: Option[Statistics] = None) --- End diff -- it uses catalogStats of CatalogTable in MetastoreRelation
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76921020 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +169,81 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + private def checkMetastoreRelationStats( + tableName: String, + totalSize: Long, + rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation => + rel.statistics + assert(rel.statistics.sizeInBytes === totalSize) + assert(rel.statistics.rowCount === rowCount) +} +assert(relations.size === 1) + } + + private def checkLogicalRelationStats(tableName: String, rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: LogicalRelation => + assert(rel.statistics.sizeInBytes === rel.relation.sizeInBytes) --- End diff -- Can you put a comment to explain why you just compare these two values, instead of comparing them with the expected values?
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920921 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +169,81 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + private def checkMetastoreRelationStats( + tableName: String, + totalSize: Long, + rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation => + rel.statistics --- End diff -- this is useless, right?
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Merged build finished. Test FAILed.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14710 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64692/ Test FAILed.
[GitHub] spark issue #14710: [SPARK-16533][CORE] resolve deadlocking in driver when e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14710 **[Test build #64692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64692/consoleFull)** for PR 14710 at commit [`5a2f30f`](https://github.com/apache/spark/commit/5a2f30f7a31bd8edba1932cabcaf71332837b92d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920817 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -168,6 +169,81 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + private def checkMetastoreRelationStats( + tableName: String, + totalSize: Long, + rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation => + rel.statistics + assert(rel.statistics.sizeInBytes === totalSize) + assert(rel.statistics.rowCount === rowCount) +} +assert(relations.size === 1) + } + + private def checkLogicalRelationStats(tableName: String, rowCount: Option[BigInt]): Unit = { +val df = sql(s"SELECT * FROM $tableName") +val relations = df.queryExecution.analyzed.collect { case rel: LogicalRelation => + assert(rel.statistics.sizeInBytes === rel.relation.sizeInBytes) + assert(rel.statistics.rowCount === rowCount) +} +assert(relations.size === 1) + } + + test("test table-level statistics for hive tables created in HiveExternalCatalog") { +val textTable = "textTable" +val parquetTable = "parquetTable" +val orcTable = "orcTable" +withTable(textTable, parquetTable, orcTable) { + sql(s"CREATE TABLE $textTable (key STRING, value STRING) STORED AS TEXTFILE") + sql(s"INSERT INTO TABLE $textTable SELECT * FROM src") + + // noscan won't count the number of rows + sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan") + checkMetastoreRelationStats(textTable, 5812, None) + + // without noscan, we count the number of rows + sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS") + checkMetastoreRelationStats(textTable, 5812, Some(500)) + + // test whether the old stats are removed + sql(s"INSERT INTO TABLE $textTable SELECT * FROM src") + sql(s"ANALYZE TABLE $textTable COMPUTE 
STATISTICS noscan") + checkMetastoreRelationStats(textTable, 11624, None) + + // test statistics of LogicalRelation inherited from MetastoreRelation + sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET") + sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC") + sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src") + sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src") + sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS") + sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS") + + checkLogicalRelationStats(parquetTable, Some(500)) + + withSQLConf("spark.sql.hive.convertMetastoreOrc" -> "true") { +checkLogicalRelationStats(orcTable, Some(500)) + } +} + } + + test("test table-level statistics for data source table created in HiveExternalCatalog") { +val parquetTable = "parquetTable" +withTable(parquetTable) { + sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) USING PARQUET") --- End diff -- Maybe you can check its `CatalogTable` and confirm it is a datasource table through `DDLUtils.isDatasourceTable(table)`.
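As a side note for readers of the test above, the `noscan` distinction it exercises can be summarized in a few lines of Python (an illustrative in-memory stand-in for the metastore, not Spark code): `ANALYZE TABLE ... COMPUTE STATISTICS noscan` records only the total size, the full form also records a row count, and re-analyzing overwrites whatever was stored before.

```python
# Illustrative model of ANALYZE TABLE semantics; `stats` stands in for the
# metastore's per-table statistics.

def analyze(stats, table, total_size, row_count, noscan):
    # noscan: cheap pass that records size only; a full scan also counts rows.
    stats[table] = {"totalSize": total_size,
                    "rowCount": None if noscan else row_count}
    return stats[table]

stats = {}
analyze(stats, "textTable", 5812, 500, noscan=True)    # size only
analyze(stats, "textTable", 5812, 500, noscan=False)   # size and row count
analyze(stats, "textTable", 11624, 1000, noscan=True)  # old row count dropped
```

The last call mirrors the "test whether the old stats are removed" step in the suite: after the table grows and a `noscan` re-analysis, the previously collected row count must not survive.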
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920785 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala --- @@ -33,7 +33,8 @@ import org.apache.spark.util.Utils case class LogicalRelation( relation: BaseRelation, expectedOutputAttributes: Option[Seq[Attribute]] = None, -metastoreTableIdentifier: Option[TableIdentifier] = None) +metastoreTableIdentifier: Option[TableIdentifier] = None, +inheritedStats: Option[Statistics] = None) --- End diff -- For MetastoreRelation, isn't LogicalRelation simply using MetastoreRelation's statistics?
[GitHub] spark issue #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create Alte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14874 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64690/ Test PASSed.
[GitHub] spark issue #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create Alte...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14874 Merged build finished. Test PASSed.
[GitHub] spark issue #14874: [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create Alte...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14874 **[Test build #64690 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64690/consoleFull)** for PR 14874 at commit [`51726ff`](https://github.com/apache/spark/commit/51726ff82fa818717f9ec52b89ca17a62ca8bb14). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class AlterViewAsCommand(`
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920593

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---

```scala
      TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false)
  }

  private def checkMetastoreRelationStats(
      tableName: String,
      totalSize: Long,
      rowCount: Option[BigInt]): Unit = {
    val df = sql(s"SELECT * FROM $tableName")
    val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation =>
      rel.statistics
      assert(rel.statistics.sizeInBytes === totalSize)
      assert(rel.statistics.rowCount === rowCount)
    }
    assert(relations.size === 1)
  }

  private def checkLogicalRelationStats(tableName: String, rowCount: Option[BigInt]): Unit = {
    val df = sql(s"SELECT * FROM $tableName")
    val relations = df.queryExecution.analyzed.collect { case rel: LogicalRelation =>
      assert(rel.statistics.sizeInBytes === rel.relation.sizeInBytes)
      assert(rel.statistics.rowCount === rowCount)
    }
    assert(relations.size === 1)
  }

  test("test table-level statistics for hive tables created in HiveExternalCatalog") {
    val textTable = "textTable"
    val parquetTable = "parquetTable"
    val orcTable = "orcTable"
    withTable(textTable, parquetTable, orcTable) {
      sql(s"CREATE TABLE $textTable (key STRING, value STRING) STORED AS TEXTFILE")
      sql(s"INSERT INTO TABLE $textTable SELECT * FROM src")
```

--- End diff --

To ensure correctness, we should also call `checkMetastoreRelationStats` before the data changes (`INSERT`) and before statistics collection (`ANALYZE`).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920340

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (same hunk as quoted earlier) ---

```scala
  test("test table-level statistics for hive tables created in HiveExternalCatalog") {
```

--- End diff --

Can you split this test case into multiple smaller, independent ones?
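A rough sketch of the split the reviewer asks for might look as follows (test names and bodies are illustrative, not taken from the PR):

```scala
// Illustrative only: one independent test per storage format, so a failure in,
// say, the ORC path does not mask the results for the text and Parquet paths.
test("table-level statistics for a Hive text table") {
  val textTable = "textTable"
  withTable(textTable) {
    sql(s"CREATE TABLE $textTable (key STRING, value STRING) STORED AS TEXTFILE")
    sql(s"INSERT INTO TABLE $textTable SELECT * FROM src")
    // `noscan` collects totalSize but not the row count.
    sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
    checkMetastoreRelationStats(textTable, 5812, None)
  }
}

test("table-level statistics for a Hive Parquet table") {
  // same shape, with STORED AS PARQUET and its own expected sizes
}
```

Each test then sets up and tears down its own table via `withTable`, so they can run (and fail) independently.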
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920328

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---

```diff
 case class LogicalRelation(
     relation: BaseRelation,
     expectedOutputAttributes: Option[Seq[Attribute]] = None,
-    metastoreTableIdentifier: Option[TableIdentifier] = None)
+    metastoreTableIdentifier: Option[TableIdentifier] = None,
+    inheritedStats: Option[Statistics] = None)
```

--- End diff --

Since a `LogicalRelation` is converted from a Parquet/ORC `MetastoreRelation` or a `SimpleCatalogRelation`, I think the current name is more indicative.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920300

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (same hunk as quoted earlier) ---

```scala
      // noscan won't count the number of rows
      sql(s"ANALYZE TABLE $textTable COMPUTE STATISTICS noscan")
      checkMetastoreRelationStats(textTable, 5812, None)
```

--- End diff --

`checkMetastoreRelationStats(textTable, 5812, None)` => `checkMetastoreRelationStats(textTable, expectedTotalSize = 5812, expectedRowCount = None)`
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76920234

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (same hunk as quoted earlier) ---

```scala
  private def checkMetastoreRelationStats(
      tableName: String,
      totalSize: Long,
      rowCount: Option[BigInt]): Unit = {
```

--- End diff --

`totalSize` => `expectedTotalSize`, `rowCount` => `expectedRowCount`
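Putting the two renaming suggestions together, the helper and its call site might look like this (a sketch of the reviewer's proposal, not the merged code):

```scala
// Sketch: parameter names state what is expected, and named arguments make
// the call site self-documenting.
private def checkMetastoreRelationStats(
    tableName: String,
    expectedTotalSize: Long,
    expectedRowCount: Option[BigInt]): Unit = {
  val df = sql(s"SELECT * FROM $tableName")
  val relations = df.queryExecution.analyzed.collect { case rel: MetastoreRelation =>
    assert(rel.statistics.sizeInBytes === expectedTotalSize)
    assert(rel.statistics.rowCount === expectedRowCount)
  }
  // Exactly one MetastoreRelation is expected in the analyzed plan.
  assert(relations.size === 1)
}

// At the call site, it is now obvious which value is which:
checkMetastoreRelationStats(textTable, expectedTotalSize = 5812, expectedRowCount = None)
```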
[GitHub] spark pull request #14859: [SPARK-17200][PROJECT INFRA][BUILD][SparkR] Autom...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/14859#discussion_r76920073

--- Diff: appveyor.yml ---

```yaml
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

version: "{build}-{branch}"

shallow_clone: true

platform: x64
configuration: Debug

cache:
  - C:\Users\appveyor\.m2

install:
  # Install maven and dependencies
  - ps: .\dev\appveyor-install-dependencies.ps1
  # Required package for R unit tests
  - cmd: R -e "install.packages('testthat', repos='http://cran.us.r-project.org')"

build_script:
  - cmd: mvn -DskipTests -Psparkr package
```

--- End diff --

Thanks @dongjoon-hyun! I am testing with extra profiles. I will take a look and address your comment as far as I can!
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76919947

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---

```diff
-  def alterTable(tableDefinition: CatalogTable): Unit = {
+  def alterTable(tableDefinition: CatalogTable, fromAnalyze: Boolean = false): Unit = {
```

--- End diff --

please see my [comment](https://github.com/apache/spark/pull/14712#discussion_r76919518)
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76919518

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---

```scala
  override def alterTableStats(tableDefinition: CatalogTable): Unit = withClient {
```

--- End diff --

@viirya There are two ways to set/replace these properties:

1. Use the statistics info in `CatalogTable` to set the properties — this is the path taken by the ANALYZE command, and I put that logic into the `alterTableStats` method.
2. Set the properties directly — this is the path of the ALTER TABLE command.

If we put the `alterTableStats` logic into the original `alterTable` method, we could no longer set the properties via the second path, because they would always be replaced by the statistics in `CatalogTable`.
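The separation described above can be sketched roughly as follows (method bodies, property keys, and the `setTableProperties` helper are all illustrative, not the actual `HiveExternalCatalog` code):

```scala
// Sketch only: why table statistics get their own entry point.

// Path 1 (ANALYZE): derive the stats-related table properties from the
// statistics recorded in CatalogTable, overwriting whatever stats
// properties were stored before.
def alterTableStats(table: CatalogTable): Unit = {
  val statsProps: Map[String, String] = table.stats.map { s =>
    Map("totalSize" -> s.sizeInBytes.toString) ++
      s.rowCount.map(c => "numRows" -> c.toString)
  }.getOrElse(Map.empty)
  setTableProperties(table.identifier, statsProps) // hypothetical helper
}

// Path 2 (ALTER TABLE ... SET TBLPROPERTIES): write the user-supplied
// properties as-is. If this went through the logic above instead, the
// user's values would always be clobbered by the stats in CatalogTable.
def alterTable(table: CatalogTable): Unit =
  setTableProperties(table.identifier, table.properties)
```

Keeping the two entry points separate is what lets the ALTER TABLE path remain a plain property write while ANALYZE owns the statistics keys.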
[GitHub] spark pull request #14868: [SPARK-16283][SQL][WIP] Implements percentile_app...
Github user clockfly commented on a diff in the pull request: https://github.com/apache/spark/pull/14868#discussion_r76918794

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.expressions.aggregate

import java.nio.ByteBuffer

import com.google.common.primitives.{Doubles, Ints, Longs}

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile.PercentileDigest
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.catalyst.util.QuantileSummaries
import org.apache.spark.sql.catalyst.util.QuantileSummaries.{defaultCompressThreshold, Stats}
import org.apache.spark.sql.types._

/**
 * The ApproximatePercentile function returns the approximate percentile(s) of a column at the
 * given percentage(s). A percentile is a watermark value below which a given percentage of the
 * column values fall. For example, the percentile of column `col` at percentage 50% is the
 * median of column `col`.
 *
 * This function supports partial aggregation.
 *
 * @param child child expression that can produce column value with `child.eval(inputRow)`
 * @param percentageExpression Expression that represents a single percentage value or
 *                             an array of percentage values. Each percentage value must be
 *                             between 0.0 and 1.0.
 * @param accuracyExpression Integer literal expression of approximation accuracy. Higher value
 *                           yields better accuracy, the default value is
 *                           DEFAULT_PERCENTILE_ACCURACY.
 */
@ExpressionDescription(
  usage =
    """
      _FUNC_(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
      column `col` at the given percentage. The value of percentage must be between 0.0
      and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
      controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
      better accuracy, `1.0/accuracy` is the relative error of the approximation.

      _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
      percentile array of column `col` at the given percentage array. Each value of the
      percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000)
      is a positive integer literal which controls approximation accuracy at the cost of memory.
      Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
      the approximation.
    """)
case class ApproximatePercentile(
    child: Expression,
    percentageExpression: Expression,
    accuracyExpression: Expression,
    override val mutableAggBufferOffset: Int,
    override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[PercentileDigest] {

  def this(child: Expression, percentageExpression: Expression, accuracyExpression: Expression) = {
    this(child, percentageExpression, accuracyExpression, 0, 0)
  }

  def this(child: Expression, percentageExpression: Expression) = {
    this(child, percentageExpression, Literal(ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
  }

  // Mark as lazy so that accuracyExpression is not evaluated during tree transformation.
  private lazy val
```
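For context, the function described in the diff above would be used like this once registered (a hypothetical session; the exact results depend on the data and on the `accuracy` chosen):

```scala
// Sketch: calling percentile_approx from SQL. Session setup and values are
// illustrative; the function name and argument shapes follow the usage
// string quoted in the diff above.
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[2]")
  .appName("percentile-approx-demo")
  .getOrCreate()

spark.range(0, 1000).createOrReplaceTempView("t")

// Median (percentage = 0.5), with the default accuracy.
spark.sql("SELECT percentile_approx(id, 0.5) FROM t").show()

// Several percentiles at once, with an explicit accuracy: a higher value
// trades memory for precision (relative error is roughly 1.0 / accuracy).
spark.sql("SELECT percentile_approx(id, array(0.25, 0.5, 0.75), 10000) FROM t").show()
```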
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14868 **[Test build #64696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64696/consoleFull)** for PR 14868 at commit [`3f08c02`](https://github.com/apache/spark/commit/3f08c027add03c59251583420c76582a085b3573).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76918381

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (same hunk as quoted earlier) ---

```scala
    inheritedStats: Option[Statistics] = None)
```

--- End diff --

How about `expectedStatistics`, or `catalogStatistics`?
[GitHub] spark issue #14888: [SPARK-17324] [SQL] Remove Direct Usage of HiveClient in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14888 **[Test build #64694 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64694/consoleFull)** for PR 14888 at commit [`d03e65d`](https://github.com/apache/spark/commit/d03e65d0f9b119ed767da124da360cfcf9e966b8).
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14868 **[Test build #64695 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64695/consoleFull)** for PR 14868 at commit [`bc70a00`](https://github.com/apache/spark/commit/bc70a0023bb24175c06c03cb7acad7f9a6d34e36).
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r76918174

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---

```diff
-  def alterTable(tableDefinition: CatalogTable): Unit = {
+  def alterTable(tableDefinition: CatalogTable, fromAnalyze: Boolean = false): Unit = {
```

--- End diff --

The additional flag parameter `fromAnalyze` looks weird to me. Why would we need two behaviors behind one alter-table API, selected by a flag like this?
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64687/
[GitHub] spark issue #14868: [SPARK-16283][SQL][WIP] Implements percentile_approx agg...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14868 Merged build finished. Test PASSed.