[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795772

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
--- End diff --

@gatorsmile OK.
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795763

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -61,6 +63,37 @@ abstract class SubqueryExpression(
   }
 }

+/**
+ * This expression is used to represent any form of subquery expression namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient expression
+ * that only lives in the scope of sameResult function call. In other words, analyzer,
+ * optimizer or planner never sees this expression type during transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends UnaryExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def child: Expression = expr
+  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include the
+  // sub query plan. Doing so causes issue when we canonicalize expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+    val h = Objects.hashCode(expr.children)
+    h * 31 + Objects.hashCode(this.getClass.getName)
+  }
+
+  override def equals(o: Any): Boolean = o match {
+    case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
+    case other => false
--- End diff --

@gatorsmile Will change. Thanks.
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795750

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
+   * plans which has subquery expressions.
+   */
+  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
+    // Normalize the outer references in the subquery plan.
+    val subPlan = e.plan.transformAllExpressions {
+      case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
--- End diff --

@gatorsmile Will change. Thanks.
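For readers following the thread, a rough sketch of how the canonicalization above is meant to be used before a `sameResult` comparison. The call site, the names `plan` and `plan.output`, and the wrapping rule are assumptions made purely for illustration; they are not taken from the PR:

```Scala
// Illustrative sketch, not the PR's actual code: wrap each subquery expression in its
// canonical form so that outer references get bound against `plan.output` and
// expression-ID differences between two otherwise identical plans no longer matter.
val normalized = plan.transformAllExpressions {
  case s: SubqueryExpression => SubqueryExpression.canonicalize(s, plan.output)
}
// `normalized` can then be compared with another canonicalized plan via sameResult.
```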
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74802/testReport)** for PR 17343 at commit [`00da825`](https://github.com/apache/spark/commit/00da8254d060291fe6f2fdec3e30b2f30d5a69c8).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74793/ Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Merged build finished. Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74793/testReport)** for PR 16971 at commit [`ed6dacd`](https://github.com/apache/spark/commit/ed6dacdb3e3bdfd4e9ccb5c57bf8b4118636b0c6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17191 We have the same limitation. To do it in MySQL and Postgres, you need to use quotes/backticks.
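To make the feature under discussion concrete, this is the kind of statement SPARK-14471 targets: an alias defined in the SELECT list and reused in GROUP BY. The table and column names below are invented for illustration and are not taken from the PR; `spark` is assumed to be a SparkSession:

```Scala
// Hypothetical query: `prefix` is a SELECT-list alias, not a base column.
spark.sql(
  """
    |SELECT substr(name, 1, 3) AS prefix, count(*) AS cnt
    |FROM people
    |GROUP BY prefix
  """.stripMargin)
```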
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17330 Generally, it looks good to me. cc @hvanhovell @rxin @cloud-fan
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795294

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -655,6 +663,148 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
     }
   }

+  test("SPARK-19993 subquery caching") {
+    withTempView("t1", "t2") {
+      Seq(1).toDF("c1").createOrReplaceTempView("t1")
+      Seq(1).toDF("c1").createOrReplaceTempView("t2")
+
+      val ds1 =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(ds1) == 0)
+
+      ds1.cache()
+
+      val cachedDs =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(cachedDs) == 1)
+
+      // Additional predicate in the subquery plan should cause a cache miss
+      val cachedMissDs =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1 where c1 = 0)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(cachedMissDs) == 0)
+
+      // Simple correlated predicate in subquery
+      val ds2 =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(ds2) == 0)
+
+      ds2.cache()
+
+      val cachedDs2 =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+          """.stripMargin)
+
+      assert(getNumInMemoryRelations(cachedDs2) == 1)
+
+      spark.catalog.cacheTable("t1")
+      ds1.unpersist()
--- End diff --

How about splitting the test case into multiple individual ones? The cache will be cleaned for each test case. This can avoid any extra checking, like `assert(getNumInMemoryRelations(cachedMissDs) == 0)` or `ds1.unpersist()`:

```Scala
override def afterEach(): Unit = {
  try {
    spark.catalog.clearCache()
  } finally {
    super.afterEach()
  }
}
```
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74792/ Test PASSed.
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Merged build finished. Test PASSed.
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342 **[Test build #74792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74792/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795228

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
--- End diff --

Also replace `SubqueryExpression` by `CanonicalizedSubqueryExpr`
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795140

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -61,6 +63,37 @@ abstract class SubqueryExpression(
   }
 }

+/**
+ * This expression is used to represent any form of subquery expression namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient expression
+ * that only lives in the scope of sameResult function call. In other words, analyzer,
+ * optimizer or planner never sees this expression type during transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends UnaryExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def child: Expression = expr
+  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include the
+  // sub query plan. Doing so causes issue when we canonicalize expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+    val h = Objects.hashCode(expr.children)
+    h * 31 + Objects.hashCode(this.getClass.getName)
+  }
+
+  override def equals(o: Any): Boolean = o match {
+    case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
+    case other => false
--- End diff --

```Scala
case CanonicalizedSubqueryExpr(e) => expr.semanticEquals(e)
case _ => false
```
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795081

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
--- End diff --

Nit: `QueryPlan.sameResult`
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17170 Could you update this PR to have the parameter itemsCol and remove predictionCol? (If I recall, we don't expose that in the R API for other models either.)
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795011

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
+   * plans which has subquery expressions.
+   */
+  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
+    // Normalize the outer references in the subquery plan.
+    val subPlan = e.plan.transformAllExpressions {
+      case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
--- End diff --

Nit: `case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)`
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74800 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74800/testReport)** for PR 17343 at commit [`1834db6`](https://github.com/apache/spark/commit/1834db60b7f504862f6ef03bc828264c65bdabd3).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16971 LGTM pending Jenkins. cc @thunterdb @MLnick
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74801 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74801/testReport)** for PR 16596 at commit [`e33b50a`](https://github.com/apache/spark/commit/e33b50aae78c79a425ab1e935498919eb0350c97).
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17343 cc - @rxin, @squito, @zsxwing
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74799/testReport)** for PR 16596 at commit [`1821e21`](https://github.com/apache/spark/commit/1821e21483904cf2890e9c7ba420d72a20623a74).
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74798/testReport)** for PR 17343 at commit [`e9ac76e`](https://github.com/apache/spark/commit/e9ac76edb055d08699d9de7a5ff77b7ca8a7f5c6).
[GitHub] spark pull request #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream ...
GitHub user sitalkedia opened a pull request:
https://github.com/apache/spark/pull/17343

[SPARK-20014] Optimize mergeSpillsWithFileStream method

## What changes were proposed in this pull request?

When the individual partition size in a spill is small, the mergeSpillsWithTransferTo method does many small disk IOs, which is really inefficient. One way to improve performance is to use the mergeSpillsWithFileStream method instead, by turning off transferTo and using buffered file read/write to improve the IO throughput. However, the current implementation of mergeSpillsWithFileStream does not do a buffered read/write of the files, and in addition it unnecessarily flushes the output files for each partition.

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark upstream_mergeSpillsWithFileStream

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17343.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17343

commit e9ac76edb055d08699d9de7a5ff77b7ca8a7f5c6
Author: Sital Kedia
Date: 2017-03-19T00:24:10Z

    [SPARK-20014] Optimize mergeSpillsWithFileStream method
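As a rough illustration of the buffering idea in the summary above (wrap the raw spill-file streams in buffered streams and flush the output only once, rather than doing many tiny reads/writes and a flush per partition), here is a self-contained sketch. It is not the UnsafeShuffleWriter code; the helper name, buffer sizes, and the omission of compression/encryption wrappers are simplifying assumptions:

```Scala
import java.io._

// Copies a set of spill files into one output file using buffered streams.
// Buffered IO turns many small per-partition writes into large sequential ones;
// the output is flushed and closed once at the end, not once per partition.
def mergeBuffered(spills: Seq[File], out: File, bufferSize: Int = 1 << 20): Unit = {
  val output = new BufferedOutputStream(new FileOutputStream(out), bufferSize)
  try {
    val buf = new Array[Byte](8192)
    spills.foreach { spill =>
      val input = new BufferedInputStream(new FileInputStream(spill), bufferSize)
      try {
        var n = input.read(buf)
        while (n != -1) {
          output.write(buf, 0, n)
          n = input.read(buf)
        }
      } finally {
        input.close()
      }
    }
  } finally {
    output.close() // single flush + close for the whole merged file
  }
}
```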
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74797/testReport)** for PR 16596 at commit [`cc44ae5`](https://github.com/apache/spark/commit/cc44ae577a972c26623e26d349aa2990d33b5b28).
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74796/testReport)** for PR 16596 at commit [`61d6ba6`](https://github.com/apache/spark/commit/61d6ba64774b4c65a4c05f69e1a97f4f978464db).
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16596 Updated. Pretty sure this is an issue on Windows only.
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74795/testReport)** for PR 16596 at commit [`f37c891`](https://github.com/apache/spark/commit/f37c891a8f38c244c8be7c452581778d1e2e180f).
[GitHub] spark issue #17341: [SPARK-20013][SQL]add a newTablePath parameter for renam...
Github user windpiger commented on the issue: https://github.com/apache/spark/pull/17341 cc @cloud-fan @gatorsmile
[GitHub] spark issue #17338: [SPARK-19990][SQL][test-maven]create a temp file for fil...
Github user windpiger commented on the issue: https://github.com/apache/spark/pull/17338 Yes, it is. The `jar:file://` URI will be re-resolved by `new Path` later, and will throw the exception described in the JIRA.
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16330 **[Test build #74794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74794/testReport)** for PR 16330 at commit [`ac9fbfc`](https://github.com/apache/spark/commit/ac9fbfc5d511877f7775c620ff8e1c672880ee50).
[GitHub] spark pull request #16330: [SPARK-18817][SPARKR][SQL] change derby log outpu...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/16330#discussion_r106794245

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -2909,6 +2910,30 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column
   expect_equal(class(ldf3$col3), c("POSIXct", "POSIXt"))
 })

+compare_list <- function(list1, list2) {
+  # get testthat to show the diff by first making the 2 lists equal in length
+  expect_equal(length(list1), length(list2))
+  l <- max(length(list1), length(list2))
+  length(list1) <- l
+  length(list2) <- l
+  expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE))
+}
+
+# This should always be the last test in this test file.
+test_that("No extra files are created in SPARK_HOME by starting session and making calls", {
+  # Check that it is not creating any extra file.
+  # Does not check the tempdir which would be cleaned up after.
+  filesAfter <- list.files(path = sparkRDir, all.files = TRUE)
+
+  expect_true(length(sparkRFilesBefore) > 0)
+  # first, ensure derby.log is not there
+  expect_false("derby.log" %in% filesAfter)
+  # second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
--- End diff --

Agreed. Updated, hope it's better now.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106793811

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
@@ -425,8 +425,8 @@ object FunctionRegistry {
     expression[BitwiseXor]("^"),

     // json
-    expression[StructToJson]("to_json"),
-    expression[JsonToStruct]("from_json"),
+    expression[StructsToJson]("to_json"),
+    expression[JsonToStructs]("from_json"),
--- End diff --

+ @maropu @gatorsmile
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74793/testReport)** for PR 16971 at commit [`ed6dacd`](https://github.com/apache/spark/commit/ed6dacdb3e3bdfd4e9ccb5c57bf8b4118636b0c6).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16971 retest this please
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106794013

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()

   @transient
-  lazy val gen =
-    new JacksonGenerator(
-      child.dataType.asInstanceOf[StructType],
-      writer,
-      new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+    rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts rows to the JSON output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+    def getAndReset(): UTF8String = {
+      gen.flush()
+      val json = writer.toString
+      writer.reset()
+      UTF8String.fromString(json)
+    }
+
+    child.dataType match {
+      case _: StructType =>
+        (row: Any) =>
+          gen.write(row.asInstanceOf[InternalRow])
+          getAndReset()
+      case ArrayType(_: StructType, _) =>
+        (arr: Any) =>
+          gen.write(arr.asInstanceOf[ArrayData])
+          getAndReset()
+    }
+  }

   override def dataType: DataType = StringType

-  override def checkInputDataTypes(): TypeCheckResult = {
-    if (StructType.acceptsType(child.dataType)) {
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+    case _: StructType | ArrayType(_: StructType, _) =>
--- End diff --

This seems to become a bit more strict than before? Was: `StructType.acceptsType` - anything can be accepted as struct; now: `match { case _: StructType` - must be struct. Am I understanding this correctly?
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r106794032

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,340 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{BaseTableAccess, ExtractFiltersAndInnerJoins}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class DetectStarSchemaJoin(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *   1. Informational RI constraints (reliable detection)
+   *      + Dimension contains a primary key that is being joined to the fact table.
+   *      + Fact table contains foreign keys referencing multiple dimension tables.
+   *   2. Cardinality based heuristics
+   *      + Usually, the table with the highest cardinality is the fact table.
+   *      + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
--- End diff --

@cloud-fan Yes, currently I am only generating the star join for the largest fact table. I can generalize the algorithm once we use CBO.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106793959

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()

   @transient
-  lazy val gen =
-    new JacksonGenerator(
-      child.dataType.asInstanceOf[StructType],
-      writer,
-      new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+    rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
--- End diff --

could we end up with `ArrayType(StructType, something_not_struct)` in this match?
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106793800

--- Diff: python/pyspark/sql/functions.py ---
@@ -1774,10 +1774,11 @@ def json_tuple(col, *fields):
 def from_json(col, schema, options={}):
     """
     Parses a column containing a JSON string into a [[StructType]] or [[ArrayType]]
-    with the specified schema. Returns `null`, in the case of an unparseable string.
+    of [[StructType]]s with the specified schema. Returns `null`, in the case of an unparseable
+    string.
--- End diff --

if we are updating this in python, we should update R too?
https://github.com/HyukjinKwon/spark/blob/185ea6003d60feed20c56de61c17bc304663d99a/R/pkg/R/functions.R#L2440
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r106793947

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/StarJoinSuite.scala ---
@@ -0,0 +1,488 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.sql.{Row, _}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+
+class StarJoinSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
--- End diff --

Yes! This is a good idea. It might take time to rewrite the whole thing. How about doing it as a follow-up PR?
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r106793898

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *   1. Informational RI constraints (reliable detection)
+   *      + Dimension contains a primary key that is being joined to the fact table.
+   *      + Fact table contains foreign keys referencing multiple dimension tables.
+   *   2. Cardinality based heuristics
+   *      + Usually, the table with the highest cardinality is the fact table.
+   *      + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
+   * the star join with the largest fact table. Choosing the largest fact table on the
+   * driving arm to avoid large inners is in general a good heuristic. This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if they are eligible
+   * for star join detection. An eligible plan is a base table access with valid statistics.
+   * A base table access represents Project or Filter operators above a LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer column uniqueness,
+   * the algorithm compares the number of distinct values with the total number of rows in the
+   * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+      input: Seq[LogicalPlan],
+      conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+    val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+    if (!conf.starSchemaDetection || input.size < 2) {
+      emptyStarJoinPlan
+    } else {
+      // Find if the input plans are eligible for star join detection.
+      // An eligible plan is a base table access with valid statistics.
+      val foundEligibleJoin = input.forall {
+        case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true
+        case _ => false
+      }
+
+      if (!foundEligibleJoin) {
+        // Some plans don't have stats or are complex plans. Conservatively,
+        // return an empty star join. This restriction can be lifted
+        // once statistics are propagated in the plan.
+        emptyStarJoinPlan
+      } else {
+        // Find the fact table using cardinality based heuristics i.e.
+        // the table with the largest number of rows.
+        val sortedFactTables = input.map { plan =>
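To restate the heuristics described in the doc comment above in runnable form, here is a small, self-contained sketch: pick the plan with the largest row count as the fact-table candidate, and treat a join column as a primary-key candidate when its distinct-value count is close to the table's row count. The types, names, and exact error bound below are simplifying assumptions, not the PR's actual implementation:

```Scala
// Illustrative only: simplified stand-ins for the statistics Catalyst would provide.
case class TableStats(name: String, rowCount: BigInt, keyDistinctCount: BigInt)

// Cardinality heuristic: the table with the largest number of rows is the fact-table candidate.
def chooseFactTable(tables: Seq[TableStats]): Option[TableStats] =
  if (tables.size < 2) None else Some(tables.maxBy(_.rowCount))

// A join column is treated as (nearly) unique, i.e. a primary-key candidate, when its
// distinct-value count is within a small relative error of the table's row count.
def isPrimaryKeyCandidate(t: TableStats, ndvMaxError: Double): Boolean = {
  if (t.rowCount == BigInt(0)) {
    false
  } else {
    val relDiff = (t.keyDistinctCount - t.rowCount).abs.toDouble / t.rowCount.toDouble
    relDiff <= ndvMaxError * 2
  }
}
```

Dimension tables would then be the plans that join to the fact-table candidate on such a key column, mirroring the RI-based step in the description.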
[GitHub] spark pull request #16982: [SPARK-19654][SPARKR][SS] Structured Streaming AP...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16982
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74791/ Test PASSed.
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16626 Merged build finished. Test PASSed.
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16626 **[Test build #74791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74791/testReport)** for PR 16626 at commit [`a28fc42`](https://github.com/apache/spark/commit/a28fc42cee6dc52516e074ced7c4351ee6baa45d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/17250 @budde Do you think you can update this PR? The 2.2 branch will be cut on Monday (2017-03-18).
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16626 Adding a check in the existing test case to see if `HIVE_TYPE_STRING` is correctly populated in the metadata. LGTM except a few minor comments. cc @cloud-fan
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/16626#discussion_r106792988

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -2178,4 +2177,136 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
       }
     }
   }
+
+  Seq("parquet", "json", "csv").foreach { provider =>
--- End diff --

Let us introduce another variable.

```Scala
val supportedNativeFileFormatsForAlterTableAddColumns = Seq("parquet", "json", "csv")
```
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792941 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,74 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +try { + sparkSession.catalog.uncacheTable(table.quotedString) +} catch { + case NonFatal(e) => +log.warn(s"Exception when attempting to uncache table ${table.quotedString}", e) +} +catalog.refreshTable(table) +catalog.alterTableSchema( + table, catalogTable.schema.copy(fields = catalogTable.schema.fields ++ columns)) + +Seq.empty[Row] + } + + /** + * ALTER TABLE ADD COLUMNS command does not support temporary view/table, + * view, or datasource table with text, orc formats or external provider. + * For datasource table, it currently only supports parquet, json, csv. + */ + private def verifyAlterTableAddColumn( + catalog: SessionCatalog, + table: TableIdentifier): CatalogTable = { +val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table) + +if (catalogTable.tableType == CatalogTableType.VIEW) { + throw new AnalysisException( +s""" + |ALTER ADD COLUMNS does not support views. + |You must drop and re-create the views for adding the new columns. Views: $table + """.stripMargin) +} + +if (DDLUtils.isDatasourceTable(catalogTable)) { + DataSource.lookupDataSource(catalogTable.provider.get).newInstance() match { +// For datasource table, this command can only support the following File format. +// TextFileFormat only default to one column "value" +// OrcFileFormat can not handle difference between user-specified schema and +// inferred schema yet. TODO, once this issue is resolved , we can add Orc back. +// Hive type is already considered as hive serde table, so the logic will not +// come in here. +case _: JsonFileFormat | _: CSVFileFormat | _: ParquetFileFormat => +case s => + throw new AnalysisException( +s""" + |ALTER ADD COLUMNS does not support datasource table with type $s. + |You must drop and re-create the table for adding the new columns. Tables: $table + """.stripMargin) --- End diff -- Nit: fix the format --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
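To make the supported/unsupported split in the quoted `verifyAlterTableAddColumn` concrete, here is a hedged end-to-end sketch (illustrative only; it assumes an active SparkSession named `spark`):

```Scala
import org.apache.spark.sql.AnalysisException

// Native file-format datasource tables (parquet/json/csv) accept ADD COLUMNS ...
spark.sql("CREATE TABLE t_parquet (c1 INT) USING parquet")
spark.sql("ALTER TABLE t_parquet ADD COLUMNS (c2 INT COMMENT 'added later')")

// ... while a text-based datasource table is rejected, since TextFileFormat is not
// in the supported list matched above.
spark.sql("CREATE TABLE t_text (value STRING) USING text")
try {
  spark.sql("ALTER TABLE t_text ADD COLUMNS (c2 INT)")
} catch {
  case e: AnalysisException => println(s"rejected as expected: ${e.getMessage}")
}
```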
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792952 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -165,7 +165,6 @@ class InMemoryCatalogedDDLSuite extends DDLSuite with SharedSQLContext with Befo assert(e.contains("Hive support is required to CREATE Hive TABLE (AS SELECT)")) } } - --- End diff -- revert it back --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792918 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -450,6 +451,21 @@ abstract class SessionCatalogSuite extends PlanTest { } } + test("alter table add columns") { --- End diff -- Also add a negative test case for dropping columns, although we do not support it now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
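A rough sketch of such a negative test (it assumes that dropping columns via `alterTableSchema` is rejected with an `AnalysisException`; the exception type is an assumption, and pinning that behaviour down is exactly what the test would do):

```Scala
// Hypothetical negative test: a schema that drops an existing column is rejected.
test("alter table drop columns is not supported") {
  withBasicCatalog { catalog =>
    catalog.createTable(newTable("t1", "default"), ignoreIfExists = false)
    val oldTab = catalog.externalCatalog.getTable("default", "t1")
    // Build a schema that is missing the first column of the original table.
    val droppedSchema = StructType(oldTab.schema.fields.drop(1))
    intercept[AnalysisException] {
      catalog.alterTableSchema(TableIdentifier("t1", Some("default")), droppedSchema)
    }
  }
}
```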
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792896 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -450,6 +451,21 @@ abstract class SessionCatalogSuite extends PlanTest { } } + test("alter table add columns") { +withBasicCatalog { sessionCatalog => + sessionCatalog.createTable(newTable("t1", "default"), ignoreIfExists = false) + val oldTab = sessionCatalog.externalCatalog.getTable("default", "t1") + sessionCatalog.alterTableSchema( +TableIdentifier("t1", Some("default")), oldTab.schema.add("c3", IntegerType)) + + val newTab = sessionCatalog.externalCatalog.getTable("default", "t1") + // construct the expected table schema + val oldTabSchema = StructType(oldTab.dataSchema.fields ++ --- End diff -- `oldTabSchema ` -> `expectedTabSchema ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342 **[Test build #74792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74792/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/17342 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/17342 `org.apache.spark.storage.BlockManagerProactiveReplicationSuite.proactive block replication - 3 replicas - 2 block manager deletions` failed, but it passed locally. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17297 >> I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted; more tasks can complete between the time of the first failure and the time the stage is resubmitted. Actually, I realized that this is not true. If you look at the code (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1419), when a stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output, and the scheduler therefore re-launches all tasks in the stage with the fetch failure that hadn't completed when the fetch failure occurred. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
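A simplified toy model of the behaviour being described here (this is not Spark's actual `OutputCommitCoordinator` API, just the "deny late commits once the stage's state is cleared" idea):

```Scala
import scala.collection.mutable

// Toy commit coordinator: once a stage's state is cleared (as happens when the stage
// fails with a fetch failure), any late commit request for that stage is denied, so
// those tasks have to be re-run as part of the resubmitted stage.
class ToyCommitCoordinator {
  // stage -> (partition -> attempt that is authorized to commit)
  private val authorized = mutable.Map.empty[Int, mutable.Map[Int, Int]]

  def stageStart(stage: Int): Unit = { authorized(stage) = mutable.Map.empty }
  def stageEnd(stage: Int): Unit = { authorized.remove(stage) } // e.g. on fetch failure

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean =
    authorized.get(stage) match {
      case None => false // stage state cleared -> late commits are denied
      case Some(parts) => parts.get(partition) match {
        case Some(winner) => winner == attempt // only the first authorized attempt wins
        case None => parts(partition) = attempt; true
      }
    }
}
```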
[GitHub] spark issue #17334: [SPARK-19998][Block Manager]BlockRDD block not found Exc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17334 **[Test build #3603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3603/testReport)** for PR 17334 at commit [`98393f8`](https://github.com/apache/spark/commit/98393f8e346be0cd1cbe73b6861668e17446990d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74790/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342 **[Test build #74790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74790/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106791067 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/StarJoinSuite.scala --- @@ -0,0 +1,488 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.{Row, _} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SQLTestUtils + + +class StarJoinSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { --- End diff -- @cloud-fan I will look at ```JoinReorderSuite```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790932 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16330 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16330 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74788/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16330 **[Test build #74788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74788/testReport)** for PR 16330 at commit [`2eb75f8`](https://github.com/apache/spark/commit/2eb75f8d73ec3ce1eb85d7c501e4c072499b8f44). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16626 **[Test build #74791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74791/testReport)** for PR 16626 at commit [`a28fc42`](https://github.com/apache/spark/commit/a28fc42cee6dc52516e074ced7c4351ee6baa45d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790507 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SimpleCatalystConf.scala --- @@ -40,6 +40,9 @@ case class SimpleCatalystConf( override val cboEnabled: Boolean = false, override val joinReorderEnabled: Boolean = false, override val joinReorderDPThreshold: Int = 12, +override val starSchemaDetection: Boolean = false, +override val starSchemaFTRatio: Double = 0.9, +override val ndvMaxError: Double = 0.05, --- End diff -- @cloud-fan Thank you. I will update the file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790475 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790425 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106790141 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1860,4 +1861,119 @@ class HiveDDLSuite } } } + + hiveFormats.foreach { tableType => +test(s"alter hive serde table add columns -- partitioned - $tableType") { + withTable("tab") { +sql( + s""" + |CREATE TABLE tab (c1 int, c2 int) + |PARTITIONED BY (c3 int) STORED AS $tableType + """.stripMargin) + +sql("INSERT INTO tab PARTITION (c3=1) VALUES (1, 2)") +sql("ALTER TABLE tab ADD COLUMNS (c4 int)") +checkAnswer( + sql("SELECT * FROM tab WHERE c3 = 1"), + Seq(Row(1, 2, null, 1)) +) +assert(sql("SELECT * FROM tab").schema --- End diff -- Maybe `spark.table("tab")` is shorter? To use `getTableMetadata`, I would need to go through `spark.sessionState.catalog.getTableMetadata`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
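For reference, a quick sketch of the two alternatives being weighed, assuming the same `tab` table created by the test above:

```Scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Shorter: go through the DataFrame API.
val schemaViaDataFrame = spark.table("tab").schema

// More verbose: go through the session catalog's table metadata.
val schemaViaCatalog =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("tab")).schema
```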
[GitHub] spark issue #17334: [SPARK-19998][Block Manager]BlockRDD block not found Exc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17334 **[Test build #3603 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3603/testReport)** for PR 17334 at commit [`98393f8`](https://github.com/apache/spark/commit/98393f8e346be0cd1cbe73b6861668e17446990d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17311: [SPARK-19970][SQL] Table owner should be USER instead of...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17311 Hi, @vanzin and @srowen. Could you review this PR (again) when you have some time? I feel guilty because this PR needs to be verified manually on kerberized clusters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17219 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17251 Hi, @cloud-fan. Is it possible to include this fix in Spark 2.1.1? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17219 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74786/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17219 **[Test build #74786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74786/testReport)** for PR 17219 at commit [`fd28ed7`](https://github.com/apache/spark/commit/fd28ed7c83896c310296238666cf68f0455b837b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16330 Had a minor comment on the test case. LGTM otherwise and waiting for Jenkins --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16330: [SPARK-18817][SPARKR][SQL] change derby log outpu...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16330#discussion_r106789593 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -2909,6 +2910,30 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column expect_equal(class(ldf3$col3), c("POSIXct", "POSIXt")) }) +compare_list <- function(list1, list2) { + # get testthat to show the diff by first making the 2 lists equal in length + expect_equal(length(list1), length(list2)) + l <- max(length(list1), length(list2)) + length(list1) <- l + length(list2) <- l + expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE)) +} + +# This should always be the last test in this test file. +test_that("No extra files are created in SPARK_HOME by starting session and making calls", { + # Check that it is not creating any extra file. + # Does not check the tempdir which would be cleaned up after. + filesAfter <- list.files(path = sparkRDir, all.files = TRUE) + + expect_true(length(sparkRFilesBefore) > 0) + # first, ensure derby.log is not there + expect_false("derby.log" %in% filesAfter) + # second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F --- End diff -- I'm a little confused about how these two setdiff commands map to the with- and without-Hive-support cases. Can we make this a bit easier to understand? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106788877 --- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala --- @@ -212,8 +212,8 @@ case object TaskResultLost extends TaskFailedReason { * Task was killed intentionally and needs to be rescheduled. */ @DeveloperApi -case object TaskKilled extends TaskFailedReason { - override def toErrorString: String = "TaskKilled (killed intentionally)" +case class TaskKilled(reason: String) extends TaskFailedReason { + override def toErrorString: String = s"TaskKilled ($reason)" --- End diff -- Since this was part of DeveloperApi, what is the impact of making this change? In JsonProtocol, in mocked user code? If it does introduce backward-incompatible changes, is there a way to mitigate them? Perhaps make TaskKilled a class with apply/unapply/serde/toString (essentially all that a case class provides) and a case object with an apply that defaults reason = null (and logs a deprecation warning when used)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
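A rough, self-contained sketch of that mitigation idea (the names and shape here are assumptions, not the actual Spark API): keep a `TaskKilled` value with a default reason so code that used the old parameterless object keeps working.

```Scala
trait TaskFailedReason { def toErrorString: String }

// Plain class instead of a case class, so the companion below can still act as the
// old singleton value while carrying an explicit kill reason.
class TaskKilled(val reason: String) extends TaskFailedReason with Serializable {
  override def toErrorString: String = s"TaskKilled ($reason)"
  override def equals(o: Any): Boolean = o match {
    case t: TaskKilled => t.reason == reason
    case _ => false
  }
  override def hashCode(): Int = reason.hashCode
  override def toString: String = toErrorString
}

// The companion doubles as the legacy `TaskKilled` object, with a default reason.
object TaskKilled extends TaskKilled("killed intentionally") {
  def apply(reason: String): TaskKilled = new TaskKilled(reason)
  def unapply(tk: TaskKilled): Option[String] = Some(tk.reason)
}
```

This is only one possible shape; JsonProtocol round-tripping would still need its own compatibility handling, as the comment above points out.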
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106789380 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -540,6 +540,39 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + // Launches one task that will run forever. Once the SparkListener detects the task has + // started, kill and re-schedule it. The second run of the task will complete immediately. + // If this test times out, then the first version of the task wasn't killed successfully. + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + +val listener = new SparkListener { + override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { +eventually(timeout(10.seconds)) { + assert(SparkContextSuite.isTaskStarted) +} +if (!SparkContextSuite.taskKilled) { + SparkContextSuite.taskKilled = true + sc.killTaskAttempt(taskStart.taskInfo.taskId, true, "first attempt will hang") +} + } --- End diff -- An `onTaskEnd` handler that validates the failure of the first task and the launch/success of the second attempt should be added here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
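A hedged sketch of what that `onTaskEnd` validation could look like, meant to slot into the test body (it assumes `TaskKilled` is the case class introduced by this PR; the flag names are illustrative, not the final test code):

```Scala
import org.apache.spark.{Success, TaskKilled}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

object KillTestState {
  @volatile var firstAttemptKilled = false
  @volatile var secondAttemptSucceeded = false
}

val endListener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskEnd.reason match {
    case _: TaskKilled if taskEnd.taskInfo.attemptNumber == 0 =>
      KillTestState.firstAttemptKilled = true
    case Success if taskEnd.taskInfo.attemptNumber > 0 =>
      KillTestState.secondAttemptSucceeded = true
    case _ =>
  }
}
// After the job finishes, the test would assert both flags:
//   assert(KillTestState.firstAttemptKilled && KillTestState.secondAttemptSucceeded)
```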
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106788727 --- Diff: core/src/main/scala/org/apache/spark/ui/UIUtils.scala --- @@ -354,7 +354,7 @@ private[spark] object UIUtils extends Logging { {completed}/{total} { if (failed > 0) s"($failed failed)" } { if (skipped > 0) s"($skipped skipped)" } -{ if (killed > 0) s"($killed killed)" } +{ reasonToNumKilled.map { case (reason, count) => s"($count killed: $reason)" } } --- End diff -- It would be good to sort this in some order before exposing it, either by descending count or alphabetically, so that updates do not reorder the entries arbitrarily. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
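A tiny sketch of one such deterministic ordering (descending count, then reason; the sample map below is made up for illustration):

```Scala
// Example data standing in for the real reasonToNumKilled map.
val reasonToNumKilled = Map(
  "killed via SparkContext.killTaskAttempt" -> 3,
  "another reason" -> 5)

// Sort by descending count, then by reason, so re-renders keep a stable order.
val ordered = reasonToNumKilled.toSeq.sortBy { case (reason, count) => (-count, reason) }
val rendered = ordered.map { case (reason, count) => s"($count killed: $reason)" }
// rendered: Seq("(5 killed: another reason)", "(3 killed: killed via SparkContext.killTaskAttempt)")
```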
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106789297 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -540,6 +540,39 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + // Launches one task that will run forever. Once the SparkListener detects the task has + // started, kill and re-schedule it. The second run of the task will complete immediately. + // If this test times out, then the first version of the task wasn't killed successfully. + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + +val listener = new SparkListener { + override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { +eventually(timeout(10.seconds)) { + assert(SparkContextSuite.isTaskStarted) +} +if (!SparkContextSuite.taskKilled) { + SparkContextSuite.taskKilled = true + sc.killTaskAttempt(taskStart.taskInfo.taskId, true, "first attempt will hang") +} + } +} +sc.addSparkListener(listener) +eventually(timeout(20.seconds)) { + sc.parallelize(1 to 1).foreach { x => +// first attempt will hang +if (!SparkContextSuite.isTaskStarted) { + SparkContextSuite.isTaskStarted = true + Thread.sleep(999) +} +// second attempt succeeds immediately + } --- End diff -- Nothing actually runs in this test currently; we need to force the RDD's computation (a count() or runJob here). Also, we need to validate that the second task actually ran (which is why the bug in the test was not detected). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106789004 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler]( taskState: TaskState, reason: TaskFailedReason): Unit = synchronized { taskSetManager.handleFailedTask(tid, taskState, reason) -if (!taskSetManager.isZombie && taskState != TaskState.KILLED) { +if (!taskSetManager.isZombie) { --- End diff -- @kayousterhout Are we making the change that killed tasks can/should be retried? If so, this is a behavior change, and we need to do the same in TSM.handleFailedTask(). This is what I mentioned w.r.t. killed tasks not resulting in task resubmission. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106789403 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106789422

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *     + Dimension contains a primary key that is being joined to the fact table.
+   *     + Fact table contains foreign keys referencing multiple dimension tables.
+   *  2. Cardinality based heuristics
+   *     + Usually, the table with the highest cardinality is the fact table.
+   *     + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
+   * the star join with the largest fact table. Choosing the largest fact table on the
+   * driving arm to avoid large inners is in general a good heuristic. This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if they are eligible
+   * for star join detection. An eligible plan is a base table access with valid statistics.
+   * A base table access represents Project or Filter operators above a LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer column uniqueness,
+   * the algorithm compares the number of distinct values with the total number of rows in the
+   * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+      input: Seq[LogicalPlan],
+      conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
--- End diff --

@cloud-fan Please see my comment above.
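For readers skimming the heuristic in the scaladoc above, a minimal standalone sketch of the "relative difference within ndvMaxError * 2" uniqueness test might look like the following. The helper name and the 0.05 default error are illustrative assumptions, not the PR's code (the real rule reads ndvMaxError from CatalystConf and column statistics):

```scala
// Treat a column as (nearly) unique when its distinct-value count is close
// enough to the table's row count.
def isLikelyUnique(distinctCount: BigInt, rowCount: BigInt, ndvMaxError: Double = 0.05): Boolean = {
  if (rowCount <= 0) {
    false
  } else {
    // Relative difference between the distinct-value count and the row count.
    val relDiff = math.abs(distinctCount.toDouble / rowCount.toDouble - 1.0d)
    relDiff <= ndvMaxError * 2
  }
}

// A 1000-row dimension whose key column has ~998 distinct values is considered unique:
// isLikelyUnique(BigInt(998), BigInt(1000))   // true
// isLikelyUnique(BigInt(120), BigInt(1000))   // false
```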
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE][WIP] spark-submit should han...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16596

Ok, sounds good. Since this touches spark-submit scripts that are shared across all languages, it would be good to loop in some other reviewers as well. I can do that once we have the new diff.
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106789103

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -83,9 +411,19 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper {
   }

   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case j @ ExtractFiltersAndInnerJoins(input, conditions)
+    case ExtractFiltersAndInnerJoins(input, conditions)
         if input.size > 2 && conditions.nonEmpty =>
-      createOrderedJoin(input, conditions)
+      if (conf.starSchemaDetection && !conf.cboEnabled) {
--- End diff --

@cloud-fan It doesn't conflict with CBO, but the guard is there just to be safe.
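To make the control flow behind that guard concrete, here is a self-contained sketch with plain strings standing in for logical plans; the function signature is illustrative only, not the rule's actual code:

```scala
// Star-join ordering is attempted only when detection is enabled and CBO is off;
// when no star join is found (or the guard is off) the input order is left to the
// existing size-based heuristic.
def orderJoins(
    starSchemaDetection: Boolean,
    cboEnabled: Boolean,
    input: Seq[String],
    findStarJoinPlan: Seq[String] => Seq[String]): Seq[String] = {
  if (starSchemaDetection && !cboEnabled) {
    val star = findStarJoinPlan(input)
    if (star.nonEmpty) star ++ input.filterNot(star.contains) else input
  } else {
    input
  }
}

// orderJoins(starSchemaDetection = true, cboEnabled = false,
//   Seq("dim1", "fact", "dim2"), _ => Seq("fact", "dim1", "dim2"))
// => List(fact, dim1, dim2)
```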
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342

**[Test build #74790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74790/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca).
[GitHub] spark pull request #17342: [SPARK-18910][SPARK-12868] Allow adding jars from...
GitHub user weiqingy opened a pull request: https://github.com/apache/spark/pull/17342

[SPARK-18910][SPARK-12868] Allow adding jars from hdfs

## What changes were proposed in this pull request?

Spark 2.2 is going to be cut; it would be great if SPARK-12868 could be resolved before that. There have been several PRs for this, such as [PR#16324](https://github.com/apache/spark/pull/16324), but all of them have been inactive for a long time or have been closed. This PR adds a SparkUrlStreamHandlerFactory, which relies on the protocol to choose the appropriate URLStreamHandlerFactory (such as FsUrlStreamHandlerFactory) to create the URLStreamHandler.

## How was this patch tested?

1. Added a new unit test.
2. Checked manually.

Before: throws an exception with "failed unknown protocol: hdfs" (screenshot: https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png)

After: (screenshot: https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/weiqingy/spark SPARK-18910

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17342.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17342

commit 04556c9f2f4feb53e3f644d795a38de4a4e919ca
Author: Weiqing Yang
Date: 2017-03-18T18:55:28Z

    [SPARK-18910] Allow adding jars from hdfs
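The key idea, a protocol-dispatching URLStreamHandlerFactory, could look roughly like the sketch below. The class name and the set of handled schemes are assumptions for illustration, not the PR's exact implementation:

```scala
import java.net.{URL, URLStreamHandler, URLStreamHandlerFactory}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// Delegate hdfs-like schemes to Hadoop's FsUrlStreamHandlerFactory and let the
// JVM fall back to its built-in handlers (http, file, jar, ...) by returning null.
class ProtocolDispatchingFactory extends URLStreamHandlerFactory {
  private val fsFactory = new FsUrlStreamHandlerFactory(new Configuration())

  override def createURLStreamHandler(protocol: String): URLStreamHandler = protocol match {
    case "hdfs" | "webhdfs" => fsFactory.createURLStreamHandler(protocol)
    case _ => null // null means "use the JVM's default handler for this protocol"
  }
}

// Note: URL.setURLStreamHandlerFactory may be called at most once per JVM,
// so a real implementation has to guard against double registration.
// URL.setURLStreamHandlerFactory(new ProtocolDispatchingFactory())
```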
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106788630

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *     + Dimension contains a primary key that is being joined to the fact table.
+   *     + Fact table contains foreign keys referencing multiple dimension tables.
+   *  2. Cardinality based heuristics
+   *     + Usually, the table with the highest cardinality is the fact table.
+   *     + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
+   * the star join with the largest fact table. Choosing the largest fact table on the
+   * driving arm to avoid large inners is in general a good heuristic. This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if they are eligible
+   * for star join detection. An eligible plan is a base table access with valid statistics.
+   * A base table access represents Project or Filter operators above a LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer column uniqueness,
+   * the algorithm compares the number of distinct values with the total number of rows in the
+   * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+      input: Seq[LogicalPlan],
+      conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+    val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+    if (!conf.starSchemaDetection || input.size < 2) {
+      emptyStarJoinPlan
+    } else {
+      // Find if the input plans are eligible for star join detection.
+      // An eligible plan is a base table access with valid statistics.
+      val foundEligibleJoin = input.forall {
--- End diff --

@cloud-fan Star-join detection works with and without CBO. Project/Filter cardinality is computed only if CBO is enabled. Here, I am doing a simple validation to ensure statistics have been collected for the base relations.
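The quoted diff cuts off right where the fact-table candidate is chosen. As a rough standalone illustration of that cardinality heuristic (the names and row counts below are made up for the example, not Spark APIs):

```scala
// Pick the plan with the largest row count as the fact-table candidate.
case class TableCard(name: String, rowCount: BigInt)

def pickFactTable(tables: Seq[TableCard]): Option[TableCard] =
  if (tables.size < 2) None else Some(tables.maxBy(_.rowCount))

// pickFactTable(Seq(
//   TableCard("date_dim", BigInt(73049)),
//   TableCard("store_sales", BigInt("2879987999")),
//   TableCard("item", BigInt(300000))))
// => Some(TableCard("store_sales", 2879987999))
```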
[GitHub] spark issue #17334: [SPARK-19998][Block Manager]BlockRDD block not found Exc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17334

**[Test build #3602 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3602/testReport)** for PR 17334 at commit [`98393f8`](https://github.com/apache/spark/commit/98393f8e346be0cd1cbe73b6861668e17446990d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17290: [SPARK-16599][CORE] java.util.NoSuchElementException: No...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17290

That's fine, I can add a warning. I don't know if it is a bug situation, but it sure could be.
[GitHub] spark pull request #17290: [SPARK-16599][CORE] java.util.NoSuchElementExcept...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17290#discussion_r106788142

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala ---
@@ -340,7 +340,7 @@ private[storage] class BlockInfoManager extends Logging {
     val blocksWithReleasedLocks = mutable.ArrayBuffer[BlockId]()
     val readLocks = synchronized {
-      readLocksByTask.remove(taskAttemptId).get
+      readLocksByTask.remove(taskAttemptId).getOrElse(ImmutableMultiset.of[BlockId]())
--- End diff --

Missed this PR earlier, sorry about that. Since this is an unexpected scenario (it really must not happen according to the current code!), we should probably log a warning message with the stack trace to help debugging. Otherwise it silently ignores a fairly bad bug.
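A standalone sketch of the suggested handling, falling back to an empty result but leaving a warning with a stack trace in the logs, might look like this. The object and method names are simplified stand-ins for the BlockInfoManager internals, not the actual patch:

```scala
import scala.collection.mutable

import org.slf4j.LoggerFactory

object ReadLockRelease {
  private val log = LoggerFactory.getLogger(getClass)

  // Remove and return the task's read locks; if the entry is unexpectedly
  // missing, log a warning with a stack trace instead of failing silently.
  def releaseAll[B](readLocksByTask: mutable.Map[Long, Seq[B]], taskAttemptId: Long): Seq[B] =
    readLocksByTask.remove(taskAttemptId) match {
      case Some(locks) => locks
      case None =>
        log.warn(s"Task $taskAttemptId had no registered read locks to release; " +
          "this should not happen.", new Exception("missing read-lock entry"))
        Seq.empty[B]
    }
}
```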
[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/17286

Right, I misread it. If there is no join predicate between a table and any cluster of tables, we should not consider that table in the join enumeration at all. We can simply push that table to be the last join.
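As a tiny illustration of that strategy (the table names and the JoinCond shape are made up for the example, not the PR's code):

```scala
// Split the joined tables into those reachable through at least one join
// predicate and those that are completely disconnected; the disconnected ones
// can simply be appended at the end of the join order (as cross joins).
case class JoinCond(left: String, right: String)

def splitDisconnected(tables: Seq[String], conds: Seq[JoinCond]): (Seq[String], Seq[String]) = {
  val connected = conds.flatMap(c => Seq(c.left, c.right)).toSet
  tables.partition(connected.contains)
}

// splitDisconnected(Seq("fact", "dim1", "lonely"), Seq(JoinCond("fact", "dim1")))
// => (List(fact, dim1), List(lonely))   -- "lonely" is joined last
```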
[GitHub] spark issue #16982: [SPARK-19654][SPARKR][SS] Structured Streaming API for R
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16982

Merged build finished. Test PASSed.