[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795772

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
--- End diff --

@gatorsmile OK.
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795763

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -61,6 +63,37 @@ abstract class SubqueryExpression(
   }
 }

+/**
+ * This expression is used to represent any form of subquery expression namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient expression
+ * that only lives in the scope of sameResult function call. In other words, analyzer,
+ * optimizer or planner never sees this expression type during transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends UnaryExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def child: Expression = expr
+  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include the
+  // sub query plan. Doing so causes issue when we canonicalize expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+    val h = Objects.hashCode(expr.children)
+    h * 31 + Objects.hashCode(this.getClass.getName)
+  }
+
+  override def equals(o: Any): Boolean = o match {
+    case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
+    case other => false
--- End diff --

@gatorsmile Will change. Thanks.
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user dilipbiswal commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795750

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
+   * plans which has subquery expressions.
+   */
+  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
+    // Normalize the outer references in the subquery plan.
+    val subPlan = e.plan.transformAllExpressions {
+      case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
--- End diff --

@gatorsmile Will change. Thanks.
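For readers following the thread, a rough sketch of how the canonicalization above is meant to be used before a `sameResult` comparison. The call site, the names `plan` and `plan.output`, and the wrapping rule are assumptions made purely for illustration; they are not taken from the PR:

```Scala
// Illustrative sketch, not the PR's actual code: wrap each subquery expression in its
// canonical form so that outer references get bound against `plan.output` and
// expression-ID differences between two otherwise identical plans no longer matter.
val normalized = plan.transformAllExpressions {
  case s: SubqueryExpression => SubqueryExpression.canonicalize(s, plan.output)
}
// `normalized` can then be compared with another canonicalized plan via sameResult.
```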
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74802/testReport)** for PR 17343 at commit [`00da825`](https://github.com/apache/spark/commit/00da8254d060291fe6f2fdec3e30b2f30d5a69c8).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74793/ Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Merged build finished. Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74793/testReport)** for PR 16971 at commit [`ed6dacd`](https://github.com/apache/spark/commit/ed6dacdb3e3bdfd4e9ccb5c57bf8b4118636b0c6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17191 We have the same limitation. To do it in MySQL and Postgres, you need to use quotes/backticks.
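To make the feature under discussion concrete, this is the kind of statement SPARK-14471 targets: an alias defined in the SELECT list and reused in GROUP BY. The table and column names below are invented for illustration and are not taken from the PR; `spark` is assumed to be a SparkSession:

```Scala
// Hypothetical query: `prefix` is a SELECT-list alias, not a base column.
spark.sql(
  """
    |SELECT substr(name, 1, 3) AS prefix, count(*) AS cnt
    |FROM people
    |GROUP BY prefix
  """.stripMargin)
```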
[GitHub] spark issue #17330: [SPARK-19993][SQL] Caching logical plans containing subq...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17330 Generally, it looks good to me. cc @hvanhovell @rxin @cloud-fan
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795294

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ---
@@ -655,6 +663,148 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
     }
   }

+  test("SPARK-19993 subquery caching") {
+    withTempView("t1", "t2") {
+      Seq(1).toDF("c1").createOrReplaceTempView("t1")
+      Seq(1).toDF("c1").createOrReplaceTempView("t2")
+
+      val ds1 =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(ds1) == 0)
+
+      ds1.cache()
+
+      val cachedDs =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(cachedDs) == 1)
+
+      // Additional predicate in the subquery plan should cause a cache miss
+      val cachedMissDs =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |NOT EXISTS (SELECT * FROM t1 where c1 = 0)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(cachedMissDs) == 0)
+
+      // Simple correlated predicate in subquery
+      val ds2 =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+          """.stripMargin)
+      assert(getNumInMemoryRelations(ds2) == 0)
+
+      ds2.cache()
+
+      val cachedDs2 =
+        sql(
+          """
+            |SELECT * FROM t1
+            |WHERE
+            |t1.c1 in (SELECT t2.c1 FROM t2 where t1.c1 = t2.c1)
+          """.stripMargin)
+
+      assert(getNumInMemoryRelations(cachedDs2) == 1)
+
+      spark.catalog.cacheTable("t1")
+      ds1.unpersist()
--- End diff --

How about splitting the test case into multiple individual ones? The cache will be cleaned for each test case. This can avoid any extra checking, like `assert(getNumInMemoryRelations(cachedMissDs) == 0)` or `ds1.unpersist()`:

```Scala
override def afterEach(): Unit = {
  try {
    spark.catalog.clearCache()
  } finally {
    super.afterEach()
  }
}
```
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74792/ Test PASSed.
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Merged build finished. Test PASSed.
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342 **[Test build #74792 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74792/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795228

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
--- End diff --

Also replace `SubqueryExpression` by `CanonicalizedSubqueryExpr`
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795140

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -61,6 +63,37 @@ abstract class SubqueryExpression(
   }
 }

+/**
+ * This expression is used to represent any form of subquery expression namely
+ * ListQuery, Exists and ScalarSubquery. This is only used to make sure the
+ * expression equality works properly when LogicalPlan.sameResult is called
+ * on plans containing SubqueryExpression(s). This is only a transient expression
+ * that only lives in the scope of sameResult function call. In other words, analyzer,
+ * optimizer or planner never sees this expression type during transformation of
+ * plans.
+ */
+case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
+  extends UnaryExpression with Unevaluable {
+  override def dataType: DataType = expr.dataType
+  override def nullable: Boolean = expr.nullable
+  override def child: Expression = expr
+  override def toString: String = s"CanonicalizedSubqueryExpr(${expr.toString})"
+
+  // Hashcode is generated conservatively for now i.e it does not include the
+  // sub query plan. Doing so causes issue when we canonicalize expressions to
+  // re-order them based on hashcode.
+  // TODO : improve the hashcode generation by considering the plan info.
+  override def hashCode(): Int = {
+    val h = Objects.hashCode(expr.children)
+    h * 31 + Objects.hashCode(this.getClass.getName)
+  }
+
+  override def equals(o: Any): Boolean = o match {
+    case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
+    case other => false
--- End diff --

```Scala
case CanonicalizedSubqueryExpr(e) => expr.semanticEquals(e)
case _ => false
```
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795081

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
--- End diff --

Nit: `QueryPlan.sameResult`
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17170 Could you update this PR to have the parameter itemsCol and remove predictionCol? (If I recall, we don't expose that in the R API for other models either.)
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/17330#discussion_r106795011

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala ---
@@ -83,6 +116,19 @@ object SubqueryExpression {
       case _ => false
     }.isDefined
   }
+
+  /**
+   * Clean the outer references by normalizing them to BindReference in the same way
+   * we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
+   * plans which has subquery expressions.
+   */
+  def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
+    // Normalize the outer references in the subquery plan.
+    val subPlan = e.plan.transformAllExpressions {
+      case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
--- End diff --

Nit: `case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)`
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74800 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74800/testReport)** for PR 17343 at commit [`1834db6`](https://github.com/apache/spark/commit/1834db60b7f504862f6ef03bc828264c65bdabd3).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16971 LGTM pending Jenkins. cc @thunterdb @MLnick
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74801 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74801/testReport)** for PR 16596 at commit [`e33b50a`](https://github.com/apache/spark/commit/e33b50aae78c79a425ab1e935498919eb0350c97).
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17343 cc - @rxin, @squito, @zsxwing
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74799/testReport)** for PR 16596 at commit [`1821e21`](https://github.com/apache/spark/commit/1821e21483904cf2890e9c7ba420d72a20623a74).
[GitHub] spark issue #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream method
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17343 **[Test build #74798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74798/testReport)** for PR 17343 at commit [`e9ac76e`](https://github.com/apache/spark/commit/e9ac76edb055d08699d9de7a5ff77b7ca8a7f5c6).
[GitHub] spark pull request #17343: [SPARK-20014] Optimize mergeSpillsWithFileStream ...
GitHub user sitalkedia opened a pull request:
https://github.com/apache/spark/pull/17343

[SPARK-20014] Optimize mergeSpillsWithFileStream method

## What changes were proposed in this pull request?

When the individual partition size in a spill is small, the mergeSpillsWithTransferTo method does many small disk IOs, which is really inefficient. One way to improve performance is to use the mergeSpillsWithFileStream method instead, by turning off transferTo and using buffered file read/write to improve the IO throughput. However, the current implementation of mergeSpillsWithFileStream does not do a buffered read/write of the files, and in addition it unnecessarily flushes the output files for each partition.

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark upstream_mergeSpillsWithFileStream

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17343.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17343

commit e9ac76edb055d08699d9de7a5ff77b7ca8a7f5c6
Author: Sital Kedia
Date: 2017-03-19T00:24:10Z

    [SPARK-20014] Optimize mergeSpillsWithFileStream method
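As a rough illustration of the buffering idea in the summary above (wrap the raw spill-file streams in buffered streams and flush the output only once, rather than doing many tiny reads/writes and a flush per partition), here is a self-contained sketch. It is not the UnsafeShuffleWriter code; the helper name, buffer sizes, and the omission of compression/encryption wrappers are simplifying assumptions:

```Scala
import java.io._

// Copies a set of spill files into one output file using buffered streams.
// Buffered IO turns many small per-partition writes into large sequential ones;
// the output is flushed and closed once at the end, not once per partition.
def mergeBuffered(spills: Seq[File], out: File, bufferSize: Int = 1 << 20): Unit = {
  val output = new BufferedOutputStream(new FileOutputStream(out), bufferSize)
  try {
    val buf = new Array[Byte](8192)
    spills.foreach { spill =>
      val input = new BufferedInputStream(new FileInputStream(spill), bufferSize)
      try {
        var n = input.read(buf)
        while (n != -1) {
          output.write(buf, 0, n)
          n = input.read(buf)
        }
      } finally {
        input.close()
      }
    }
  } finally {
    output.close() // single flush + close for the whole merged file
  }
}
```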
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74797/testReport)** for PR 16596 at commit [`cc44ae5`](https://github.com/apache/spark/commit/cc44ae577a972c26623e26d349aa2990d33b5b28).
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74796/testReport)** for PR 16596 at commit [`61d6ba6`](https://github.com/apache/spark/commit/61d6ba64774b4c65a4c05f69e1a97f4f978464db).
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16596 Updated. Pretty sure this is an issue on Windows only.
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE] On Windows spark-submit shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16596 **[Test build #74795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74795/testReport)** for PR 16596 at commit [`f37c891`](https://github.com/apache/spark/commit/f37c891a8f38c244c8be7c452581778d1e2e180f).
[GitHub] spark issue #17341: [SPARK-20013][SQL]add a newTablePath parameter for renam...
Github user windpiger commented on the issue: https://github.com/apache/spark/pull/17341 cc @cloud-fan @gatorsmile
[GitHub] spark issue #17338: [SPARK-19990][SQL][test-maven]create a temp file for fil...
Github user windpiger commented on the issue: https://github.com/apache/spark/pull/17338 Yes, it is. The `jar:file://` URI will be re-resolved by `new Path` later, and will throw the exception described in the JIRA.
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16330 **[Test build #74794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74794/testReport)** for PR 16330 at commit [`ac9fbfc`](https://github.com/apache/spark/commit/ac9fbfc5d511877f7775c620ff8e1c672880ee50).
[GitHub] spark pull request #16330: [SPARK-18817][SPARKR][SQL] change derby log outpu...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/16330#discussion_r106794245

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -2909,6 +2910,30 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column
   expect_equal(class(ldf3$col3), c("POSIXct", "POSIXt"))
 })

+compare_list <- function(list1, list2) {
+  # get testthat to show the diff by first making the 2 lists equal in length
+  expect_equal(length(list1), length(list2))
+  l <- max(length(list1), length(list2))
+  length(list1) <- l
+  length(list2) <- l
+  expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE))
+}
+
+# This should always be the last test in this test file.
+test_that("No extra files are created in SPARK_HOME by starting session and making calls", {
+  # Check that it is not creating any extra file.
+  # Does not check the tempdir which would be cleaned up after.
+  filesAfter <- list.files(path = sparkRDir, all.files = TRUE)
+
+  expect_true(length(sparkRFilesBefore) > 0)
+  # first, ensure derby.log is not there
+  expect_false("derby.log" %in% filesAfter)
+  # second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F
--- End diff --

Agreed. Updated, hope it's better now.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106793811

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
@@ -425,8 +425,8 @@ object FunctionRegistry {
     expression[BitwiseXor]("^"),

     // json
-    expression[StructToJson]("to_json"),
-    expression[JsonToStruct]("from_json"),
+    expression[StructsToJson]("to_json"),
+    expression[JsonToStructs]("from_json"),
--- End diff --

+ @maropu @gatorsmile
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74793/testReport)** for PR 16971 at commit [`ed6dacd`](https://github.com/apache/spark/commit/ed6dacdb3e3bdfd4e9ccb5c57bf8b4118636b0c6).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16971 retest this please
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106794013

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()

   @transient
-  lazy val gen =
-    new JacksonGenerator(
-      child.dataType.asInstanceOf[StructType],
-      writer,
-      new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+    rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts rows to the JSON output according to the given schema.
+  @transient
+  lazy val converter: Any => UTF8String = {
+    def getAndReset(): UTF8String = {
+      gen.flush()
+      val json = writer.toString
+      writer.reset()
+      UTF8String.fromString(json)
+    }
+
+    child.dataType match {
+      case _: StructType =>
+        (row: Any) =>
+          gen.write(row.asInstanceOf[InternalRow])
+          getAndReset()
+      case ArrayType(_: StructType, _) =>
+        (arr: Any) =>
+          gen.write(arr.asInstanceOf[ArrayData])
+          getAndReset()
+    }
+  }

   override def dataType: DataType = StringType

-  override def checkInputDataTypes(): TypeCheckResult = {
-    if (StructType.acceptsType(child.dataType)) {
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+    case _: StructType | ArrayType(_: StructType, _) =>
--- End diff --

This seems to become a bit more strict than before? Was: `StructType.acceptsType` - anything can be accepted as struct; now: `match { case _: StructType` - must be struct. Am I understanding this correctly?
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r106794032

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,340 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{BaseTableAccess, ExtractFiltersAndInnerJoins}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class DetectStarSchemaJoin(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *   1. Informational RI constraints (reliable detection)
+   *      + Dimension contains a primary key that is being joined to the fact table.
+   *      + Fact table contains foreign keys referencing multiple dimension tables.
+   *   2. Cardinality based heuristics
+   *      + Usually, the table with the highest cardinality is the fact table.
+   *      + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
--- End diff --

@cloud-fan Yes, currently I am only generating the star join for the largest fact table. I can generalize the algorithm once we use CBO.
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106793959

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -624,41 +627,58 @@ case class StructToJson(
   lazy val writer = new CharArrayWriter()

   @transient
-  lazy val gen =
-    new JacksonGenerator(
-      child.dataType.asInstanceOf[StructType],
-      writer,
-      new JSONOptions(options, timeZoneId.get))
+  lazy val gen = new JacksonGenerator(
+    rowSchema, writer, new JSONOptions(options, timeZoneId.get))
+
+  @transient
+  lazy val rowSchema = child.dataType match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
--- End diff --

could we end up with `ArrayType(StructType, something_not_struct)` in this match?
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/17192#discussion_r106793800

--- Diff: python/pyspark/sql/functions.py ---
@@ -1774,10 +1774,11 @@ def json_tuple(col, *fields):
 def from_json(col, schema, options={}):
     """
     Parses a column containing a JSON string into a [[StructType]] or [[ArrayType]]
-    with the specified schema. Returns `null`, in the case of an unparseable string.
+    of [[StructType]]s with the specified schema. Returns `null`, in the case of an unparseable
+    string.
--- End diff --

if we are updating this in python, we should update R too?
https://github.com/HyukjinKwon/spark/blob/185ea6003d60feed20c56de61c17bc304663d99a/R/pkg/R/functions.R#L2440
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r106793947

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/StarJoinSuite.scala ---
@@ -0,0 +1,488 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.sql.{Row, _}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+
+class StarJoinSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
--- End diff --

Yes! This is a good idea. It might take time to rewrite the whole thing. How about doing it as a follow-up PR?
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request:
https://github.com/apache/spark/pull/15363#discussion_r106793898

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *   1. Informational RI constraints (reliable detection)
+   *      + Dimension contains a primary key that is being joined to the fact table.
+   *      + Fact table contains foreign keys referencing multiple dimension tables.
+   *   2. Cardinality based heuristics
+   *      + Usually, the table with the highest cardinality is the fact table.
+   *      + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
+   * the star join with the largest fact table. Choosing the largest fact table on the
+   * driving arm to avoid large inners is in general a good heuristic. This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if they are eligible
+   * for star join detection. An eligible plan is a base table access with valid statistics.
+   * A base table access represents Project or Filter operators above a LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer column uniqueness,
+   * the algorithm compares the number of distinct values with the total number of rows in the
+   * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+      input: Seq[LogicalPlan],
+      conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+    val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+    if (!conf.starSchemaDetection || input.size < 2) {
+      emptyStarJoinPlan
+    } else {
+      // Find if the input plans are eligible for star join detection.
+      // An eligible plan is a base table access with valid statistics.
+      val foundEligibleJoin = input.forall {
+        case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true
+        case _ => false
+      }
+
+      if (!foundEligibleJoin) {
+        // Some plans don't have stats or are complex plans. Conservatively,
+        // return an empty star join. This restriction can be lifted
+        // once statistics are propagated in the plan.
+        emptyStarJoinPlan
+      } else {
+        // Find the fact table using cardinality based heuristics i.e.
+        // the table with the largest number of rows.
+        val sortedFactTables = input.map { plan =>
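To restate the heuristics described in the doc comment above in runnable form, here is a small, self-contained sketch: pick the plan with the largest row count as the fact-table candidate, and treat a join column as a primary-key candidate when its distinct-value count is close to the table's row count. The types, names, and exact error bound below are simplifying assumptions, not the PR's actual implementation:

```Scala
// Illustrative only: simplified stand-ins for the statistics Catalyst would provide.
case class TableStats(name: String, rowCount: BigInt, keyDistinctCount: BigInt)

// Cardinality heuristic: the table with the largest number of rows is the fact-table candidate.
def chooseFactTable(tables: Seq[TableStats]): Option[TableStats] =
  if (tables.size < 2) None else Some(tables.maxBy(_.rowCount))

// A join column is treated as (nearly) unique, i.e. a primary-key candidate, when its
// distinct-value count is within a small relative error of the table's row count.
def isPrimaryKeyCandidate(t: TableStats, ndvMaxError: Double): Boolean = {
  if (t.rowCount == BigInt(0)) {
    false
  } else {
    val relDiff = (t.keyDistinctCount - t.rowCount).abs.toDouble / t.rowCount.toDouble
    relDiff <= ndvMaxError * 2
  }
}
```

Dimension tables would then be the plans that join to the fact-table candidate on such a key column, mirroring the RI-based step in the description.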
[GitHub] spark pull request #16982: [SPARK-19654][SPARKR][SS] Structured Streaming AP...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16982
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74791/ Test PASSed.
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16626 Merged build finished. Test PASSed.
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16626 **[Test build #74791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74791/testReport)** for PR 16626 at commit [`a28fc42`](https://github.com/apache/spark/commit/a28fc42cee6dc52516e074ced7c4351ee6baa45d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17250: [SPARK-19911][STREAMING] Add builder interface for Kines...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/17250 @budde Do you think you can update this PR? The 2.2 branch will be cut on Monday (2017-03-18).
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16626 Adding a check in the existing test case to see if `HIVE_TYPE_STRING` is correctly populated in the metadata. LGTM except a few minor comments. cc @cloud-fan
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/16626#discussion_r106792988

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -2178,4 +2177,136 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils {
       }
     }
   }
+
+  Seq("parquet", "json", "csv").foreach { provider =>
--- End diff --

Let us introduce another variable.

```Scala
val supportedNativeFileFormatsForAlterTableAddColumns = Seq("parquet", "json", "csv")
```
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792941 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -175,6 +178,74 @@ case class AlterTableRenameCommand( } /** + * A command that add columns to a table + * The syntax of using this command in SQL is: + * {{{ + * ALTER TABLE table_identifier + * ADD COLUMNS (col_name data_type [COMMENT col_comment], ...); + * }}} +*/ +case class AlterTableAddColumnsCommand( +table: TableIdentifier, +columns: Seq[StructField]) extends RunnableCommand { + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +val catalogTable = verifyAlterTableAddColumn(catalog, table) + +try { + sparkSession.catalog.uncacheTable(table.quotedString) +} catch { + case NonFatal(e) => +log.warn(s"Exception when attempting to uncache table ${table.quotedString}", e) +} +catalog.refreshTable(table) +catalog.alterTableSchema( + table, catalogTable.schema.copy(fields = catalogTable.schema.fields ++ columns)) + +Seq.empty[Row] + } + + /** + * ALTER TABLE ADD COLUMNS command does not support temporary view/table, + * view, or datasource table with text, orc formats or external provider. + * For datasource table, it currently only supports parquet, json, csv. + */ + private def verifyAlterTableAddColumn( + catalog: SessionCatalog, + table: TableIdentifier): CatalogTable = { +val catalogTable = catalog.getTempViewOrPermanentTableMetadata(table) + +if (catalogTable.tableType == CatalogTableType.VIEW) { + throw new AnalysisException( +s""" + |ALTER ADD COLUMNS does not support views. + |You must drop and re-create the views for adding the new columns. Views: $table + """.stripMargin) +} + +if (DDLUtils.isDatasourceTable(catalogTable)) { + DataSource.lookupDataSource(catalogTable.provider.get).newInstance() match { +// For datasource table, this command can only support the following File format. +// TextFileFormat only default to one column "value" +// OrcFileFormat can not handle difference between user-specified schema and +// inferred schema yet. TODO, once this issue is resolved , we can add Orc back. +// Hive type is already considered as hive serde table, so the logic will not +// come in here. +case _: JsonFileFormat | _: CSVFileFormat | _: ParquetFileFormat => +case s => + throw new AnalysisException( +s""" + |ALTER ADD COLUMNS does not support datasource table with type $s. + |You must drop and re-create the table for adding the new columns. Tables: $table + """.stripMargin) --- End diff -- Nit: fix the format --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
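To make the supported/unsupported split in the quoted `verifyAlterTableAddColumn` concrete, here is a hedged end-to-end sketch (illustrative only; it assumes an active SparkSession named `spark`):

```Scala
import org.apache.spark.sql.AnalysisException

// Native file-format datasource tables (parquet/json/csv) accept ADD COLUMNS ...
spark.sql("CREATE TABLE t_parquet (c1 INT) USING parquet")
spark.sql("ALTER TABLE t_parquet ADD COLUMNS (c2 INT COMMENT 'added later')")

// ... while a text-based datasource table is rejected, since TextFileFormat is not
// in the supported list matched above.
spark.sql("CREATE TABLE t_text (value STRING) USING text")
try {
  spark.sql("ALTER TABLE t_text ADD COLUMNS (c2 INT)")
} catch {
  case e: AnalysisException => println(s"rejected as expected: ${e.getMessage}")
}
```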
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792952 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -165,7 +165,6 @@ class InMemoryCatalogedDDLSuite extends DDLSuite with SharedSQLContext with Befo assert(e.contains("Hive support is required to CREATE Hive TABLE (AS SELECT)")) } } - --- End diff -- revert it back --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792918 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -450,6 +451,21 @@ abstract class SessionCatalogSuite extends PlanTest { } } + test("alter table add columns") { --- End diff -- Also add a negative test case for dropping columns, although we do not support it now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
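A rough sketch of such a negative test (it assumes that dropping columns via `alterTableSchema` is rejected with an `AnalysisException`; the exception type is an assumption, and pinning that behaviour down is exactly what the test would do):

```Scala
// Hypothetical negative test: a schema that drops an existing column is rejected.
test("alter table drop columns is not supported") {
  withBasicCatalog { catalog =>
    catalog.createTable(newTable("t1", "default"), ignoreIfExists = false)
    val oldTab = catalog.externalCatalog.getTable("default", "t1")
    // Build a schema that is missing the first column of the original table.
    val droppedSchema = StructType(oldTab.schema.fields.drop(1))
    intercept[AnalysisException] {
      catalog.alterTableSchema(TableIdentifier("t1", Some("default")), droppedSchema)
    }
  }
}
```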
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106792896 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -450,6 +451,21 @@ abstract class SessionCatalogSuite extends PlanTest { } } + test("alter table add columns") { +withBasicCatalog { sessionCatalog => + sessionCatalog.createTable(newTable("t1", "default"), ignoreIfExists = false) + val oldTab = sessionCatalog.externalCatalog.getTable("default", "t1") + sessionCatalog.alterTableSchema( +TableIdentifier("t1", Some("default")), oldTab.schema.add("c3", IntegerType)) + + val newTab = sessionCatalog.externalCatalog.getTable("default", "t1") + // construct the expected table schema + val oldTabSchema = StructType(oldTab.dataSchema.fields ++ --- End diff -- `oldTabSchema ` -> `expectedTabSchema ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342 **[Test build #74792 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74792/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/17342 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/17342 `org.apache.spark.storage.BlockManagerProactiveReplicationSuite.proactive block replication - 3 replicas - 2 block manager deletions` failed, but it passed locally. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/17297 >> I don't think it's true that it relaunches all tasks that hadn't completed when the fetch failure occurred. It relaunches all the tasks that haven't completed by the time the stage gets resubmitted; more tasks can complete between the time of the first failure and the time the stage is resubmitted. Actually, I realized that this is not true. If you look at the code (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1419), when a stage fails because of a fetch failure, we remove the stage from the output committer. So any task that completes between the first fetch failure and the time the stage is resubmitted will be denied permission to commit its output, and the scheduler therefore re-launches all tasks in the stage with the fetch failure that hadn't completed when the fetch failure occurred. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
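A simplified toy model of the behaviour being described here (this is not Spark's actual `OutputCommitCoordinator` API, just the "deny late commits once the stage's state is cleared" idea):

```Scala
import scala.collection.mutable

// Toy commit coordinator: once a stage's state is cleared (as happens when the stage
// fails with a fetch failure), any late commit request for that stage is denied, so
// those tasks have to be re-run as part of the resubmitted stage.
class ToyCommitCoordinator {
  // stage -> (partition -> attempt that is authorized to commit)
  private val authorized = mutable.Map.empty[Int, mutable.Map[Int, Int]]

  def stageStart(stage: Int): Unit = { authorized(stage) = mutable.Map.empty }
  def stageEnd(stage: Int): Unit = { authorized.remove(stage) } // e.g. on fetch failure

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean =
    authorized.get(stage) match {
      case None => false // stage state cleared -> late commits are denied
      case Some(parts) => parts.get(partition) match {
        case Some(winner) => winner == attempt // only the first authorized attempt wins
        case None => parts(partition) = attempt; true
      }
    }
}
```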
[GitHub] spark issue #17334: [SPARK-19998][Block Manager]BlockRDD block not found Exc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17334 **[Test build #3603 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3603/testReport)** for PR 17334 at commit [`98393f8`](https://github.com/apache/spark/commit/98393f8e346be0cd1cbe73b6861668e17446990d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74790/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17342 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342 **[Test build #74790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74790/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106791067 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/StarJoinSuite.scala --- @@ -0,0 +1,488 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.{Row, _} +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SQLTestUtils + + +class StarJoinSuite extends QueryTest with SQLTestUtils with TestHiveSingleton { --- End diff -- @cloud-fan I will look at ```JoinReorderSuite```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790932 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16330 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16330 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74788/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16330 **[Test build #74788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74788/testReport)** for PR 16330 at commit [`2eb75f8`](https://github.com/apache/spark/commit/2eb75f8d73ec3ce1eb85d7c501e4c072499b8f44). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16626: [SPARK-19261][SQL] Alter add columns for Hive serde and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16626 **[Test build #74791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74791/testReport)** for PR 16626 at commit [`a28fc42`](https://github.com/apache/spark/commit/a28fc42cee6dc52516e074ced7c4351ee6baa45d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790507 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SimpleCatalystConf.scala --- @@ -40,6 +40,9 @@ case class SimpleCatalystConf( override val cboEnabled: Boolean = false, override val joinReorderEnabled: Boolean = false, override val joinReorderDPThreshold: Int = 12, +override val starSchemaDetection: Boolean = false, +override val starSchemaFTRatio: Double = 0.9, +override val ndvMaxError: Double = 0.05, --- End diff -- @cloud-fan Thank you. I will update the file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790475 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106790425 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark pull request #16626: [SPARK-19261][SQL] Alter add columns for Hive ser...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/16626#discussion_r106790141 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1860,4 +1861,119 @@ class HiveDDLSuite } } } + + hiveFormats.foreach { tableType => +test(s"alter hive serde table add columns -- partitioned - $tableType") { + withTable("tab") { +sql( + s""" + |CREATE TABLE tab (c1 int, c2 int) + |PARTITIONED BY (c3 int) STORED AS $tableType + """.stripMargin) + +sql("INSERT INTO tab PARTITION (c3=1) VALUES (1, 2)") +sql("ALTER TABLE tab ADD COLUMNS (c4 int)") +checkAnswer( + sql("SELECT * FROM tab WHERE c3 = 1"), + Seq(Row(1, 2, null, 1)) +) +assert(sql("SELECT * FROM tab").schema --- End diff -- Maybe `spark.table("tab")` is shorter? To use `getTableMetadata`, I would need to go through `spark.sessionState.catalog.getTableMetadata`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
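For reference, a quick sketch of the two alternatives being weighed, assuming the same `tab` table created by the test above:

```Scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Shorter: go through the DataFrame API.
val schemaViaDataFrame = spark.table("tab").schema

// More verbose: go through the session catalog's table metadata.
val schemaViaCatalog =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier("tab")).schema
```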
[GitHub] spark issue #17334: [SPARK-19998][Block Manager]BlockRDD block not found Exc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17334 **[Test build #3603 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3603/testReport)** for PR 17334 at commit [`98393f8`](https://github.com/apache/spark/commit/98393f8e346be0cd1cbe73b6861668e17446990d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17311: [SPARK-19970][SQL] Table owner should be USER instead of...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17311 Hi, @vanzin and @srowen. Could you review this PR (again) when you have some time? I feel guilty because this PR needs to be verified manually on kerberized clusters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17219 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17251: [SPARK-19910][SQL] `stack` should not reject NULL values...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/17251 Hi, @cloud-fan. Is it possible to include this fix in Spark 2.1.1? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17219 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74786/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17219: [SPARK-19876][SS][WIP] OneTime Trigger Executor
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17219 **[Test build #74786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74786/testReport)** for PR 17219 at commit [`fd28ed7`](https://github.com/apache/spark/commit/fd28ed7c83896c310296238666cf68f0455b837b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16330: [SPARK-18817][SPARKR][SQL] change derby log output to te...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16330 Had a minor comment on the test case. LGTM otherwise and waiting for Jenkins --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16330: [SPARK-18817][SPARKR][SQL] change derby log outpu...
Github user shivaram commented on a diff in the pull request: https://github.com/apache/spark/pull/16330#discussion_r106789593 --- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R --- @@ -2909,6 +2910,30 @@ test_that("Collect on DataFrame when NAs exists at the top of a timestamp column expect_equal(class(ldf3$col3), c("POSIXct", "POSIXt")) }) +compare_list <- function(list1, list2) { + # get testthat to show the diff by first making the 2 lists equal in length + expect_equal(length(list1), length(list2)) + l <- max(length(list1), length(list2)) + length(list1) <- l + length(list2) <- l + expect_equal(sort(list1, na.last = TRUE), sort(list2, na.last = TRUE)) +} + +# This should always be the last test in this test file. +test_that("No extra files are created in SPARK_HOME by starting session and making calls", { + # Check that it is not creating any extra file. + # Does not check the tempdir which would be cleaned up after. + filesAfter <- list.files(path = sparkRDir, all.files = TRUE) + + expect_true(length(sparkRFilesBefore) > 0) + # first, ensure derby.log is not there + expect_false("derby.log" %in% filesAfter) + # second, ensure only spark-warehouse is created when calling SparkSession, enableHiveSupport = F --- End diff -- I'm a little confused about how these two setdiff commands map to the with- and without-Hive-support cases. Can we make this a bit easier to understand? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106788877 --- Diff: core/src/main/scala/org/apache/spark/TaskEndReason.scala --- @@ -212,8 +212,8 @@ case object TaskResultLost extends TaskFailedReason { * Task was killed intentionally and needs to be rescheduled. */ @DeveloperApi -case object TaskKilled extends TaskFailedReason { - override def toErrorString: String = "TaskKilled (killed intentionally)" +case class TaskKilled(reason: String) extends TaskFailedReason { + override def toErrorString: String = s"TaskKilled ($reason)" --- End diff -- Since this was part of DeveloperApi, what is the impact of making this change? In JsonProtocol, in mocked user code? If it does introduce backward-incompatible changes, is there a way to mitigate them? Perhaps make TaskKilled a class with apply/unapply/serde/toString (essentially all that a case class provides) and a case object with an apply that defaults reason = null (and logs a deprecation warning when used)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
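A rough, self-contained sketch of that mitigation idea (the names and shape here are assumptions, not the actual Spark API): keep a `TaskKilled` value with a default reason so code that used the old parameterless object keeps working.

```Scala
trait TaskFailedReason { def toErrorString: String }

// Plain class instead of a case class, so the companion below can still act as the
// old singleton value while carrying an explicit kill reason.
class TaskKilled(val reason: String) extends TaskFailedReason with Serializable {
  override def toErrorString: String = s"TaskKilled ($reason)"
  override def equals(o: Any): Boolean = o match {
    case t: TaskKilled => t.reason == reason
    case _ => false
  }
  override def hashCode(): Int = reason.hashCode
  override def toString: String = toErrorString
}

// The companion doubles as the legacy `TaskKilled` object, with a default reason.
object TaskKilled extends TaskKilled("killed intentionally") {
  def apply(reason: String): TaskKilled = new TaskKilled(reason)
  def unapply(tk: TaskKilled): Option[String] = Some(tk.reason)
}
```

This is only one possible shape; JsonProtocol round-tripping would still need its own compatibility handling, as the comment above points out.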
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106789380 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -540,6 +540,39 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + // Launches one task that will run forever. Once the SparkListener detects the task has + // started, kill and re-schedule it. The second run of the task will complete immediately. + // If this test times out, then the first version of the task wasn't killed successfully. + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + +val listener = new SparkListener { + override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { +eventually(timeout(10.seconds)) { + assert(SparkContextSuite.isTaskStarted) +} +if (!SparkContextSuite.taskKilled) { + SparkContextSuite.taskKilled = true + sc.killTaskAttempt(taskStart.taskInfo.taskId, true, "first attempt will hang") +} + } --- End diff -- An `onTaskEnd` handler that validates the failure of the first task and the launch/success of the second attempt should be added here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
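A hedged sketch of what that `onTaskEnd` validation could look like, meant to slot into the test body (it assumes `TaskKilled` is the case class introduced by this PR; the flag names are illustrative, not the final test code):

```Scala
import org.apache.spark.{Success, TaskKilled}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

object KillTestState {
  @volatile var firstAttemptKilled = false
  @volatile var secondAttemptSucceeded = false
}

val endListener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskEnd.reason match {
    case _: TaskKilled if taskEnd.taskInfo.attemptNumber == 0 =>
      KillTestState.firstAttemptKilled = true
    case Success if taskEnd.taskInfo.attemptNumber > 0 =>
      KillTestState.secondAttemptSucceeded = true
    case _ =>
  }
}
// After the job finishes, the test would assert both flags:
//   assert(KillTestState.firstAttemptKilled && KillTestState.secondAttemptSucceeded)
```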
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106788727 --- Diff: core/src/main/scala/org/apache/spark/ui/UIUtils.scala --- @@ -354,7 +354,7 @@ private[spark] object UIUtils extends Logging { {completed}/{total} { if (failed > 0) s"($failed failed)" } { if (skipped > 0) s"($skipped skipped)" } -{ if (killed > 0) s"($killed killed)" } +{ reasonToNumKilled.map { case (reason, count) => s"($count killed: $reason)" } } --- End diff -- It would be good to sort this in some order before exposing it, either by descending count or alphabetically, so that updates do not reorder the entries arbitrarily. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
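A tiny sketch of one such deterministic ordering (descending count, then reason; the sample map below is made up for illustration):

```Scala
// Example data standing in for the real reasonToNumKilled map.
val reasonToNumKilled = Map(
  "killed via SparkContext.killTaskAttempt" -> 3,
  "another reason" -> 5)

// Sort by descending count, then by reason, so re-renders keep a stable order.
val ordered = reasonToNumKilled.toSeq.sortBy { case (reason, count) => (-count, reason) }
val rendered = ordered.map { case (reason, count) => s"($count killed: $reason)" }
// rendered: Seq("(5 killed: another reason)", "(3 killed: killed via SparkContext.killTaskAttempt)")
```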
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106789297 --- Diff: core/src/test/scala/org/apache/spark/SparkContextSuite.scala --- @@ -540,6 +540,39 @@ class SparkContextSuite extends SparkFunSuite with LocalSparkContext with Eventu } } + // Launches one task that will run forever. Once the SparkListener detects the task has + // started, kill and re-schedule it. The second run of the task will complete immediately. + // If this test times out, then the first version of the task wasn't killed successfully. + test("Killing tasks") { +sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")) + +SparkContextSuite.isTaskStarted = false +SparkContextSuite.taskKilled = false + +val listener = new SparkListener { + override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { +eventually(timeout(10.seconds)) { + assert(SparkContextSuite.isTaskStarted) +} +if (!SparkContextSuite.taskKilled) { + SparkContextSuite.taskKilled = true + sc.killTaskAttempt(taskStart.taskInfo.taskId, true, "first attempt will hang") +} + } +} +sc.addSparkListener(listener) +eventually(timeout(20.seconds)) { + sc.parallelize(1 to 1).foreach { x => +// first attempt will hang +if (!SparkContextSuite.isTaskStarted) { + SparkContextSuite.isTaskStarted = true + Thread.sleep(999) +} +// second attempt succeeds immediately + } --- End diff -- Nothing actually runs in this test currently; we need to force the RDD's computation (a count() or runJob here). Also, we need to validate that the second task actually ran (which is why the bug in the test was not detected). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17166: [SPARK-19820] [core] Allow reason to be specified...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17166#discussion_r106789004 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala --- @@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler]( taskState: TaskState, reason: TaskFailedReason): Unit = synchronized { taskSetManager.handleFailedTask(tid, taskState, reason) -if (!taskSetManager.isZombie && taskState != TaskState.KILLED) { +if (!taskSetManager.isZombie) { --- End diff -- @kayousterhout Are we making the change that killed tasks can/should be retried? If so, this is a behavior change, and we need to do the same in TSM.handleFailedTask(). This is what I mentioned w.r.t. killed tasks not resulting in task resubmission. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106789403 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer import scala.annotation.tailrec import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins +import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation} import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical._ import org.apache.spark.sql.catalyst.rules._ +import org.apache.spark.sql.catalyst.CatalystConf + +/** + * Encapsulates star-schema join detection. + */ +case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper { + + /** + * Star schema consists of one or more fact tables referencing a number of dimension + * tables. In general, star-schema joins are detected using the following conditions: + * 1. Informational RI constraints (reliable detection) + *+ Dimension contains a primary key that is being joined to the fact table. + *+ Fact table contains foreign keys referencing multiple dimension tables. + * 2. Cardinality based heuristics + *+ Usually, the table with the highest cardinality is the fact table. + *+ Table being joined with the most number of tables is the fact table. + * + * To detect star joins, the algorithm uses a combination of the above two conditions. + * The fact table is chosen based on the cardinality heuristics, and the dimension + * tables are chosen based on the RI constraints. A star join will consist of the largest + * fact table joined with the dimension tables on their primary keys. To detect that a + * column is a primary key, the algorithm uses table and column statistics. + * + * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only + * the star join with the largest fact table. Choosing the largest fact table on the + * driving arm to avoid large inners is in general a good heuristic. This restriction can + * be lifted with support for bushy tree plans. + * + * The highlights of the algorithm are the following: + * + * Given a set of joined tables/plans, the algorithm first verifies if they are eligible + * for star join detection. An eligible plan is a base table access with valid statistics. + * A base table access represents Project or Filter operators above a LeafNode. Conservatively, + * the algorithm only considers base table access as part of a star join since they provide + * reliable statistics. + * + * If some of the plans are not base table access, or statistics are not available, the algorithm + * returns an empty star join plan since, in the absence of statistics, it cannot make + * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality + * (number of rows), which is assumed to be a fact table. + * + * Next, it computes the set of dimension tables for the current fact table. A dimension table + * is assumed to be in a RI relationship with a fact table. To infer column uniqueness, + * the algorithm compares the number of distinct values with the total number of rows in the + * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted + * based on 1TB TPC-DS data), the column is assumed to be unique. 
+ */ + def findStarJoins( + input: Seq[LogicalPlan], + conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = { + +val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]] + +if (!conf.starSchemaDetection || input.size < 2) { + emptyStarJoinPlan +} else { + // Find if the input plans are eligible for star join detection. + // An eligible plan is a base table access with valid statistics. + val foundEligibleJoin = input.forall { +case PhysicalOperation(_, _, t: LeafNode) if t.stats(conf).rowCount.isDefined => true +case _ => false + } + + if (!foundEligibleJoin) { +// Some plans don't have stats or are complex plans. Conservatively, +// return an empty star join. This restriction can be lifted +// once statistics are propagated in the plan. +emptyStarJoinPlan + } else { +// Find the fact table using cardinality based heuristics i.e. +// the table with the largest number of rows. +val sortedFactTables = input.map { plan => +
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106789422

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *     + Dimension contains a primary key that is being joined to the fact table.
+   *     + Fact table contains foreign keys referencing multiple dimension tables.
+   *  2. Cardinality based heuristics
+   *     + Usually, the table with the highest cardinality is the fact table.
+   *     + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
+   * the star join with the largest fact table. Choosing the largest fact table on the
+   * driving arm to avoid large inners is in general a good heuristic. This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if they are eligible
+   * for star join detection. An eligible plan is a base table access with valid statistics.
+   * A base table access represents Project or Filter operators above a LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer column uniqueness,
+   * the algorithm compares the number of distinct values with the total number of rows in the
+   * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+      input: Seq[LogicalPlan],
+      conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
--- End diff --

@cloud-fan Please see my comment above.
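For readers skimming the heuristic in the scaladoc above, a minimal standalone sketch of the "relative difference within ndvMaxError * 2" uniqueness test might look like the following. The helper name and the 0.05 default error are illustrative assumptions, not the PR's code (the real rule reads ndvMaxError from CatalystConf and column statistics):

```scala
// Treat a column as (nearly) unique when its distinct-value count is close
// enough to the table's row count.
def isLikelyUnique(distinctCount: BigInt, rowCount: BigInt, ndvMaxError: Double = 0.05): Boolean = {
  if (rowCount <= 0) {
    false
  } else {
    // Relative difference between the distinct-value count and the row count.
    val relDiff = math.abs(distinctCount.toDouble / rowCount.toDouble - 1.0d)
    relDiff <= ndvMaxError * 2
  }
}

// A 1000-row dimension whose key column has ~998 distinct values is considered unique:
// isLikelyUnique(BigInt(998), BigInt(1000))   // true
// isLikelyUnique(BigInt(120), BigInt(1000))   // false
```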
[GitHub] spark issue #16596: [SPARK-19237][SPARKR][CORE][WIP] spark-submit should han...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16596

Ok, sounds good. Since this touches spark-submit scripts that are shared across all languages, it would be good to loop in some other reviewers as well. I can do that once we have the new diff.
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106789103

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -83,9 +411,19 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper {
   }

   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-    case j @ ExtractFiltersAndInnerJoins(input, conditions)
+    case ExtractFiltersAndInnerJoins(input, conditions)
         if input.size > 2 && conditions.nonEmpty =>
-      createOrderedJoin(input, conditions)
+      if (conf.starSchemaDetection && !conf.cboEnabled) {
--- End diff --

@cloud-fan It doesn't conflict with CBO, but the guard is there just to be safe.
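To make the control flow behind that guard concrete, here is a self-contained sketch with plain strings standing in for logical plans; the function signature is illustrative only, not the rule's actual code:

```scala
// Star-join ordering is attempted only when detection is enabled and CBO is off;
// when no star join is found (or the guard is off) the input order is left to the
// existing size-based heuristic.
def orderJoins(
    starSchemaDetection: Boolean,
    cboEnabled: Boolean,
    input: Seq[String],
    findStarJoinPlan: Seq[String] => Seq[String]): Seq[String] = {
  if (starSchemaDetection && !cboEnabled) {
    val star = findStarJoinPlan(input)
    if (star.nonEmpty) star ++ input.filterNot(star.contains) else input
  } else {
    input
  }
}

// orderJoins(starSchemaDetection = true, cboEnabled = false,
//   Seq("dim1", "fact", "dim2"), _ => Seq("fact", "dim1", "dim2"))
// => List(fact, dim1, dim2)
```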
[GitHub] spark issue #17342: [SPARK-18910][SPARK-12868] Allow adding jars from hdfs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17342

**[Test build #74790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74790/testReport)** for PR 17342 at commit [`04556c9`](https://github.com/apache/spark/commit/04556c9f2f4feb53e3f644d795a38de4a4e919ca).
[GitHub] spark pull request #17342: [SPARK-18910][SPARK-12868] Allow adding jars from...
GitHub user weiqingy opened a pull request: https://github.com/apache/spark/pull/17342

[SPARK-18910][SPARK-12868] Allow adding jars from hdfs

## What changes were proposed in this pull request?

Spark 2.2 is going to be cut; it would be great if SPARK-12868 could be resolved before that. There have been several PRs for this, such as [PR#16324](https://github.com/apache/spark/pull/16324), but all of them have been inactive for a long time or have been closed. This PR adds a SparkUrlStreamHandlerFactory, which relies on the protocol to choose the appropriate URLStreamHandlerFactory (such as FsUrlStreamHandlerFactory) to create the URLStreamHandler.

## How was this patch tested?

1. Added a new unit test.
2. Checked manually.

Before: throws an exception with "failed unknown protocol: hdfs" (screenshot: https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png)

After: (screenshot: https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/weiqingy/spark SPARK-18910

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17342.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17342

commit 04556c9f2f4feb53e3f644d795a38de4a4e919ca
Author: Weiqing Yang
Date: 2017-03-18T18:55:28Z

    [SPARK-18910] Allow adding jars from hdfs
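The key idea, a protocol-dispatching URLStreamHandlerFactory, could look roughly like the sketch below. The class name and the set of handled schemes are assumptions for illustration, not the PR's exact implementation:

```scala
import java.net.{URL, URLStreamHandler, URLStreamHandlerFactory}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// Delegate hdfs-like schemes to Hadoop's FsUrlStreamHandlerFactory and let the
// JVM fall back to its built-in handlers (http, file, jar, ...) by returning null.
class ProtocolDispatchingFactory extends URLStreamHandlerFactory {
  private val fsFactory = new FsUrlStreamHandlerFactory(new Configuration())

  override def createURLStreamHandler(protocol: String): URLStreamHandler = protocol match {
    case "hdfs" | "webhdfs" => fsFactory.createURLStreamHandler(protocol)
    case _ => null // null means "use the JVM's default handler for this protocol"
  }
}

// Note: URL.setURLStreamHandlerFactory may be called at most once per JVM,
// so a real implementation has to guard against double registration.
// URL.setURLStreamHandlerFactory(new ProtocolDispatchingFactory())
```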
[GitHub] spark pull request #15363: [SPARK-17791][SQL] Join reordering using star sch...
Github user ioana-delaney commented on a diff in the pull request: https://github.com/apache/spark/pull/15363#discussion_r106788630

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ---
@@ -20,19 +20,347 @@ package org.apache.spark.sql.catalyst.optimizer

 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.planning.ExtractFiltersAndInnerJoins
+import org.apache.spark.sql.catalyst.planning.{ExtractFiltersAndInnerJoins, PhysicalOperation}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.rules._
+import org.apache.spark.sql.catalyst.CatalystConf
+
+/**
+ * Encapsulates star-schema join detection.
+ */
+case class StarSchemaDetection(conf: CatalystConf) extends PredicateHelper {
+
+  /**
+   * Star schema consists of one or more fact tables referencing a number of dimension
+   * tables. In general, star-schema joins are detected using the following conditions:
+   *  1. Informational RI constraints (reliable detection)
+   *     + Dimension contains a primary key that is being joined to the fact table.
+   *     + Fact table contains foreign keys referencing multiple dimension tables.
+   *  2. Cardinality based heuristics
+   *     + Usually, the table with the highest cardinality is the fact table.
+   *     + Table being joined with the most number of tables is the fact table.
+   *
+   * To detect star joins, the algorithm uses a combination of the above two conditions.
+   * The fact table is chosen based on the cardinality heuristics, and the dimension
+   * tables are chosen based on the RI constraints. A star join will consist of the largest
+   * fact table joined with the dimension tables on their primary keys. To detect that a
+   * column is a primary key, the algorithm uses table and column statistics.
+   *
+   * Since Catalyst only supports left-deep tree plans, the algorithm currently returns only
+   * the star join with the largest fact table. Choosing the largest fact table on the
+   * driving arm to avoid large inners is in general a good heuristic. This restriction can
+   * be lifted with support for bushy tree plans.
+   *
+   * The highlights of the algorithm are the following:
+   *
+   * Given a set of joined tables/plans, the algorithm first verifies if they are eligible
+   * for star join detection. An eligible plan is a base table access with valid statistics.
+   * A base table access represents Project or Filter operators above a LeafNode. Conservatively,
+   * the algorithm only considers base table access as part of a star join since they provide
+   * reliable statistics.
+   *
+   * If some of the plans are not base table access, or statistics are not available, the algorithm
+   * returns an empty star join plan since, in the absence of statistics, it cannot make
+   * good planning decisions. Otherwise, the algorithm finds the table with the largest cardinality
+   * (number of rows), which is assumed to be a fact table.
+   *
+   * Next, it computes the set of dimension tables for the current fact table. A dimension table
+   * is assumed to be in a RI relationship with a fact table. To infer column uniqueness,
+   * the algorithm compares the number of distinct values with the total number of rows in the
+   * table. If their relative difference is within certain limits (i.e. ndvMaxError * 2, adjusted
+   * based on 1TB TPC-DS data), the column is assumed to be unique.
+   */
+  def findStarJoins(
+      input: Seq[LogicalPlan],
+      conditions: Seq[Expression]): Seq[Seq[LogicalPlan]] = {
+
+    val emptyStarJoinPlan = Seq.empty[Seq[LogicalPlan]]
+
+    if (!conf.starSchemaDetection || input.size < 2) {
+      emptyStarJoinPlan
+    } else {
+      // Find if the input plans are eligible for star join detection.
+      // An eligible plan is a base table access with valid statistics.
+      val foundEligibleJoin = input.forall {
--- End diff --

@cloud-fan Star-join detection works with and without CBO. Project/Filter cardinality is computed only if CBO is enabled. Here, I am doing a simple validation to ensure statistics have been collected for the base relations.
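The quoted diff cuts off right where the fact-table candidate is chosen. As a rough standalone illustration of that cardinality heuristic (the names and row counts below are made up for the example, not Spark APIs):

```scala
// Pick the plan with the largest row count as the fact-table candidate.
case class TableCard(name: String, rowCount: BigInt)

def pickFactTable(tables: Seq[TableCard]): Option[TableCard] =
  if (tables.size < 2) None else Some(tables.maxBy(_.rowCount))

// pickFactTable(Seq(
//   TableCard("date_dim", BigInt(73049)),
//   TableCard("store_sales", BigInt("2879987999")),
//   TableCard("item", BigInt(300000))))
// => Some(TableCard("store_sales", 2879987999))
```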
[GitHub] spark issue #17334: [SPARK-19998][Block Manager]BlockRDD block not found Exc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17334

**[Test build #3602 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3602/testReport)** for PR 17334 at commit [`98393f8`](https://github.com/apache/spark/commit/98393f8e346be0cd1cbe73b6861668e17446990d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17290: [SPARK-16599][CORE] java.util.NoSuchElementException: No...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/17290

That's fine, I can add a warning. I don't know if it is a bug situation, but it sure could be.
[GitHub] spark pull request #17290: [SPARK-16599][CORE] java.util.NoSuchElementExcept...
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/17290#discussion_r106788142

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala ---
@@ -340,7 +340,7 @@ private[storage] class BlockInfoManager extends Logging {
     val blocksWithReleasedLocks = mutable.ArrayBuffer[BlockId]()
     val readLocks = synchronized {
-      readLocksByTask.remove(taskAttemptId).get
+      readLocksByTask.remove(taskAttemptId).getOrElse(ImmutableMultiset.of[BlockId]())
--- End diff --

Missed this PR earlier, sorry about that. Since this is an unexpected scenario (it really must not happen according to the current code!), we should probably log a warning message with the stack trace to help debugging. Otherwise it silently ignores a fairly bad bug.
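A standalone sketch of the suggested handling, falling back to an empty result but leaving a warning with a stack trace in the logs, might look like this. The object and method names are simplified stand-ins for the BlockInfoManager internals, not the actual patch:

```scala
import scala.collection.mutable

import org.slf4j.LoggerFactory

object ReadLockRelease {
  private val log = LoggerFactory.getLogger(getClass)

  // Remove and return the task's read locks; if the entry is unexpectedly
  // missing, log a warning with a stack trace instead of failing silently.
  def releaseAll[B](readLocksByTask: mutable.Map[Long, Seq[B]], taskAttemptId: Long): Seq[B] =
    readLocksByTask.remove(taskAttemptId) match {
      case Some(locks) => locks
      case None =>
        log.warn(s"Task $taskAttemptId had no registered read locks to release; " +
          "this should not happen.", new Exception("missing read-lock entry"))
        Seq.empty[B]
    }
}
```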
[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...
Github user nsyca commented on the issue: https://github.com/apache/spark/pull/17286

Right, I misread it. If there is no join predicate between a table and any cluster of tables, we should not consider that table in the join enumeration at all. We can simply push that table to be the last join.
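As a tiny illustration of that strategy (the table names and the JoinCond shape are made up for the example, not the PR's code):

```scala
// Split the joined tables into those reachable through at least one join
// predicate and those that are completely disconnected; the disconnected ones
// can simply be appended at the end of the join order (as cross joins).
case class JoinCond(left: String, right: String)

def splitDisconnected(tables: Seq[String], conds: Seq[JoinCond]): (Seq[String], Seq[String]) = {
  val connected = conds.flatMap(c => Seq(c.left, c.right)).toSet
  tables.partition(connected.contains)
}

// splitDisconnected(Seq("fact", "dim1", "lonely"), Seq(JoinCond("fact", "dim1")))
// => (List(fact, dim1), List(lonely))   -- "lonely" is joined last
```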
[GitHub] spark issue #16982: [SPARK-19654][SPARKR][SS] Structured Streaming API for R
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16982

Merged build finished. Test PASSed.