[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12720#discussion_r61376668
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -337,6 +337,16 @@ case class PrettyAttribute(
   override def nullable: Boolean = true
 }
 
+/**
+ * A place holder used to hold a reference that has been resolved to a field outside of the current
+ * plan. This is used for correlated subqueries.
+ */
+case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable {
+  override def dataType: DataType = e.dataType
+  override def nullable: Boolean = e.nullable
+  override def prettyName: String = "outer"
--- End diff --

Should include `e` here?
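
One hedged sketch of what including `e` could look like: overriding `toString` so plans print something like `outer(a#1)` instead of a bare `outer`. The `toString` body below is an illustration, not code from this PR:

```scala
import org.apache.spark.sql.catalyst.expressions.{LeafExpression, NamedExpression, Unevaluable}
import org.apache.spark.sql.types.DataType

// Hypothetical variant that surfaces the wrapped attribute in string output.
case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable {
  override def dataType: DataType = e.dataType
  override def nullable: Boolean = e.nullable
  override def prettyName: String = "outer"
  override def toString: String = s"outer($e)"
}
```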





[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12720#discussion_r61376609
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -337,6 +337,16 @@ case class PrettyAttribute(
   override def nullable: Boolean = true
 }
 
+/**
+ * A place holder used to hold a reference that has been resolved to a field outside of the current
+ * plan. This is used for correlated subqueries.
+ */
+case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable {
--- End diff --

This is a good idea; it makes it easier to find all the outer references.





[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...

2016-04-27 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/12655#issuecomment-215317033
  
I looked at PR #8427 just now.
Both the #8427 approach and @markhamstra's approach (should we use `getOrElseUpdate` instead of `getOrElse`?) seem like the simplest way to fix the duplicate-stage issue, apart from the risk of a `StackOverflowError`.

I agree that we should fix the duplicate-stage issue first in the simplest way possible, via #8427 or @markhamstra's approach, and then discuss the `StackOverflowError` risk after one of them is merged.
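
As a sketch of the distinction being discussed (the map and the stage-creation function below are stand-ins for the real `DAGScheduler` members, not its actual code):

```scala
import scala.collection.mutable

val shuffleIdToStage = mutable.HashMap.empty[Int, String]
def newStage(shuffleId: Int): String = s"stage-for-shuffle-$shuffleId"

// getOrElse computes a fallback but does NOT cache it, so two lookups for the
// same shuffle id can create two (duplicate) stages:
val a = shuffleIdToStage.getOrElse(0, newStage(0))
val b = shuffleIdToStage.getOrElse(0, newStage(0))  // creates the stage again

// getOrElseUpdate inserts the computed value, so later lookups reuse it:
val c = shuffleIdToStage.getOrElseUpdate(1, newStage(1))
val d = shuffleIdToStage.getOrElseUpdate(1, newStage(1))  // reuses the cached stage
```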





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215316922
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57215/
Test PASSed.





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215316921
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215316797
  
**[Test build #57215 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57215/consoleFull)**
 for PR 12750 at commit 
[`5d34a64`](https://github.com/apache/spark/commit/5d34a646046727ffaf8f5932b2dfeae3a5c10d32).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14938][ML] replace some RDD.map with Da...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12718#issuecomment-215316563
  
**[Test build #57222 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57222/consoleFull)**
 for PR 12718 at commit 
[`e57332a`](https://github.com/apache/spark/commit/e57332a0b72f6c0bf58ee72c77eddf4f5904fe9b).





[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/12720#discussion_r61376302
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -866,71 +867,189 @@ class Analyzer(
* Note: CTEs are handled in CTESubstitution.
*/
   object ResolveSubquery extends Rule[LogicalPlan] with PredicateHelper {
-
 /**
- * Resolve the correlated predicates in the clauses (e.g. WHERE or HAVING) of a
- * sub-query by using the plan the predicates should be correlated to.
+ * Resolve a subquery using the outer plan. This rule creates a dedicated analyzer which can
+ * also resolve outer plan references.
  */
-    private def resolveCorrelatedSubquery(
-        sub: LogicalPlan, outer: LogicalPlan,
-        aliases: scala.collection.mutable.Map[Attribute, Alias]): LogicalPlan = {
-      // First resolve as much of the sub-query as possible
-      val analyzed = execute(sub)
-      if (analyzed.resolved) {
-        analyzed
-      } else {
-        // Only resolve the lowest plan that is not resolved by outer plan, otherwise it could be
-        // resolved by itself
-        val resolvedByOuter = analyzed transformDown {
-          case q: LogicalPlan if q.childrenResolved && !q.resolved =>
-            q transformExpressions {
-              case u @ UnresolvedAttribute(nameParts) =>
-                withPosition(u) {
-                  try {
-                    val outerAttrOpt = outer.resolve(nameParts, resolver)
-                    if (outerAttrOpt.isDefined) {
-                      val outerAttr = outerAttrOpt.get
-                      if (q.inputSet.contains(outerAttr)) {
-                        // Got a conflict, create an alias for the attribute come from outer table
-                        val alias = Alias(outerAttr, outerAttr.toString)()
-                        val attr = alias.toAttribute
-                        aliases += attr -> alias
-                        attr
-                      } else {
-                        outerAttr
-                      }
-                    } else {
-                      u
-                    }
-                  } catch {
-                    case a: AnalysisException => u
-                  }
-                }
+    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+      // Only a few unary nodes (Project/Filter/Aggregate/Having) can contain subqueries.
+      case q: UnaryNode if q.childrenResolved =>
+        q transformExpressions {
+          case e: SubqueryExpression if !e.query.resolved =>
+            val analyzer = new Analyzer(catalog, conf) {
+              override val extendedCheckRules = self.extendedCheckRules
+              override val extendedResolutionRules = self.extendedResolutionRules :+
+                ResolveOuterReferences(q.child, resolver)
             }
+            e.withNewPlan(analyzer.execute(e.query))
--- End diff --

For example, take `Filter('a > 'b.a, Aggregate(['b], ['sum('c)]))`. `b` and `c` could be resolved first (in `ResolveReferences`), then `sum` could be resolved as an aggregate function (in `ResolveAggregateFunctions`), at which point the aggregate becomes resolved. As a result, `a` would be resolved as an outer reference via `OuterReference`, but it should not be.
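
Sketched with the Catalyst test DSL, the plan shape being described is roughly (relation and attribute names are made up, and the DSL helpers are assumed to be in scope):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

val relation = LocalRelation('b.int, 'c.int)   // note: no attribute `a` here

// Once `b`, `c` and `sum` are resolved, the Aggregate below is fully resolved,
// and the only remaining unresolved name is `a` -- which the subquery analyzer
// would then (wrongly) claim as an outer reference.
val plan = relation.groupBy('b)(sum('c).as("s")).where('a > 's)
```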





[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/12745





[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215315836
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57213/
Test PASSed.





[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215315834
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215315768
  
Merging in master. Thanks.






[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215315711
  
**[Test build #57213 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57213/consoleFull)**
 for PR 12745 at commit 
[`06f83cc`](https://github.com/apache/spark/commit/06f83cc1de78aaf56942404fb24aca81d1b66d2e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12720#discussion_r61375897
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -866,71 +867,189 @@ class Analyzer(
* Note: CTEs are handled in CTESubstitution.
*/
   object ResolveSubquery extends Rule[LogicalPlan] with PredicateHelper {
-
 /**
- * Resolve the correlated predicates in the clauses (e.g. WHERE or HAVING) of a
- * sub-query by using the plan the predicates should be correlated to.
+ * Resolve a subquery using the outer plan. This rule creates a dedicated analyzer which can
+ * also resolve outer plan references.
  */
-    private def resolveCorrelatedSubquery(
-        sub: LogicalPlan, outer: LogicalPlan,
-        aliases: scala.collection.mutable.Map[Attribute, Alias]): LogicalPlan = {
-      // First resolve as much of the sub-query as possible
-      val analyzed = execute(sub)
-      if (analyzed.resolved) {
-        analyzed
-      } else {
-        // Only resolve the lowest plan that is not resolved by outer plan, otherwise it could be
-        // resolved by itself
-        val resolvedByOuter = analyzed transformDown {
-          case q: LogicalPlan if q.childrenResolved && !q.resolved =>
-            q transformExpressions {
-              case u @ UnresolvedAttribute(nameParts) =>
-                withPosition(u) {
-                  try {
-                    val outerAttrOpt = outer.resolve(nameParts, resolver)
-                    if (outerAttrOpt.isDefined) {
-                      val outerAttr = outerAttrOpt.get
-                      if (q.inputSet.contains(outerAttr)) {
-                        // Got a conflict, create an alias for the attribute come from outer table
-                        val alias = Alias(outerAttr, outerAttr.toString)()
-                        val attr = alias.toAttribute
-                        aliases += attr -> alias
-                        attr
-                      } else {
-                        outerAttr
-                      }
-                    } else {
-                      u
-                    }
-                  } catch {
-                    case a: AnalysisException => u
-                  }
-                }
+    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+      // Only a few unary nodes (Project/Filter/Aggregate/Having) can contain subqueries.
+      case q: UnaryNode if q.childrenResolved =>
+        q transformExpressions {
+          case e: SubqueryExpression if !e.query.resolved =>
+            val analyzer = new Analyzer(catalog, conf) {
+              override val extendedCheckRules = self.extendedCheckRules
+              override val extendedResolutionRules = self.extendedResolutionRules :+
+                ResolveOuterReferences(q.child, resolver)
             }
+            e.withNewPlan(analyzer.execute(e.query))
--- End diff --

(updated comment)





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215315114
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57218/
Test FAILed.





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215315097
  
**[Test build #57218 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57218/consoleFull)**
 for PR 12683 at commit 
[`6650890`](https://github.com/apache/spark/commit/665089051aef4dd4ac189eed329ee55d3e8df9e3).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215315113
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12720#discussion_r61375586
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -866,71 +867,189 @@ class Analyzer(
* Note: CTEs are handled in CTESubstitution.
*/
   object ResolveSubquery extends Rule[LogicalPlan] with PredicateHelper {
-
 /**
- * Resolve the correlated predicates in the clauses (e.g. WHERE or HAVING) of a
- * sub-query by using the plan the predicates should be correlated to.
+ * Resolve a subquery using the outer plan. This rule creates a dedicated analyzer which can
+ * also resolve outer plan references.
  */
-    private def resolveCorrelatedSubquery(
-        sub: LogicalPlan, outer: LogicalPlan,
-        aliases: scala.collection.mutable.Map[Attribute, Alias]): LogicalPlan = {
-      // First resolve as much of the sub-query as possible
-      val analyzed = execute(sub)
-      if (analyzed.resolved) {
-        analyzed
-      } else {
-        // Only resolve the lowest plan that is not resolved by outer plan, otherwise it could be
-        // resolved by itself
-        val resolvedByOuter = analyzed transformDown {
-          case q: LogicalPlan if q.childrenResolved && !q.resolved =>
-            q transformExpressions {
-              case u @ UnresolvedAttribute(nameParts) =>
-                withPosition(u) {
-                  try {
-                    val outerAttrOpt = outer.resolve(nameParts, resolver)
-                    if (outerAttrOpt.isDefined) {
-                      val outerAttr = outerAttrOpt.get
-                      if (q.inputSet.contains(outerAttr)) {
-                        // Got a conflict, create an alias for the attribute come from outer table
-                        val alias = Alias(outerAttr, outerAttr.toString)()
-                        val attr = alias.toAttribute
-                        aliases += attr -> alias
-                        attr
-                      } else {
-                        outerAttr
-                      }
-                    } else {
-                      u
-                    }
-                  } catch {
-                    case a: AnalysisException => u
-                  }
-                }
+    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+      // Only a few unary nodes (Project/Filter/Aggregate/Having) can contain subqueries.
+      case q: UnaryNode if q.childrenResolved =>
+        q transformExpressions {
+          case e: SubqueryExpression if !e.query.resolved =>
+            val analyzer = new Analyzer(catalog, conf) {
+              override val extendedCheckRules = self.extendedCheckRules
+              override val extendedResolutionRules = self.extendedResolutionRules :+
+                ResolveOuterReferences(q.child, resolver)
             }
+            e.withNewPlan(analyzer.execute(e.query))
--- End diff --

The new rule wraps all outer references in a special expression called `OuterReference`. The wrapper basically hides the outer attribute and makes sure no collisions happen. We resolve collisions when we pull out the predicates (by adding a single `Project` if we need one).
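
A sketch of the wrapping step (the helper name below is hypothetical; `OuterReference` is the expression added by this PR):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Wrap every attribute that is produced by the outer plan, so the inner plan's own
// resolution and de-duplication never see the outer attribute directly.
def wrapOuterReferences(expr: Expression, outer: LogicalPlan): Expression =
  expr transformUp {
    case a: AttributeReference if outer.outputSet.contains(a) => OuterReference(a)
  }
```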





[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/12736#issuecomment-215314522
  
LGTM. An unrelated question: how do we express the EXCEPT ALL semantics?
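
For reference, EXCEPT ALL can be emulated on top of EXCEPT DISTINCT by making duplicate rows distinct with a per-value sequence number. A sketch, assuming window functions are available and `l` and `r` are registered single-column tables:

```scala
// Pair each row with its duplicate index; EXCEPT (DISTINCT) on the pairs then keeps
// max(count_l - count_r, 0) copies of each value, which is EXCEPT ALL semantics.
val exceptAll = sqlContext.sql(
  """SELECT id FROM (
    |  SELECT id, row_number() OVER (PARTITION BY id ORDER BY id) AS seq FROM l
    |  EXCEPT
    |  SELECT id, row_number() OVER (PARTITION BY id ORDER BY id) AS seq FROM r
    |) tmp""".stripMargin)
```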





[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215314462
  
**[Test build #2898 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2898/consoleFull)**
 for PR 12745 at commit 
[`7fe0e54`](https://github.com/apache/spark/commit/7fe0e54c5e39ff942fd58a16d5e5e309887ee883).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/12720#discussion_r61375441
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala
 ---
@@ -75,76 +77,63 @@ case class ScalarSubquery(
   override def foldable: Boolean = false
   override def nullable: Boolean = true
 
-  override def withNewPlan(plan: LogicalPlan): ScalarSubquery = ScalarSubquery(plan, exprId)
+  override def conditions: Seq[Expression] = conditionOption.toSeq.flatten
 
-  override def toString: String = s"subquery#${exprId.id}"
+  override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(query = plan)
+
+  override def toString: String = s"subquery#${exprId.id} $conditionString"
 }
 
 /**
  * A predicate subquery checks the existence of a value in a sub-query. We currently only allow
  * [[PredicateSubquery]] expressions within a Filter plan (i.e. WHERE or a HAVING clause). This will
  * be rewritten into a left semi/anti join during analysis.
  */
-abstract class PredicateSubquery extends SubqueryExpression with Unevaluable with Predicate {
+case class PredicateSubquery(
+    query: LogicalPlan,
+    override val children: Seq[Expression] = Seq.empty,
+    nullAware: Boolean = false,
+    exprId: ExprId = NamedExpression.newExprId)
+  extends SubqueryExpression with Predicate with Unevaluable {
+  override lazy val resolved = childrenResolved && query.resolved
+  override lazy val references: AttributeSet = super.references -- query.outputSet
   override def nullable: Boolean = false
+  override def conditions: Seq[Expression] = children
+  override def plan: LogicalPlan = SubqueryAlias(toString, query)
+  override def withNewPlan(plan: LogicalPlan): PredicateSubquery = copy(query = plan)
+  override def toString: String = s"predicate-subquery#${exprId.id} $conditionString"
 }
 
 object PredicateSubquery {
   def hasPredicateSubquery(e: Expression): Boolean = {
-    e.find(_.isInstanceOf[PredicateSubquery]).isDefined
+    e.find {
+      case _: PredicateSubquery | _: ListQuery | _: Exists => true
+      case _ => false
+    }.isDefined
   }
 }
 
 /**
- * The [[InSubQuery]] predicate checks the existence of a value in a sub-query. For example (SQL):
+ * A [[ListQuery]] expression defines the query which we want to search in an IN subquery
+ * expression. It should and can only be used in conjunction with a IN expression.
+ *
+ * For example (SQL):
  * {{{
  *   SELECT  *
 *   FROM    a
 *   WHERE   a.id IN (SELECT  id
 *                    FROM    b)
 * }}}
 */
-case class InSubQuery(
-    value: Expression,
-    query: LogicalPlan,
-    exprId: ExprId = NamedExpression.newExprId) extends PredicateSubquery {
-  override def children: Seq[Expression] = value :: Nil
-  override lazy val resolved: Boolean = value.resolved && query.resolved
-  override def withNewPlan(plan: LogicalPlan): InSubQuery = InSubQuery(value, plan, exprId)
-  override def plan: LogicalPlan = SubqueryAlias(s"subquery#${exprId.id}", query)
-
-  /**
-   * The unwrapped value side expressions.
-   */
-  lazy val expressions: Seq[Expression] = value match {
-    case CreateStruct(cols) => cols
-    case col => Seq(col)
-  }
-
-  /**
-   * Check if the number of columns and the data types on both sides match.
-   */
-  override def checkInputDataTypes(): TypeCheckResult = {
-    // Check the number of arguments.
-    if (expressions.length != query.output.length) {
-      return TypeCheckResult.TypeCheckFailure(
-        s"The number of fields in the value (${expressions.length}) does not match with " +
-          s"the number of columns in the subquery (${query.output.length})")
-    }
-
-    // Check the argument types.
-    expressions.zip(query.output).zipWithIndex.foreach {
-      case ((e, a), i) if e.dataType != a.dataType =>
-        return TypeCheckResult.TypeCheckFailure(
-          s"The data type of value[$i] (${e.dataType}) does not match " +
-            s"subquery column '${a.name}' (${a.dataType}).")
-      case _ =>
-    }
-
-    TypeCheckResult.TypeCheckSuccess
-  }
-
-  override def toString: String = s"$value IN subquery#${exprId.id}"
+case class ListQuery(query: LogicalPlan, exprId: ExprId = NamedExpression.newExprId)
+  extends SubqueryExpression with Unevaluable {
+  override lazy val resolved = false
+  override def dataType: DataType = ArrayType(NullType)
--- End diff --

So `ListQuery` is converted into `PredicateSubquery` in the analyzer. 
`PredicateSubquery` exposes its 'join' expressions as it 

[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215314235
  
**[Test build #57221 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57221/consoleFull)**
 for PR 12724 at commit 
[`49b7b52`](https://github.com/apache/spark/commit/49b7b528a8928335e16bf6081f7dad1e819c77e2).





[GitHub] spark pull request: [SPARK-14850][ML] specialize array data for Ve...

2016-04-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/12640#discussion_r61375365
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala 
---
@@ -29,6 +29,82 @@ abstract class ArrayData extends SpecializedGetters with Serializable {
 
   def array: Array[Any]
 
+  override def equals(o: Any): Boolean = {
+    if (!o.isInstanceOf[ArrayData]) {
+      return false
+    }
+
+    val other = o.asInstanceOf[ArrayData]
+    if (other eq null) {
+      return false
+    }
+
+    val len = numElements()
+    if (len != other.numElements()) {
+      return false
+    }
+
+    var i = 0
+    while (i < len) {
+      if (isNullAt(i) != other.isNullAt(i)) {
+        return false
+      }
+      if (!isNullAt(i)) {
+        val o1 = array(i)
+        val o2 = other.array(i)
+        o1 match {
+          case b1: Array[Byte] =>
+            if (!o2.isInstanceOf[Array[Byte]] ||
+                !java.util.Arrays.equals(b1, o2.asInstanceOf[Array[Byte]])) {
+              return false
+            }
+          case f1: Float if java.lang.Float.isNaN(f1) =>
+            if (!o2.isInstanceOf[Float] || !java.lang.Float.isNaN(o2.asInstanceOf[Float])) {
+              return false
+            }
+          case d1: Double if java.lang.Double.isNaN(d1) =>
+            if (!o2.isInstanceOf[Double] || !java.lang.Double.isNaN(o2.asInstanceOf[Double])) {
+              return false
+            }
+          case _ => if (o1 != o2) {
+            return false
+          }
+        }
+      }
+      i += 1
+    }
+    true
+  }
+
+  override def hashCode: Int = {
+    var result: Int = 37
+    var i = 0
+    val len = numElements()
+    while (i < len) {
--- End diff --

This could be very expensive for large arrays because it scans all elements, which is unnecessary for generating the hashCode.
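
A bounded variant would cap the scan. This is a sketch, not the PR's code; a production version would use the typed getters per element type rather than `array`, which materializes every element:

```scala
import org.apache.spark.sql.catalyst.util.ArrayData

// Hash at most `maxElements` entries so hashCode stays cheap for large arrays,
// at the cost of more collisions between arrays that share a long common prefix.
def boundedHashCode(a: ArrayData, maxElements: Int = 32): Int = {
  var result = 37
  var i = 0
  val len = math.min(a.numElements(), maxElements)
  while (i < len) {
    result = 37 * result + (if (a.isNullAt(i)) 0 else a.array(i).hashCode())
    i += 1
  }
  result
}
```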





[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12720#issuecomment-215314230
  
**[Test build #2899 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2899/consoleFull)**
 for PR 12720 at commit 
[`62c5c2f`](https://github.com/apache/spark/commit/62c5c2f6c628593d860a42baa35c7f6d3cdd9305).





[GitHub] spark pull request: [SPARK-14850][ML] specialize array data for Ve...

2016-04-27 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/12640#issuecomment-215314086
  
@cloud-fan This is still much slower than 1.4, and adding more subclasses of ArrayData may prevent the JIT from inlining methods like `getInt` and `getDouble`. Is it easy to convert to `UnsafeArrayData` directly with a memory copy?
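
The bulk-copy idea, sketched. The flat layout below is deliberately simplified (just the raw values, no null bits or header) and is not `UnsafeArrayData`'s real format; `Platform` is Spark's unsafe-memory helper:

```scala
import org.apache.spark.unsafe.Platform

// Copy a primitive double array into a byte buffer in one call, instead of
// writing element by element through the ArrayData interface.
def copyDoubles(values: Array[Double]): Array[Byte] = {
  val bytes = new Array[Byte](values.length * 8)
  Platform.copyMemory(
    values, Platform.DOUBLE_ARRAY_OFFSET,
    bytes, Platform.BYTE_ARRAY_OFFSET,
    values.length * 8L)
  bytes
}
```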





[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215314116
  
test this please





[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12736#discussion_r61375198
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -398,6 +398,66 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
   Row(4, "d") :: Nil)
 checkAnswer(lowerCaseData.except(lowerCaseData), Nil)
 checkAnswer(upperCaseData.except(upperCaseData), Nil)
+
+// check null equality
+checkAnswer(
+  nullInts.except(nullInts.filter("0 = 1")),
+  nullInts)
+checkAnswer(
+  nullInts.except(nullInts),
+  Nil)
+
+// check if values are de-duplicated
+checkAnswer(
+  allNulls.except(allNulls.filter("0 = 1")),
+  Row(null) :: Nil)
+checkAnswer(
+  allNulls.except(allNulls),
+  Nil)
+
+// check if values are de-duplicated
+val df = Seq(("id1", 1), ("id1", 1), ("id", 1), ("id1", 2)).toDF("id", 
"value")
+checkAnswer(
+  df.except(df.filter("0 = 1")),
+  Row("id1", 1) ::
+  Row("id", 1) ::
+  Row("id1", 2) :: Nil)
+
+// check if the empty set on the left side works
+checkAnswer(
+  allNulls.filter("0 = 1").except(allNulls),
+  Nil)
+  }
+
+  test("except distinct - SQL compliance") {
+val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
+val df_right = Seq(1, 3).toDF("id")
+
+checkAnswer(
+  df_left.except(df_right),
+  Row(2) :: Row(4) :: Nil
+)
+  }
+
+  test("except - nullability") {
+val nonNullableInts = Seq(Tuple1(11), Tuple1(3)).toDF()
+assert(nonNullableInts.schema.forall(_.nullable == false))
--- End diff --

nit: `forall(!_.nullable)`





[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12736#discussion_r61375067
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala
 ---
@@ -488,14 +488,6 @@ class HiveTypeCoercionSuite extends PlanTest {
 assert(r1.right.isInstanceOf[Project])
 assert(r2.left.isInstanceOf[Project])
 assert(r2.right.isInstanceOf[Project])
-
-val r3 = wt(Except(firstTable, firstTable)).asInstanceOf[Except]
-    checkOutput(r3.left, Seq(IntegerType, DecimalType.SYSTEM_DEFAULT, ByteType, DoubleType))
-    checkOutput(r3.right, Seq(IntegerType, DecimalType.SYSTEM_DEFAULT, ByteType, DoubleType))
-
-// Check if no Project is added
-assert(r3.left.isInstanceOf[LocalRelation])
-assert(r3.right.isInstanceOf[LocalRelation])
--- End diff --

Why remove these? We didn't change the analysis of `Except`, right?





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61374828
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -175,124 +172,143 @@ class TaskMetrics private[spark] () extends Serializable {
   }
 
   // Only used for test
-  private[spark] val testAccum =
-sys.props.get("spark.testing").map(_ => 
TaskMetrics.createLongAccum(TEST_ACCUM))
-
-  @transient private[spark] lazy val internalAccums: Seq[Accumulable[_, 
_]] = {
-val in = inputMetrics
-val out = outputMetrics
-val sr = shuffleReadMetrics
-val sw = shuffleWriteMetrics
-Seq(_executorDeserializeTime, _executorRunTime, _resultSize, 
_jvmGCTime,
-  _resultSerializationTime, _memoryBytesSpilled, _diskBytesSpilled, 
_peakExecutionMemory,
-  _updatedBlockStatuses, sr._remoteBlocksFetched, 
sr._localBlocksFetched, sr._remoteBytesRead,
-  sr._localBytesRead, sr._fetchWaitTime, sr._recordsRead, 
sw._bytesWritten, sw._recordsWritten,
-  sw._writeTime, in._bytesRead, in._recordsRead, out._bytesWritten, 
out._recordsWritten) ++
-  testAccum
-  }
+  private[spark] val testAccum = sys.props.get("spark.testing").map(_ => 
new LongAccumulator)
+
+
+  import InternalAccumulator._
+  @transient private[spark] lazy val nameToAccums = LinkedHashMap(
+    EXECUTOR_DESERIALIZE_TIME -> _executorDeserializeTime,
+    EXECUTOR_RUN_TIME -> _executorRunTime,
+    RESULT_SIZE -> _resultSize,
+    JVM_GC_TIME -> _jvmGCTime,
+    RESULT_SERIALIZATION_TIME -> _resultSerializationTime,
+    MEMORY_BYTES_SPILLED -> _memoryBytesSpilled,
+    DISK_BYTES_SPILLED -> _diskBytesSpilled,
+    PEAK_EXECUTION_MEMORY -> _peakExecutionMemory,
+    UPDATED_BLOCK_STATUSES -> _updatedBlockStatuses,
+    shuffleRead.REMOTE_BLOCKS_FETCHED -> shuffleReadMetrics._remoteBlocksFetched,
+    shuffleRead.LOCAL_BLOCKS_FETCHED -> shuffleReadMetrics._localBlocksFetched,
+    shuffleRead.REMOTE_BYTES_READ -> shuffleReadMetrics._remoteBytesRead,
+    shuffleRead.LOCAL_BYTES_READ -> shuffleReadMetrics._localBytesRead,
+    shuffleRead.FETCH_WAIT_TIME -> shuffleReadMetrics._fetchWaitTime,
+    shuffleRead.RECORDS_READ -> shuffleReadMetrics._recordsRead,
+    shuffleWrite.BYTES_WRITTEN -> shuffleWriteMetrics._bytesWritten,
+    shuffleWrite.RECORDS_WRITTEN -> shuffleWriteMetrics._recordsWritten,
+    shuffleWrite.WRITE_TIME -> shuffleWriteMetrics._writeTime,
+    input.BYTES_READ -> inputMetrics._bytesRead,
+    input.RECORDS_READ -> inputMetrics._recordsRead,
+    output.BYTES_WRITTEN -> outputMetrics._bytesWritten,
+    output.RECORDS_WRITTEN -> outputMetrics._recordsWritten
+  ) ++ testAccum.map(TEST_ACCUM -> _)
+
+  @transient private[spark] lazy val internalAccums: Seq[NewAccumulator[_, _]] =
+nameToAccums.values.toIndexedSeq
 
   /* ========================== *
    |        OTHER THINGS        |
    * ========================== */
 
-  private[spark] def registerForCleanup(sc: SparkContext): Unit = {
-    internalAccums.foreach { accum =>
-      sc.cleaner.foreach(_.registerAccumulatorForCleanup(accum))
+  private[spark] def register(sc: SparkContext): Unit = {
+    nameToAccums.foreach {
+      case (name, acc) => acc.register(sc, name = Some(name), countFailedValues = true)
     }
   }
 
   /**
* External accumulators registered with this task.
*/
-  @transient private lazy val externalAccums = new ArrayBuffer[Accumulable[_, _]]
+  @transient private lazy val externalAccums = new ArrayBuffer[NewAccumulator[_, _]]
 
-  private[spark] def registerAccumulator(a: Accumulable[_, _]): Unit = {
+  private[spark] def registerAccumulator(a: NewAccumulator[_, _]): Unit = {
 externalAccums += a
   }
 
-  /**
-   * Return the latest updates of accumulators in this task.
-   *
-   * The [[AccumulableInfo.update]] field is always defined and the [[AccumulableInfo.value]]
-   * field is always empty, since this represents the partial updates recorded in this task,
-   * not the aggregated value across multiple tasks.
-   */
-  def accumulatorUpdates(): Seq[AccumulableInfo] = {
-    (internalAccums ++ externalAccums).map { a => a.toInfo(Some(a.localValue), None) }
-  }
+  private[spark] def accumulators(): Seq[NewAccumulator[_, _]] = internalAccums ++ externalAccums
 }
 
-/**
- * Internal subclass of [[TaskMetrics]] which is used only for posting events to listeners.
- * Its purpose is to obviate the need for the driver to reconstruct the original accumulators,
- * which might have been garbage-collected. See SPARK-13407 for more details.
- *
- * Instances of this class 

[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...

2016-04-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/12736#discussion_r61374712
  
--- Diff: 
sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java ---
@@ -291,7 +291,7 @@ public void testSetOperation() {
   unioned.collectAsList());
 
 Dataset subtracted = ds.except(ds2);
-    Assert.assertEquals(Arrays.asList("abc", "abc"), subtracted.collectAsList());
+    Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList());
--- End diff --

Yeah. After this PR, the current behavior of `EXCEPT` is the same as the 
standard `EXCEPT DISTINCT`. 
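
Spelled out with illustrative values (the datasets in the actual Java test differ):

```scala
import sqlContext.implicits._  // assumes a SQLContext in scope

val ds  = Seq("abc", "abc", "xyz").toDS()
val ds2 = Seq("xyz").toDS()

// Before this PR: except() kept duplicates  -> Array("abc", "abc")
// After this PR:  EXCEPT DISTINCT semantics -> Array("abc")
ds.except(ds2).collect()
```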





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61374697
  
--- Diff: core/src/main/scala/org/apache/spark/NewAccumulator.scala ---
@@ -0,0 +1,391 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import java.{lang => jl}
+import java.io.ObjectInputStream
+import java.util.concurrent.atomic.AtomicLong
+import javax.annotation.concurrent.GuardedBy
+
+import org.apache.spark.scheduler.AccumulableInfo
+import org.apache.spark.util.Utils
+
+
+private[spark] case class AccumulatorMetadata(
+id: Long,
+name: Option[String],
+countFailedValues: Boolean) extends Serializable
+
+
+/**
+ * The base class for accumulators, that can accumulate inputs of type `IN`, and produce output of
+ * type `OUT`.
+ */
+abstract class NewAccumulator[IN, OUT] extends Serializable {
+  private[spark] var metadata: AccumulatorMetadata = _
+  private[this] var atDriverSide = true
+
+  private[spark] def register(
+      sc: SparkContext,
+      name: Option[String] = None,
+      countFailedValues: Boolean = false): Unit = {
+    if (this.metadata != null) {
+      throw new IllegalStateException("Cannot register an Accumulator twice.")
+    }
+    this.metadata = AccumulatorMetadata(AccumulatorContext.newId(), name, countFailedValues)
+    AccumulatorContext.register(this)
+    sc.cleaner.foreach(_.registerAccumulatorForCleanup(this))
+  }
+
+  /**
+   * Returns true if this accumulator has been registered. Note that all accumulators must be
+   * registered before use, or it will throw an exception.
+   */
+  final def isRegistered: Boolean =
+    metadata != null && AccumulatorContext.originals.containsKey(metadata.id)
+
+  private def assertMetadataNotNull(): Unit = {
+    if (metadata == null) {
+      throw new IllegalAccessError("The metadata of this accumulator has not been assigned yet.")
+    }
+  }
+
+  /**
+   * Returns the id of this accumulator, can only be called after registration.
+   */
+  final def id: Long = {
+    assertMetadataNotNull()
+    metadata.id
+  }
+
+  /**
+   * Returns the name of this accumulator, can only be called after registration.
+   */
+  final def name: Option[String] = {
+    assertMetadataNotNull()
+    metadata.name
+  }
+
+  /**
+   * Whether to accumulate values from failed tasks. This is set to true for system and time
+   * metrics like serialization time or bytes spilled, and false for things with absolute values
+   * like number of input rows. This should be used for internal metrics only.
+   */
+  private[spark] final def countFailedValues: Boolean = {
+    assertMetadataNotNull()
+    metadata.countFailedValues
+  }
+
+  /**
+   * Creates an [[AccumulableInfo]] representation of this [[NewAccumulator]] with the provided
+   * values.
+   */
+  private[spark] def toInfo(update: Option[Any], value: Option[Any]): AccumulableInfo = {
+    val isInternal = name.exists(_.startsWith(InternalAccumulator.METRICS_PREFIX))
+    new AccumulableInfo(id, name, update, value, isInternal, countFailedValues)
+  }
+
+  final private[spark] def isAtDriverSide: Boolean = atDriverSide
+
+  /**
+   * Tells if this accumulator is zero value or not. e.g. for a counter accumulator, 0 is zero
+   * value; for a list accumulator, Nil is zero value.
+   */
+  def isZero(): Boolean
+
+  /**
+   * Creates a new copy of this accumulator, which is zero value. i.e. call `isZero` on the copy
+   * must return true.
+   */
+  def copyAndReset(): NewAccumulator[IN, OUT]
+
+  /**
+   * Takes the inputs and accumulates. e.g. it can be a simple `+=` for counter accumulator.
+   */
+  def add(v: IN): Unit
+
+  /**
+   * Merges another 

[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12612#issuecomment-215312526
  
**[Test build #57220 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57220/consoleFull)**
 for PR 12612 at commit 
[`124568b`](https://github.com/apache/spark/commit/124568b3eeb7e0a657b2fbe4f54bb85543b7ffa3).





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12259#issuecomment-215311747
  
**[Test build #57219 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57219/consoleFull)**
 for PR 12259 at commit 
[`1c230ae`](https://github.com/apache/spark/commit/1c230ae23e17905af9ea9655cbcfb5a948e627a9).





[GitHub] spark pull request: [SPARK-12235][SPARKR] Enhance mutate() to supp...

2016-04-27 Thread sun-rui
Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/10220#discussion_r61374122
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -1451,17 +1451,54 @@ setMethod("mutate",
   function(.data, ...) {
 x <- .data
 cols <- list(...)
-stopifnot(length(cols) > 0)
-stopifnot(class(cols[[1]]) == "Column")
+if (length(cols) <= 0) {
+  return(x)
+}
+
+lapply(cols, function(col) {
+  stopifnot(class(col) == "Column")
+})
+
+# Check if there is any duplicated column name in the DataFrame
+dfCols <- columns(x)
+if (length(unique(dfCols)) != length(dfCols)) {
+  stop("Error: found duplicated column name in the DataFrame")
+}
+
+# TODO: simplify the implementation of this method after SPARK-12225 is resolved.
+
+# For named arguments, use the names for arguments as the column names
+# For unnamed arguments, use the argument symbols as the column names
+args <- sapply(substitute(list(...))[-1], deparse)
--- End diff --

I did try using `cols`, but the result was not correct. I have to use `list(...)`.





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/12259#issuecomment-215310697
  
LGTM pending Jenkins





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/12259#discussion_r61373781
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -117,6 +117,7 @@ class SQLContext private[sql](
*
* @since 1.6.0
*/
+
--- End diff --

minor: remove empty line





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215310647
  
**[Test build #57218 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57218/consoleFull)**
 for PR 12683 at commit 
[`6650890`](https://github.com/apache/spark/commit/665089051aef4dd4ac189eed329ee55d3e8df9e3).





[GitHub] spark pull request: [SPARK-14346][SQL] Add PARTITIONED BY and CLUS...

2016-04-27 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/12734#issuecomment-215310529
  
@jodersky Oh sorry, I pasted the JIRA ticket summary into the PR title but forgot to add the tags. Updated!





[GitHub] spark pull request: [SPARK-14729][Scheduler] Refactored YARN sched...

2016-04-27 Thread hbhanawat
Github user hbhanawat commented on the pull request:

https://github.com/apache/spark/pull/12641#issuecomment-215310532
  
Hmm. 

@vanzin I think you have a point. There are a few things that can be done, but I'm not sure they will simplify the code without reducing flexibility. I will think more on it and get back.





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/12259#discussion_r61373671
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/MLlibTestSparkContext.scala ---
@@ -24,14 +24,18 @@ import org.scalatest.Suite
 
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.ml.util.TempDirectory
-import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.{SQLContext, SQLImplicits}
 import org.apache.spark.util.Utils
 
 trait MLlibTestSparkContext extends TempDirectory { self: Suite =>
   @transient var sc: SparkContext = _
   @transient var sqlContext: SQLContext = _
   @transient var checkpointDir: String = _
 
+  protected object testImplicits extends SQLImplicits {
--- End diff --

Removed it now.





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215310239
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57212/
Test FAILed.





[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61373605
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
+ * Note that all the null values will be imputed as well.
--- End diff --

Yes. That's better.





[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61373554
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
--- End diff --

@sethah, I tried, but I'm afraid it creates more confusion than it resolves. 
After all, the current behavior is not unexpected and works for most (if not 
all) users. I'd prefer to skip the detailed explanation in the API document. 
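
To make the mean/median behaviour concrete, here is a hedged sketch of how the 
proposed transformer might be used, based only on the params in the quoted diff 
(the `setStrategy`/`setMissingValue` setters are assumptions, since the diff 
above only shows the getters):

    import org.apache.spark.ml.feature.Imputer

    // Toy column with one missing value, using the default NaN placeholder.
    val df = sqlContext.createDataFrame(
      Seq(1.0, Double.NaN, 3.0).map(Tuple1.apply)).toDF("value")

    val imputer = new Imputer()
      .setInputCol("value")
      .setOutputCol("imputed")
      .setStrategy("mean")  // or "median" for the approximate median

    // The mean of the non-missing values (1.0, 3.0) is 2.0, so NaN becomes 2.0.
    imputer.fit(df).transform(df).show()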





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215310152
  
**[Test build #57212 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57212/consoleFull)**
 for PR 12750 at commit 
[`4bbf429`](https://github.com/apache/spark/commit/4bbf4292802e475d84ec55994a4ebae3ddc2f4da).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215310236
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61373571
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
--- End diff --

Cool. Thanks.





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12259#issuecomment-215310117
  
**[Test build #57217 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57217/consoleFull)**
 for PR 12259 at commit 
[`9ed0f30`](https://github.com/apache/spark/commit/9ed0f304a81a30a4737ca5864fbe4960e7797145).





[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12604#issuecomment-215309644
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57216/
Test PASSed.





[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12604#issuecomment-215309599
  
**[Test build #57216 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57216/consoleFull)**
 for PR 12604 at commit 
[`fa8a05c`](https://github.com/apache/spark/commit/fa8a05c6acd955670af64637caa9c1f38c5e9f09).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12604#issuecomment-215309643
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...

2016-04-27 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/12467#issuecomment-215308986
  
Looks good overall; I have one last inline comment. After that, it should be 
ready to go.





[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12736#discussion_r61372903
  
--- Diff: 
sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java ---
@@ -291,7 +291,7 @@ public void testSetOperation() {
   unioned.collectAsList());
 
 Dataset subtracted = ds.except(ds2);
-Assert.assertEquals(Arrays.asList("abc", "abc"), 
subtracted.collectAsList());
+Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList());
--- End diff --

Did this PR also fix the semantics of `Except`, or did it only add the 
optimization?
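
For reference, the semantic change the updated assertion implies, sketched in 
Scala with made-up data (set-based `EXCEPT` deduplicates the left side):

    import sqlContext.implicits._

    // Bag semantics would keep both "abc" rows; set semantics keeps only one.
    val ds  = Seq("abc", "abc", "xyz").toDS()
    val ds2 = Seq("xyz").toDS()

    ds.except(ds2).collect()  // Array("abc") under set semantics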





[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12604#issuecomment-215308604
  
**[Test build #57216 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57216/consoleFull)**
 for PR 12604 at commit 
[`fa8a05c`](https://github.com/apache/spark/commit/fa8a05c6acd955670af64637caa9c1f38c5e9f09).





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215308136
  
**[Test build #57215 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57215/consoleFull)**
 for PR 12750 at commit 
[`5d34a64`](https://github.com/apache/spark/commit/5d34a646046727ffaf8f5932b2dfeae3a5c10d32).





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215308071
  
@GayathriMurali You should modify 
[```RWrappers.load```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L44)
 on the Scala side to make it support loading the GLM wrapper. Thanks.
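
For anyone following along, a sketch of the shape of that change: 
`RWrappers.load` dispatches on the persisted class name, so supporting the GLM 
wrapper roughly means adding one more case. Treat this as an illustration of 
the dispatch pattern, not the exact file contents:

    import org.apache.hadoop.fs.Path
    import org.json4s.DefaultFormats
    import org.json4s.jackson.JsonMethods.parse

    import org.apache.spark.SparkException
    import org.apache.spark.ml.util.MLReader

    private[r] object RWrappers extends MLReader[Object] {

      override def load(path: String): Object = {
        implicit val format = DefaultFormats
        // Each wrapper persists an "rMetadata" JSON file recording its class.
        val rMetadataPath = new Path(path, "rMetadata").toString
        val rMetadataStr = sc.textFile(rMetadataPath, 1).first()
        val className = (parse(rMetadataStr) \ "class").extract[String]
        className match {
          case "org.apache.spark.ml.r.NaiveBayesWrapper" =>
            NaiveBayesWrapper.load(path)
          // The case to add for this PR:
          case "org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper" =>
            GeneralizedLinearRegressionWrapper.load(path)
          case _ =>
            throw new SparkException(s"SparkR ML does not support load for $className")
        }
      }
    }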





[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...

2016-04-27 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/12467#discussion_r61372578
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
@@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defaul
 val newModel = testDefaultReadWrite(model)
 checkModelData(model, newModel)
   }
+
+  test("should support all NumericType labels") {
--- End diff --

@BenFradet Sorry for the late response. 
FYI: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L44





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12259#issuecomment-215307961
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12259#issuecomment-215307962
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57207/
Test PASSed.





[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12259#issuecomment-215307877
  
**[Test build #57207 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57207/consoleFull)**
 for PR 12259 at commit 
[`45e87d9`](https://github.com/apache/spark/commit/45e87d9a6d47350b97ce25758f58409b1b56623b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MatrixUDTSuite extends SparkFunSuite `
  * `class VectorUDTSuite extends SparkFunSuite `





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215307044
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215307046
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57211/
Test FAILed.





[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215307026
  
**[Test build #57211 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57211/consoleFull)**
 for PR 12683 at commit 
[`55523f7`](https://github.com/apache/spark/commit/55523f7713615292c427c4000cee36b86c4fe7a2).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14961] Build HashedRelation larger than...

2016-04-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/12740





[GitHub] spark pull request: [SPARK-14961] Build HashedRelation larger than...

2016-04-27 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12740#issuecomment-215306302
  
When profiling the performance of BytesToBytesMap, it does make a difference.





[GitHub] spark pull request: [SPARK-14961] Build HashedRelation larger than...

2016-04-27 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/12740#issuecomment-215306341
  
Merging this into master, thanks!





[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9#issuecomment-215305404
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57210/
Test PASSed.





[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9#issuecomment-215305403
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61371630
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortBasedAggregateExec.scala
 ---
@@ -46,7 +46,7 @@ case class SortBasedAggregateExec(
   AttributeSet(aggregateBufferAttributes)
 
   override private[sql] lazy val metrics = Map(
-"numOutputRows" -> SQLMetrics.createLongMetric(sparkContext, "number 
of output rows"))
+"numOutputRows" -> SQLMetrics.createSumMetric(sparkContext, "number of 
output rows"))
--- End diff --

maybe just createMetric?






[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9#issuecomment-215305352
  
**[Test build #57210 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57210/consoleFull)**
 for PR 9 at commit 
[`c40192b`](https://github.com/apache/spark/commit/c40192b0579080f4af572cf6d12bf37942c03866).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61371318
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortBasedAggregateExec.scala
 ---
@@ -46,7 +46,7 @@ case class SortBasedAggregateExec(
   AttributeSet(aggregateBufferAttributes)
 
   override private[sql] lazy val metrics = Map(
-"numOutputRows" -> SQLMetrics.createLongMetric(sparkContext, "number 
of output rows"))
+"numOutputRows" -> SQLMetrics.createSumMetric(sparkContext, "number of 
output rows"))
--- End diff --

`SQLMetric`s are all long accumulators, so I chose the name `sum` to describe 
the behaviour, not the type...
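
A minimal sketch of what that naming implies under the new API in this PR 
(the method set is taken from the `NewAccumulator` doc comment quoted later in 
this thread; the exact `SQLMetric` internals are assumptions):

    // "Sum" describes how updates combine, not the stored type (still Long).
    class SumMetric extends NewAccumulator[Long, Long] {
      private var _sum = 0L

      override def isZero(): Boolean = _sum == 0L

      override def copyAndReset(): SumMetric = new SumMetric

      override def add(v: Long): Unit = _sum += v

      // Folds the other accumulator's output into this one, in place.
      override def merge(other: NewAccumulator[Long, Long]): Unit =
        _sum += other.localValue

      override def localValue: Long = _sum
    }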





[GitHub] spark pull request: [SPARK-14935][CORE] DistributedSuite "local-cl...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12744#issuecomment-215304506
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-14935][CORE] DistributedSuite "local-cl...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12744#issuecomment-215304507
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57202/
Test PASSed.





[GitHub] spark pull request: [SPARK-14935][CORE] DistributedSuite "local-cl...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12744#issuecomment-215304423
  
**[Test build #57202 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57202/consoleFull)**
 for PR 12744 at commit 
[`0f14ad2`](https://github.com/apache/spark/commit/0f14ad2698bb9b2fbf582c44e3bb8f796b873750).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/12612#issuecomment-215304306
  
This looks pretty good to me. We should get it to pass tests and then merge 
it asap. Some of the comments can be addressed later.






[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215304302
  
**[Test build #57214 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57214/consoleFull)**
 for PR 12724 at commit 
[`49b7b52`](https://github.com/apache/spark/commit/49b7b528a8928335e16bf6081f7dad1e819c77e2).





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61371102
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortBasedAggregateExec.scala
 ---
@@ -46,7 +46,7 @@ case class SortBasedAggregateExec(
   AttributeSet(aggregateBufferAttributes)
 
   override private[sql] lazy val metrics = Map(
-"numOutputRows" -> SQLMetrics.createLongMetric(sparkContext, "number 
of output rows"))
+"numOutputRows" -> SQLMetrics.createSumMetric(sparkContext, "number of 
output rows"))
--- End diff --

hm I think createLongMetric makes more sense than createSumMetric here ...






[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215303991
  
test this please





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61371062
  
--- Diff: project/MimaExcludes.scala ---
@@ -674,6 +674,19 @@ object MimaExcludes {
   ) ++ Seq(
 // [SPARK-4452][Core]Shuffle data structures can starve others on 
the same thread for memory
 
ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.util.collection.Spillable")
+  ) ++ Seq(
+// SPARK-14654: New accumulator API
+
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulable.zero"),
--- End diff --

do we have to remove this?





[GitHub] spark pull request: [SPARK-14970][SQL] Prevent DataSource from enu...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12748#issuecomment-215303740
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-14346][SQL] Show Create Table (Native)

2016-04-27 Thread xwu0226
Github user xwu0226 commented on the pull request:

https://github.com/apache/spark/pull/12579#issuecomment-215303758
  
@yhuai @liancheng, I see PR 
[#12734](https://github.com/apache/spark/pull/12734) takes care of the 
PARTITIONED BY and CLUSTERED BY (with SORTED BY) clauses for the CTAS syntax, 
but not for the non-CTAS syntax. Now I need to change my PR to adapt to this 
change, which means the generated DDL will be something like `create table t1 
(c1 int, ...) using .. options (..) partitioned by (..) clustered by (...) 
sorted by (...) into ... buckets`. But there won't be a select clause following 
it, since we do not have the original query, and such a generated statement 
will not run because [#12734](https://github.com/apache/spark/pull/12734) does 
not support it. Can we add a fake select clause with a warning message? See the 
illustrative sketch below.

Also, the DataFrameWriter.saveAsTable case is like CTAS. Can we then generate 
the DDL as a regular CTAS statement? This would change my current 
implementation in this PR. 
Please advise, thanks a lot!
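
For concreteness, a hedged sketch of the CTAS shape that #12734 parses. The 
table name, columns, format, and options below are made up for illustration, 
and the exact clause order is an assumption:

    // Hypothetical example of the only form #12734 accepts: the bucketing and
    // partitioning clauses must be followed by a SELECT. Without the original
    // query, SHOW CREATE TABLE cannot emit the trailing AS SELECT, which is
    // the gap described above.
    sqlContext.sql("""
      CREATE TABLE t1
      USING parquet
      OPTIONS (path '/tmp/t1')
      PARTITIONED BY (c2)
      CLUSTERED BY (c1) SORTED BY (c1) INTO 4 BUCKETS
      AS SELECT 1 AS c1, 'a' AS c2
    """)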





[GitHub] spark pull request: [SPARK-14970][SQL] Prevent DataSource from enu...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12748#issuecomment-215303742
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57205/
Test PASSed.





[GitHub] spark pull request: [SPARK-14970][SQL] Prevent DataSource from enu...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12748#issuecomment-215303655
  
**[Test build #57205 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57205/consoleFull)**
 for PR 12748 at commit 
[`4e4e8db`](https://github.com/apache/spark/commit/4e4e8dba8209513d105f0e195ca0b06a3eb6c70e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215303593
  
**[Test build #57213 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57213/consoleFull)**
 for PR 12745 at commit 
[`06f83cc`](https://github.com/apache/spark/commit/06f83cc1de78aaf56942404fb24aca81d1b66d2e).





[GitHub] spark pull request: [SPARK-12919][SPARKR] Implement dapply() on Da...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12493#issuecomment-215303408
  
**[Test build #57201 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57201/consoleFull)**
 for PR 12493 at commit 
[`2264b57`](https://github.com/apache/spark/commit/2264b57a2d5f375eae6520b492a2152be259ccaa).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-12919][SPARKR] Implement dapply() on Da...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12493#issuecomment-215303502
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-12919][SPARKR] Implement dapply() on Da...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12493#issuecomment-215303503
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57201/
Test PASSed.





[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9#issuecomment-215303378
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9#issuecomment-215303382
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57209/
Test PASSed.





[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9#issuecomment-215303315
  
**[Test build #57209 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57209/consoleFull)**
 for PR 9 at commit 
[`914d319`](https://github.com/apache/spark/commit/914d31991e10dd0d77f03e167f19147ff696834c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12750#issuecomment-215303075
  
**[Test build #57212 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57212/consoleFull)**
 for PR 12750 at commit 
[`4bbf429`](https://github.com/apache/spark/commit/4bbf4292802e475d84ec55994a4ebae3ddc2f4da).





[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API

2016-04-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12612#discussion_r61370631
  
--- Diff: core/src/main/scala/org/apache/spark/NewAccumulator.scala ---
@@ -0,0 +1,356 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import java.{lang => jl}
+import java.io.ObjectInputStream
+import java.util.concurrent.atomic.AtomicLong
+import javax.annotation.concurrent.GuardedBy
+
+import org.apache.spark.scheduler.AccumulableInfo
+import org.apache.spark.util.Utils
+
+
+private[spark] case class AccumulatorMetadata(
+    id: Long,
+    name: Option[String],
+    countFailedValues: Boolean) extends Serializable
+
+
+/**
+ * The base class for accumulators, which can accumulate inputs of type `IN` and produce output
+ * of type `OUT`. Implementations must define the following methods:
+ *  - isZero:       tells whether this accumulator has its zero value, e.g. for a counter
+ *                  accumulator 0 is the zero value; for a list accumulator, Nil is the zero value.
+ *  - copyAndReset: creates a new copy of this accumulator with the zero value, i.e. calling
+ *                  `isZero` on the copy must return true.
+ *  - add:          defines how to accumulate an input, e.g. a simple `+=` for a counter
+ *                  accumulator.
+ *  - merge:        defines how to merge in another accumulator of the same type.
+ *  - localValue:   defines how to produce the output from the current state of this accumulator.
+ *
+ * Implementations decide how to store intermediate values, e.g. a long field for a counter
+ * accumulator, or a double and a long field for an average accumulator (storing the sum and
+ * count).
+ */
+abstract class NewAccumulator[IN, OUT] extends Serializable {
+  private[spark] var metadata: AccumulatorMetadata = _
+  private[this] var atDriverSide = true
+
+  private[spark] def register(
+      sc: SparkContext,
+      name: Option[String] = None,
+      countFailedValues: Boolean = false): Unit = {
+    if (this.metadata != null) {
+      throw new IllegalStateException("Cannot register an Accumulator twice.")
+    }
+    this.metadata = AccumulatorMetadata(AccumulatorContext.newId(), name, countFailedValues)
+    AccumulatorContext.register(this)
+    sc.cleaner.foreach(_.registerAccumulatorForCleanup(this))
+  }
+
+  final def isRegistered: Boolean =
+    metadata != null && AccumulatorContext.originals.containsKey(metadata.id)
+
+  private def assertMetadataNotNull(): Unit = {
+    if (metadata == null) {
+      throw new IllegalAccessError("The metadata of this accumulator has not been assigned yet.")
+    }
+  }
+
+  final def id: Long = {
+    assertMetadataNotNull()
+    metadata.id
+  }
+
+  final def name: Option[String] = {
+    assertMetadataNotNull()
+    metadata.name
+  }
+
+  final def countFailedValues: Boolean = {
+    assertMetadataNotNull()
+    metadata.countFailedValues
+  }
+
+  private[spark] def toInfo(update: Option[Any], value: Option[Any]): AccumulableInfo = {
+    val isInternal = name.exists(_.startsWith(InternalAccumulator.METRICS_PREFIX))
+    new AccumulableInfo(id, name, update, value, isInternal, countFailedValues)
+  }
+
+  final private[spark] def isAtDriverSide: Boolean = atDriverSide
+
+  def isZero(): Boolean
+
+  def copyAndReset(): NewAccumulator[IN, OUT]
+
+  def add(v: IN): Unit
+
+  def +=(v: IN): Unit = add(v)
+
+  def merge(other: NewAccumulator[IN, OUT]): Unit
--- End diff --

need to document that this merges in place
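
For reference, a minimal sketch of what an implementation of this contract could look like. Illustrative only: `localValue`'s signature is assumed from the doc comment above, since the rest of the abstract class is truncated in this diff.

```scala
// A sketch of a list accumulator, assuming the contract documented above.
class ListAccumulator[T] extends NewAccumulator[T, List[T]] {
  private var buf: List[T] = Nil  // Nil is this accumulator's zero value

  override def isZero(): Boolean = buf.isEmpty

  // Calling isZero on the copy returns true, as required.
  override def copyAndReset(): ListAccumulator[T] = new ListAccumulator[T]

  override def add(v: T): Unit = buf = v :: buf

  // Mutates `this` rather than returning a new accumulator, i.e. it merges
  // in place, which is exactly what the review comment asks to document.
  override def merge(other: NewAccumulator[T, List[T]]): Unit = other match {
    case o: ListAccumulator[T] => buf = o.buf ++ buf
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${getClass.getName} with ${other.getClass.getName}")
  }

  // Assumed signature, per the doc comment's description of localValue.
  def localValue: List[T] = buf.reverse
}
```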


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/12745#discussion_r61370492
  
--- Diff: core/src/main/scala/org/apache/spark/util/SignalUtils.scala ---
@@ -94,7 +94,7 @@ private[spark] object SignalUtils extends Logging {
 
  // run all actions, escalate to parent handler if no action catches the signal
  // (i.e. all actions return false)
-  val escalate = actions.asScala.forall { action => !action() }
+  val escalate = actions.asScala.map(action => action()).forall(_ == false)
--- End diff --

good idea


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

2016-04-27 Thread JoshRosen
GitHub user JoshRosen opened a pull request:

https://github.com/apache/spark/pull/12750

[SPARK-14972] Improve performance of JSON schema inference's compatibleType 
method

This patch improves the performance of `InferSchema.compatibleType` and 
`inferField`. The net result of this patch is a 2x-4x speedup in local 
benchmarks running against cached data with a massive nested schema.

The key idea is to remove unnecessary sorting in `compatibleType`'s 
`StructType` merging code. This code takes two structs, merges the fields with 
matching names, and copies over the unique fields, producing a new schema which 
is the union of the two structs' schemas. Previously, this code performed a 
very inefficient `groupBy()` to match up fields with the same name, but this is 
unnecessary because `inferField` already sorts structs' fields by name: since 
both lists of fields are sorted, we can simply merge them in a single pass.
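
As a rough sketch of the single-pass merge idea (illustrative names only; `mergeDataTypes` stands in for the actual `compatibleType` logic, and this is not the PR's exact code):

```scala
import scala.collection.mutable
import org.apache.spark.sql.types.{DataType, StructField}

// Merge two field sequences that are both sorted by name in a single pass.
def mergeSortedFields(
    left: IndexedSeq[StructField],
    right: IndexedSeq[StructField],
    mergeDataTypes: (DataType, DataType) => DataType): IndexedSeq[StructField] = {
  val merged = mutable.ArrayBuffer.empty[StructField]
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val cmp = left(i).name.compareTo(right(j).name)
    if (cmp == 0) {
      // Same field name in both structs: merge the two types.
      merged += StructField(left(i).name, mergeDataTypes(left(i).dataType, right(j).dataType))
      i += 1
      j += 1
    } else if (cmp < 0) {
      merged += left(i)   // field unique to the left struct
      i += 1
    } else {
      merged += right(j)  // field unique to the right struct
      j += 1
    }
  }
  // Copy over whatever remains on either side.
  merged ++= left.drop(i)
  merged ++= right.drop(j)
  merged.toIndexedSeq
}
```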

This patch also speeds up the existing field sorting in `inferField`: the 
old sorting code allocated unnecessary intermediate collections, while the new 
code uses mutable collections and performs in-place sorting.
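
For example, one way to sort an array of fields by name in place without allocating intermediate collections (a sketch with assumed inputs, not necessarily the PR's exact code):

```scala
import java.util.Arrays
import org.apache.spark.sql.types.{StringType, StructField}

// Hypothetical input; in the real code the fields come from inferField.
val fields: Array[StructField] =
  Array(StructField("b", StringType), StructField("a", StringType))

// Sorts the array in place: no intermediate collection is allocated.
Arrays.sort(fields, Ordering.by[StructField, String](_.name))
```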

Finally, I replaced a `treeAggregate` call with `fold`: I doubt that 
`treeAggregate` will benefit us very much because the schemas would have to be 
enormous to realize large savings in network traffic. Since most schemas are 
probably fairly small in serialized form, they should typically fit within a 
direct task result and therefore can be incrementally merged at the driver as 
individual tasks finish. This change eliminates an entire (short) scheduler 
stage. 
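
In shape, the driver-side merge after this change might look like the following (hypothetical names; `perRecordTypes` and `mergeDataTypes` are illustrative, not the PR's identifiers):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical: each record's inferred type is folded into a single schema
// at the driver as task results arrive, with no extra scheduler stage.
def mergeAtDriver(
    perRecordTypes: RDD[DataType],
    mergeDataTypes: (DataType, DataType) => DataType): DataType = {
  perRecordTypes.fold(StructType(Nil))(mergeDataTypes)
}
```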

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/JoshRosen/spark schema-inference-speedups

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12750


commit 5406092f5c16189f1b6c46c1ed324f13dadd57b1
Author: Josh Rosen 
Date:   2016-04-28T02:12:29Z

Stop using groupByKey to merge struct schemas.

commit 5319a6abfb37cb43c500385b21e9d905e831d24d
Author: Josh Rosen 
Date:   2016-04-28T02:38:21Z

Take advantage of fact that structs' fields are already sorted by name.

commit d9793f71cf5d9e2ff5bf7d62b561e9d6e377d1a1
Author: Josh Rosen 
Date:   2016-04-28T03:01:47Z

Perform in-place sort without additional allocation.

commit 4bbf4292802e475d84ec55994a4ebae3ddc2f4da
Author: Josh Rosen 
Date:   2016-04-28T03:04:27Z

Replace treeAggregate with fold.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215302424
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215302427
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57193/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12724#issuecomment-215302234
  
**[Test build #57193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57193/consoleFull)** for PR 12724 at commit [`49b7b52`](https://github.com/apache/spark/commit/49b7b528a8928335e16bf6081f7dad1e819c77e2).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12683#issuecomment-215302215
  
**[Test build #57211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57211/consoleFull)** for PR 12683 at commit [`55523f7`](https://github.com/apache/spark/commit/55523f7713615292c427c4000cee36b86c4fe7a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...

2016-04-27 Thread jodersky
Github user jodersky commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215302197
  
removed the label, sorry about that


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10001][Core][Hotfix] Don't short-circui...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12745#issuecomment-215302102
  
**[Test build #2898 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2898/consoleFull)** for PR 12745 at commit [`7fe0e54`](https://github.com/apache/spark/commit/7fe0e54c5e39ff942fd58a16d5e5e309887ee883).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10001][Core][Hotfix] Don't short-circui...

2016-04-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12745#discussion_r61370370
  
--- Diff: core/src/main/scala/org/apache/spark/util/SignalUtils.scala ---
@@ -94,7 +94,7 @@ private[spark] object SignalUtils extends Logging {
 
  // run all actions, escalate to parent handler if no action catches the signal
  // (i.e. all actions return false)
-  val escalate = actions.asScala.forall { action => !action() }
+  val escalate = actions.asScala.map(action => action()).forall(_ == false)
--- End diff --

we should probably leave a comment here saying why we do a map and then a 
forall, to make sure nobody comes along later and "optimizes" the map.forall 
into just a forall.
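
For illustration, a self-contained sketch of the difference (hypothetical actions, not Spark's code):

```scala
// Each action is a () => Boolean; true means "this action handled the signal".
val actions = Seq(
  () => true,
  () => { println("side effect that must always run"); false })

// forall short-circuits: after the first action returns true, the rest
// never run, so their side effects are silently skipped.
val escalateShortCircuit = actions.forall(action => !action())

// map forces every action to run before forall inspects the results.
val escalate = actions.map(action => action()).forall(_ == false)
// prints "side effect that must always run"; escalate is false
```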



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   6   7   8   9   10   >