[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12720#discussion_r61376668 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala ---

@@ -337,6 +337,16 @@ case class PrettyAttribute(
   override def nullable: Boolean = true
 }
+/**
+ * A place holder used to hold a reference that has been resolved to a field outside of the current
+ * plan. This is used for correlated subqueries.
+ */
+case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable {
+  override def dataType: DataType = e.dataType
+  override def nullable: Boolean = e.nullable
+  override def prettyName: String = "outer"

--- End diff --

Should we include `e` here?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
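Davies' suggestion — showing the wrapped expression in the string form — could look like the following sketch. The classes here are simplified stand-ins for Catalyst's `OuterReference` and `NamedExpression`, for illustration only:

```scala
// Simplified stand-ins, not Catalyst's real classes.
case class FakeAttribute(name: String)

case class OuterReference(e: FakeAttribute) {
  def prettyName: String = "outer"
  // Including `e` makes plan strings show which outer column is referenced.
  override def toString: String = s"$prettyName(${e.name})"
}

println(OuterReference(FakeAttribute("a"))) // prints "outer(a)"
```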
[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12720#discussion_r61376609 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala ---

@@ -337,6 +337,16 @@ case class PrettyAttribute(
   override def nullable: Boolean = true
 }
+/**
+ * A place holder used to hold a reference that has been resolved to a field outside of the current
+ * plan. This is used for correlated subqueries.
+ */
+case class OuterReference(e: NamedExpression) extends LeafExpression with Unevaluable {

--- End diff --

This is a good idea; it makes it easier to find all the outer references.
[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...
Github user ueshin commented on the pull request: https://github.com/apache/spark/pull/12655#issuecomment-215317033 I saw PR #8427 now. Both the #8427 approach and @markhamstra's approach (should we use `getOrElseUpdate` instead of `getOrElse`?) seem like the simplest ways to fix the duplicate-stage issue, aside from the risk of `StackOverflowError`. I agree that we should fix the duplicate-stage issue first in the simplest way possible, with #8427 or @markhamstra's approach, and then discuss the `SOE` risk after one of them is merged.
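The `getOrElseUpdate` idea can be sketched as a memoized stage lookup. The `shuffleToMapStage` map, `newStage` helper, and `String` stage type are hypothetical simplifications for illustration, not the real `DAGScheduler` code:

```scala
import scala.collection.mutable

// Hypothetical registry mapping a shuffle id to its (stand-in) stage.
val shuffleToMapStage = mutable.Map.empty[Int, String]
var stagesCreated = 0

def newStage(shuffleId: Int): String = {
  stagesCreated += 1
  s"ShuffleMapStage-$shuffleId"
}

// getOrElseUpdate evaluates newStage at most once per shuffleId, so repeated
// lookups cannot register duplicate stages. A deeply recursive newStage
// (walking a long RDD lineage) could still risk StackOverflowError.
def getShuffleMapStage(shuffleId: Int): String =
  shuffleToMapStage.getOrElseUpdate(shuffleId, newStage(shuffleId))

getShuffleMapStage(1)
getShuffleMapStage(1)
println(stagesCreated) // prints 1
```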
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215316922 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57215/ Test PASSed.
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215316921 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215316797 **[Test build #57215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57215/consoleFull)** for PR 12750 at commit [`5d34a64`](https://github.com/apache/spark/commit/5d34a646046727ffaf8f5932b2dfeae3a5c10d32). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14938][ML] replace some RDD.map with Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12718#issuecomment-215316563 **[Test build #57222 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57222/consoleFull)** for PR 12718 at commit [`e57332a`](https://github.com/apache/spark/commit/e57332a0b72f6c0bf58ee72c77eddf4f5904fe9b).
[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12720#discussion_r61376302 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -866,71 +867,189 @@ class Analyzer( * Note: CTEs are handled in CTESubstitution. */ object ResolveSubquery extends Rule[LogicalPlan] with PredicateHelper { - /** - * Resolve the correlated predicates in the clauses (e.g. WHERE or HAVING) of a - * sub-query by using the plan the predicates should be correlated to. + * Resolve a subquery using the outer plan. This rule creates a dedicated analyzer which can + * also resolve outer plan references. */ -private def resolveCorrelatedSubquery( -sub: LogicalPlan, outer: LogicalPlan, -aliases: scala.collection.mutable.Map[Attribute, Alias]): LogicalPlan = { - // First resolve as much of the sub-query as possible - val analyzed = execute(sub) - if (analyzed.resolved) { -analyzed - } else { -// Only resolve the lowest plan that is not resolved by outer plan, otherwise it could be -// resolved by itself -val resolvedByOuter = analyzed transformDown { - case q: LogicalPlan if q.childrenResolved && !q.resolved => -q transformExpressions { - case u @ UnresolvedAttribute(nameParts) => -withPosition(u) { - try { -val outerAttrOpt = outer.resolve(nameParts, resolver) -if (outerAttrOpt.isDefined) { - val outerAttr = outerAttrOpt.get - if (q.inputSet.contains(outerAttr)) { -// Got a conflict, create an alias for the attribute come from outer table -val alias = Alias(outerAttr, outerAttr.toString)() -val attr = alias.toAttribute -aliases += attr -> alias -attr - } else { -outerAttr - } -} else { - u -} - } catch { -case a: AnalysisException => u - } -} +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + // Only a few unary nodes (Project/Filter/Aggregate/Having) can contain subqueries. 
+ case q: UnaryNode if q.childrenResolved => +q transformExpressions { + case e: SubqueryExpression if !e.query.resolved => +val analyzer = new Analyzer(catalog, conf) { + override val extendedCheckRules = self.extendedCheckRules + override val extendedResolutionRules = self.extendedResolutionRules :+ +ResolveOuterReferences(q.child, resolver) } +e.withNewPlan(analyzer.execute(e.query)) --- End diff -- For example, in `Filter('a > 'b.a, Aggregate(['b], ['sum('c)]))`, `b` and `c` could be resolved first (in ResolveReferences), then `sum` could be resolved as an aggregate function (in ResolveAggregateFunctions). At that point the aggregate becomes resolved, so `a` would be resolved as an outer reference by `OuterReference`, but it should not be.
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12745
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215315836 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57213/ Test PASSed.
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215315834 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215315768 Merging in master. Thanks.
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215315711 **[Test build #57213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57213/consoleFull)** for PR 12745 at commit [`06f83cc`](https://github.com/apache/spark/commit/06f83cc1de78aaf56942404fb24aca81d1b66d2e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12720#discussion_r61375897 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -866,71 +867,189 @@ class Analyzer( * Note: CTEs are handled in CTESubstitution. */ object ResolveSubquery extends Rule[LogicalPlan] with PredicateHelper { - /** - * Resolve the correlated predicates in the clauses (e.g. WHERE or HAVING) of a - * sub-query by using the plan the predicates should be correlated to. + * Resolve a subquery using the outer plan. This rule creates a dedicated analyzer which can + * also resolve outer plan references. */ -private def resolveCorrelatedSubquery( -sub: LogicalPlan, outer: LogicalPlan, -aliases: scala.collection.mutable.Map[Attribute, Alias]): LogicalPlan = { - // First resolve as much of the sub-query as possible - val analyzed = execute(sub) - if (analyzed.resolved) { -analyzed - } else { -// Only resolve the lowest plan that is not resolved by outer plan, otherwise it could be -// resolved by itself -val resolvedByOuter = analyzed transformDown { - case q: LogicalPlan if q.childrenResolved && !q.resolved => -q transformExpressions { - case u @ UnresolvedAttribute(nameParts) => -withPosition(u) { - try { -val outerAttrOpt = outer.resolve(nameParts, resolver) -if (outerAttrOpt.isDefined) { - val outerAttr = outerAttrOpt.get - if (q.inputSet.contains(outerAttr)) { -// Got a conflict, create an alias for the attribute come from outer table -val alias = Alias(outerAttr, outerAttr.toString)() -val attr = alias.toAttribute -aliases += attr -> alias -attr - } else { -outerAttr - } -} else { - u -} - } catch { -case a: AnalysisException => u - } -} +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + // Only a few unary nodes (Project/Filter/Aggregate/Having) can contain subqueries. 
+ case q: UnaryNode if q.childrenResolved => +q transformExpressions { + case e: SubqueryExpression if !e.query.resolved => +val analyzer = new Analyzer(catalog, conf) { + override val extendedCheckRules = self.extendedCheckRules + override val extendedResolutionRules = self.extendedResolutionRules :+ +ResolveOuterReferences(q.child, resolver) } +e.withNewPlan(analyzer.execute(e.query)) --- End diff -- (updated comment)
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215315114 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57218/ Test FAILed.
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215315097 **[Test build #57218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57218/consoleFull)** for PR 12683 at commit [`6650890`](https://github.com/apache/spark/commit/665089051aef4dd4ac189eed329ee55d3e8df9e3). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215315113 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12720#discussion_r61375586 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -866,71 +867,189 @@ class Analyzer( * Note: CTEs are handled in CTESubstitution. */ object ResolveSubquery extends Rule[LogicalPlan] with PredicateHelper { - /** - * Resolve the correlated predicates in the clauses (e.g. WHERE or HAVING) of a - * sub-query by using the plan the predicates should be correlated to. + * Resolve a subquery using the outer plan. This rule creates a dedicated analyzer which can + * also resolve outer plan references. */ -private def resolveCorrelatedSubquery( -sub: LogicalPlan, outer: LogicalPlan, -aliases: scala.collection.mutable.Map[Attribute, Alias]): LogicalPlan = { - // First resolve as much of the sub-query as possible - val analyzed = execute(sub) - if (analyzed.resolved) { -analyzed - } else { -// Only resolve the lowest plan that is not resolved by outer plan, otherwise it could be -// resolved by itself -val resolvedByOuter = analyzed transformDown { - case q: LogicalPlan if q.childrenResolved && !q.resolved => -q transformExpressions { - case u @ UnresolvedAttribute(nameParts) => -withPosition(u) { - try { -val outerAttrOpt = outer.resolve(nameParts, resolver) -if (outerAttrOpt.isDefined) { - val outerAttr = outerAttrOpt.get - if (q.inputSet.contains(outerAttr)) { -// Got a conflict, create an alias for the attribute come from outer table -val alias = Alias(outerAttr, outerAttr.toString)() -val attr = alias.toAttribute -aliases += attr -> alias -attr - } else { -outerAttr - } -} else { - u -} - } catch { -case a: AnalysisException => u - } -} +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + // Only a few unary nodes (Project/Filter/Aggregate/Having) can contain subqueries. 
+ case q: UnaryNode if q.childrenResolved => +q transformExpressions { + case e: SubqueryExpression if !e.query.resolved => +val analyzer = new Analyzer(catalog, conf) { + override val extendedCheckRules = self.extendedCheckRules + override val extendedResolutionRules = self.extendedResolutionRules :+ +ResolveOuterReferences(q.child, resolver) } +e.withNewPlan(analyzer.execute(e.query)) --- End diff -- The new rule uses a special expression called `OuterReference` for all outer references. It basically hides the outer reference and makes sure no collisions happen. We resolve collisions when we pull out the predicates (by adding a single project if we need one).
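The collision-avoidance idea can be illustrated with a toy expression tree (simplified stand-ins, not Catalyst's real classes): wrapping an attribute in an `Outer` node hides it from the current plan's resolution scope, so a subquery column named `a` cannot clash with an outer `a`:

```scala
// Toy expression tree; Outer plays the role of OuterReference.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Outer(e: Attr) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr

// Attributes wrapped in Outer are excluded from the local reference set;
// they are resolved against the outer plan instead.
def localRefs(e: Expr): Set[String] = e match {
  case Attr(n)           => Set(n)
  case Outer(_)          => Set.empty
  case GreaterThan(l, r) => localRefs(l) ++ localRefs(r)
}

// a > outer(a): only the unwrapped `a` is a local reference.
val cond = GreaterThan(Attr("a"), Outer(Attr("a")))
println(localRefs(cond)) // prints Set(a)
```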
[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/12736#issuecomment-215314522 LGTM. An unrelated question: how do we express the EXCEPT ALL semantics?
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215314462 **[Test build #2898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2898/consoleFull)** for PR 12745 at commit [`7fe0e54`](https://github.com/apache/spark/commit/7fe0e54c5e39ff942fd58a16d5e5e309887ee883). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/12720#discussion_r61375441 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala --- @@ -75,76 +77,63 @@ case class ScalarSubquery( override def foldable: Boolean = false override def nullable: Boolean = true - override def withNewPlan(plan: LogicalPlan): ScalarSubquery = ScalarSubquery(plan, exprId) + override def conditions: Seq[Expression] = conditionOption.toSeq.flatten - override def toString: String = s"subquery#${exprId.id}" + override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(query = plan) + + override def toString: String = s"subquery#${exprId.id} $conditionString" } /** * A predicate subquery checks the existence of a value in a sub-query. We currently only allow * [[PredicateSubquery]] expressions within a Filter plan (i.e. WHERE or a HAVING clause). This will * be rewritten into a left semi/anti join during analysis. */ -abstract class PredicateSubquery extends SubqueryExpression with Unevaluable with Predicate { +case class PredicateSubquery( +query: LogicalPlan, +override val children: Seq[Expression] = Seq.empty, +nullAware: Boolean = false, +exprId: ExprId = NamedExpression.newExprId) + extends SubqueryExpression with Predicate with Unevaluable { + override lazy val resolved = childrenResolved && query.resolved + override lazy val references: AttributeSet = super.references -- query.outputSet override def nullable: Boolean = false + override def conditions: Seq[Expression] = children + override def plan: LogicalPlan = SubqueryAlias(toString, query) + override def withNewPlan(plan: LogicalPlan): PredicateSubquery = copy(query = plan) + override def toString: String = s"predicate-subquery#${exprId.id} $conditionString" } object PredicateSubquery { def hasPredicateSubquery(e: Expression): Boolean = { -e.find(_.isInstanceOf[PredicateSubquery]).isDefined +e.find { + case _: PredicateSubquery | _: 
ListQuery | _: Exists => true + case _ => false +}.isDefined } } /** - * The [[InSubQuery]] predicate checks the existence of a value in a sub-query. For example (SQL): + * A [[ListQuery]] expression defines the query which we want to search in an IN subquery + * expression. It should and can only be used in conjunction with a IN expression. + * + * For example (SQL): * {{{ * SELECT * * FROMa * WHERE a.id IN (SELECT id *FROMb) * }}} */ -case class InSubQuery( -value: Expression, -query: LogicalPlan, -exprId: ExprId = NamedExpression.newExprId) extends PredicateSubquery { - override def children: Seq[Expression] = value :: Nil - override lazy val resolved: Boolean = value.resolved && query.resolved - override def withNewPlan(plan: LogicalPlan): InSubQuery = InSubQuery(value, plan, exprId) - override def plan: LogicalPlan = SubqueryAlias(s"subquery#${exprId.id}", query) - - /** - * The unwrapped value side expressions. - */ - lazy val expressions: Seq[Expression] = value match { -case CreateStruct(cols) => cols -case col => Seq(col) - } - - /** - * Check if the number of columns and the data types on both sides match. - */ - override def checkInputDataTypes(): TypeCheckResult = { -// Check the number of arguments. -if (expressions.length != query.output.length) { - return TypeCheckResult.TypeCheckFailure( -s"The number of fields in the value (${expressions.length}) does not match with " + - s"the number of columns in the subquery (${query.output.length})") -} - -// Check the argument types. 
-expressions.zip(query.output).zipWithIndex.foreach { - case ((e, a), i) if e.dataType != a.dataType => -return TypeCheckResult.TypeCheckFailure( - s"The data type of value[$i] (${e.dataType}) does not match " + -s"subquery column '${a.name}' (${a.dataType}).") - case _ => -} - -TypeCheckResult.TypeCheckSuccess - } - - override def toString: String = s"$value IN subquery#${exprId.id}" +case class ListQuery(query: LogicalPlan, exprId: ExprId = NamedExpression.newExprId) + extends SubqueryExpression with Unevaluable { + override lazy val resolved = false + override def dataType: DataType = ArrayType(NullType) --- End diff -- So `ListQuery` is converted into `PredicateSubquery` in the analyzer. `PredicateSubquery` exposes its 'join' expressions as it
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215314235 **[Test build #57221 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57221/consoleFull)** for PR 12724 at commit [`49b7b52`](https://github.com/apache/spark/commit/49b7b528a8928335e16bf6081f7dad1e819c77e2).
[GitHub] spark pull request: [SPARK-14850][ML] specialize array data for Ve...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/12640#discussion_r61375365 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala --- @@ -29,6 +29,82 @@ abstract class ArrayData extends SpecializedGetters with Serializable { def array: Array[Any] + override def equals(o: Any): Boolean = { +if (!o.isInstanceOf[ArrayData]) { + return false +} + +val other = o.asInstanceOf[ArrayData] +if (other eq null) { + return false +} + +val len = numElements() +if (len != other.numElements()) { + return false +} + +var i = 0 +while (i < len) { + if (isNullAt(i) != other.isNullAt(i)) { +return false + } + if (!isNullAt(i)) { +val o1 = array(i) +val o2 = other.array(i) +o1 match { + case b1: Array[Byte] => +if (!o2.isInstanceOf[Array[Byte]] || + !java.util.Arrays.equals(b1, o2.asInstanceOf[Array[Byte]])) { + return false +} + case f1: Float if java.lang.Float.isNaN(f1) => +if (!o2.isInstanceOf[Float] || ! java.lang.Float.isNaN(o2.asInstanceOf[Float])) { + return false +} + case d1: Double if java.lang.Double.isNaN(d1) => +if (!o2.isInstanceOf[Double] || ! java.lang.Double.isNaN(o2.asInstanceOf[Double])) { + return false +} + case _ => if (o1 != o2) { +return false + } +} + } + i += 1 +} +true + } + + override def hashCode: Int = { +var result: Int = 37 +var i = 0 +val len = numElements() +while (i < len) { --- End diff -- This could be very expensive for large arrays because it scans all elements, which is unnecessary for generating the hashCode.
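A common way to cap that cost is to hash only a bounded prefix of the array plus its length; the `boundedHashCode` helper and its `maxElements` bound below are illustrative choices, not Spark's actual implementation:

```scala
// hashCode cost stays O(maxElements) for arbitrarily large arrays, while
// equal arrays still produce equal hashes (same length + same prefix).
def boundedHashCode(xs: Array[Int], maxElements: Int = 16): Int = {
  var result = 37 * 41 + xs.length
  var i = 0
  val n = math.min(xs.length, maxElements)
  while (i < n) {
    result = 37 * result + xs(i)
    i += 1
  }
  result
}

val big = Array.tabulate(1000000)(identity)
// Equal content implies equal hash, without scanning a million elements.
assert(boundedHashCode(big) == boundedHashCode(big.clone()))
```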
[GitHub] spark pull request: [SPARK-14858][SQL] Enable subquery pushdown
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12720#issuecomment-215314230 **[Test build #2899 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2899/consoleFull)** for PR 12720 at commit [`62c5c2f`](https://github.com/apache/spark/commit/62c5c2f6c628593d860a42baa35c7f6d3cdd9305).
[GitHub] spark pull request: [SPARK-14850][ML] specialize array data for Ve...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/12640#issuecomment-215314086 @cloud-fan This is still much slower than 1.4, and adding more subclasses of ArrayData may prevent the JIT from inlining methods like `getInt` and `getDouble`. Is it easy to convert to `UnsafeArrayData` directly with a memory copy?
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215314116 test this please
[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12736#discussion_r61375198 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -398,6 +398,66 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { Row(4, "d") :: Nil) checkAnswer(lowerCaseData.except(lowerCaseData), Nil) checkAnswer(upperCaseData.except(upperCaseData), Nil) + +// check null equality +checkAnswer( + nullInts.except(nullInts.filter("0 = 1")), + nullInts) +checkAnswer( + nullInts.except(nullInts), + Nil) + +// check if values are de-duplicated +checkAnswer( + allNulls.except(allNulls.filter("0 = 1")), + Row(null) :: Nil) +checkAnswer( + allNulls.except(allNulls), + Nil) + +// check if values are de-duplicated +val df = Seq(("id1", 1), ("id1", 1), ("id", 1), ("id1", 2)).toDF("id", "value") +checkAnswer( + df.except(df.filter("0 = 1")), + Row("id1", 1) :: + Row("id", 1) :: + Row("id1", 2) :: Nil) + +// check if the empty set on the left side works +checkAnswer( + allNulls.filter("0 = 1").except(allNulls), + Nil) + } + + test("except distinct - SQL compliance") { +val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id") +val df_right = Seq(1, 3).toDF("id") + +checkAnswer( + df_left.except(df_right), + Row(2) :: Row(4) :: Nil +) + } + + test("except - nullability") { +val nonNullableInts = Seq(Tuple1(11), Tuple1(3)).toDF() +assert(nonNullableInts.schema.forall(_.nullable == false)) --- End diff -- nit: `forall(!_.nullable)`
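The nit above is purely stylistic: negating the predicate reads better than comparing a Boolean to `false`, and the two forms are equivalent. A self-contained illustration (the `Field` case class stands in for `StructField`):

```scala
// The two spellings from the review produce the same result on any
// schema-like sequence of fields with a Boolean `nullable` flag.
case class Field(name: String, nullable: Boolean)

val schema = Seq(Field("a", nullable = false), Field("b", nullable = false))

val verbose = schema.forall(_.nullable == false) // style being flagged
val idiomatic = schema.forall(!_.nullable)       // suggested replacement
```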
[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12736#discussion_r61375067 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala --- @@ -488,14 +488,6 @@ class HiveTypeCoercionSuite extends PlanTest { assert(r1.right.isInstanceOf[Project]) assert(r2.left.isInstanceOf[Project]) assert(r2.right.isInstanceOf[Project]) - -val r3 = wt(Except(firstTable, firstTable)).asInstanceOf[Except] -checkOutput(r3.left, Seq(IntegerType, DecimalType.SYSTEM_DEFAULT, ByteType, DoubleType)) -checkOutput(r3.right, Seq(IntegerType, DecimalType.SYSTEM_DEFAULT, ByteType, DoubleType)) - -// Check if no Project is added -assert(r3.left.isInstanceOf[LocalRelation]) -assert(r3.right.isInstanceOf[LocalRelation]) --- End diff -- Why remove these? We didn't change the analysis of `Except`, right?
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61374828 --- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala --- @@ -175,124 +172,143 @@ class TaskMetrics private[spark] () extends Serializable { } // Only used for test - private[spark] val testAccum = -sys.props.get("spark.testing").map(_ => TaskMetrics.createLongAccum(TEST_ACCUM)) - - @transient private[spark] lazy val internalAccums: Seq[Accumulable[_, _]] = { -val in = inputMetrics -val out = outputMetrics -val sr = shuffleReadMetrics -val sw = shuffleWriteMetrics -Seq(_executorDeserializeTime, _executorRunTime, _resultSize, _jvmGCTime, - _resultSerializationTime, _memoryBytesSpilled, _diskBytesSpilled, _peakExecutionMemory, - _updatedBlockStatuses, sr._remoteBlocksFetched, sr._localBlocksFetched, sr._remoteBytesRead, - sr._localBytesRead, sr._fetchWaitTime, sr._recordsRead, sw._bytesWritten, sw._recordsWritten, - sw._writeTime, in._bytesRead, in._recordsRead, out._bytesWritten, out._recordsWritten) ++ - testAccum - } + private[spark] val testAccum = sys.props.get("spark.testing").map(_ => new LongAccumulator) + + + import InternalAccumulator._ + @transient private[spark] lazy val nameToAccums = LinkedHashMap( +EXECUTOR_DESERIALIZE_TIME -> _executorDeserializeTime, +EXECUTOR_RUN_TIME -> _executorRunTime, +RESULT_SIZE -> _resultSize, +JVM_GC_TIME -> _jvmGCTime, +RESULT_SERIALIZATION_TIME -> _resultSerializationTime, +MEMORY_BYTES_SPILLED -> _memoryBytesSpilled, +DISK_BYTES_SPILLED -> _diskBytesSpilled, +PEAK_EXECUTION_MEMORY -> _peakExecutionMemory, +UPDATED_BLOCK_STATUSES -> _updatedBlockStatuses, +shuffleRead.REMOTE_BLOCKS_FETCHED -> shuffleReadMetrics._remoteBlocksFetched, +shuffleRead.LOCAL_BLOCKS_FETCHED -> shuffleReadMetrics._localBlocksFetched, +shuffleRead.REMOTE_BYTES_READ -> shuffleReadMetrics._remoteBytesRead, +shuffleRead.LOCAL_BYTES_READ -> shuffleReadMetrics._localBytesRead, +shuffleRead.FETCH_WAIT_TIME -> 
shuffleReadMetrics._fetchWaitTime, +shuffleRead.RECORDS_READ -> shuffleReadMetrics._recordsRead, +shuffleWrite.BYTES_WRITTEN -> shuffleWriteMetrics._bytesWritten, +shuffleWrite.RECORDS_WRITTEN -> shuffleWriteMetrics._recordsWritten, +shuffleWrite.WRITE_TIME -> shuffleWriteMetrics._writeTime, +input.BYTES_READ -> inputMetrics._bytesRead, +input.RECORDS_READ -> inputMetrics._recordsRead, +output.BYTES_WRITTEN -> outputMetrics._bytesWritten, +output.RECORDS_WRITTEN -> outputMetrics._recordsWritten + ) ++ testAccum.map(TEST_ACCUM -> _) + + @transient private[spark] lazy val internalAccums: Seq[NewAccumulator[_, _]] = +nameToAccums.values.toIndexedSeq /* == * |OTHER THINGS| * == */ - private[spark] def registerForCleanup(sc: SparkContext): Unit = { -internalAccums.foreach { accum => - sc.cleaner.foreach(_.registerAccumulatorForCleanup(accum)) + private[spark] def register(sc: SparkContext): Unit = { +nameToAccums.foreach { + case (name, acc) => acc.register(sc, name = Some(name), countFailedValues = true) } } /** * External accumulators registered with this task. */ - @transient private lazy val externalAccums = new ArrayBuffer[Accumulable[_, _]] + @transient private lazy val externalAccums = new ArrayBuffer[NewAccumulator[_, _]] - private[spark] def registerAccumulator(a: Accumulable[_, _]): Unit = { + private[spark] def registerAccumulator(a: NewAccumulator[_, _]): Unit = { externalAccums += a } - /** - * Return the latest updates of accumulators in this task. - * - * The [[AccumulableInfo.update]] field is always defined and the [[AccumulableInfo.value]] - * field is always empty, since this represents the partial updates recorded in this task, - * not the aggregated value across multiple tasks. 
- */ - def accumulatorUpdates(): Seq[AccumulableInfo] = { -(internalAccums ++ externalAccums).map { a => a.toInfo(Some(a.localValue), None) } - } + private[spark] def accumulators(): Seq[NewAccumulator[_, _]] = internalAccums ++ externalAccums } -/** - * Internal subclass of [[TaskMetrics]] which is used only for posting events to listeners. - * Its purpose is to obviate the need for the driver to reconstruct the original accumulators, - * which might have been garbage-collected. See SPARK-13407 for more details. - * - * Instances of this class
[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/12736#discussion_r61374712 --- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java --- @@ -291,7 +291,7 @@ public void testSetOperation() { unioned.collectAsList()); Dataset subtracted = ds.except(ds2); -Assert.assertEquals(Arrays.asList("abc", "abc"), subtracted.collectAsList()); +Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList()); --- End diff -- Yeah. After this PR, the current behavior of `EXCEPT` is the same as the standard `EXCEPT DISTINCT`.
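The semantics being pinned down here are those of SQL's `EXCEPT DISTINCT`: results are de-duplicated, and NULLs compare equal to each other. On plain Scala collections, where `null == null` holds, the expected behavior can be modeled in a few lines (a model only, not how Catalyst executes `Except`):

```scala
// Model of EXCEPT DISTINCT: keep the distinct left rows that do not
// appear on the right. Seq.contains uses ==, for which null == null is
// true, matching the NULL-safe equality the DataFrame tests expect.
def exceptDistinct[T](left: Seq[T], right: Seq[T]): Seq[T] =
  left.distinct.filterNot(right.contains)
```

Running the model on the test data from the suite reproduces the expected answers, e.g. `Seq(1, 2, 2, 3, 3, 4)` except `Seq(1, 3)` yields `Seq(2, 4)`.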
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61374697 --- Diff: core/src/main/scala/org/apache/spark/NewAccumulator.scala --- @@ -0,0 +1,391 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import java.{lang => jl} +import java.io.ObjectInputStream +import java.util.concurrent.atomic.AtomicLong +import javax.annotation.concurrent.GuardedBy + +import org.apache.spark.scheduler.AccumulableInfo +import org.apache.spark.util.Utils + + +private[spark] case class AccumulatorMetadata( +id: Long, +name: Option[String], +countFailedValues: Boolean) extends Serializable + + +/** + * The base class for accumulators, that can accumulate inputs of type `IN`, and produce output of + * type `OUT`. 
+ */ +abstract class NewAccumulator[IN, OUT] extends Serializable { + private[spark] var metadata: AccumulatorMetadata = _ + private[this] var atDriverSide = true + + private[spark] def register( + sc: SparkContext, + name: Option[String] = None, + countFailedValues: Boolean = false): Unit = { +if (this.metadata != null) { + throw new IllegalStateException("Cannot register an Accumulator twice.") +} +this.metadata = AccumulatorMetadata(AccumulatorContext.newId(), name, countFailedValues) +AccumulatorContext.register(this) +sc.cleaner.foreach(_.registerAccumulatorForCleanup(this)) + } + + /** + * Returns true if this accumulator has been registered. Note that all accumulators must be + * registered before ues, or it will throw exception. + */ + final def isRegistered: Boolean = +metadata != null && AccumulatorContext.originals.containsKey(metadata.id) + + private def assertMetadataNotNull(): Unit = { +if (metadata == null) { + throw new IllegalAccessError("The metadata of this accumulator has not been assigned yet.") +} + } + + /** + * Returns the id of this accumulator, can only be called after registration. + */ + final def id: Long = { +assertMetadataNotNull() +metadata.id + } + + /** + * Returns the name of this accumulator, can only be called after registration. + */ + final def name: Option[String] = { +assertMetadataNotNull() +metadata.name + } + + /** + * Whether to accumulate values from failed tasks. This is set to true for system and time + * metrics like serialization time or bytes spilled, and false for things with absolute values + * like number of input rows. This should be used for internal metrics only. + */ + private[spark] final def countFailedValues: Boolean = { +assertMetadataNotNull() +metadata.countFailedValues + } + + /** + * Creates an [[AccumulableInfo]] representation of this [[NewAccumulator]] with the provided + * values. 
+ */ + private[spark] def toInfo(update: Option[Any], value: Option[Any]): AccumulableInfo = { +val isInternal = name.exists(_.startsWith(InternalAccumulator.METRICS_PREFIX)) +new AccumulableInfo(id, name, update, value, isInternal, countFailedValues) + } + + final private[spark] def isAtDriverSide: Boolean = atDriverSide + + /** + * Tells if this accumulator is zero value or not. e.g. for a counter accumulator, 0 is zero + * value; for a list accumulator, Nil is zero value. + */ + def isZero(): Boolean + + /** + * Creates a new copy of this accumulator, which is zero value. i.e. call `isZero` on the copy + * must return true. + */ + def copyAndReset(): NewAccumulator[IN, OUT] + + /** + * Takes the inputs and accumulates. e.g. it can be a simple `+=` for counter accumulator. + */ + def add(v: IN): Unit + + /** + * Merges another
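The accumulator lifecycle described in the diff (a zero value, `copyAndReset` for fresh per-task copies, `add` to accumulate inputs, plus a merge step to combine partial results) can be sketched with a toy class that mirrors the contract. This is a standalone illustration with assumed names, not Spark's `NewAccumulator`:

```scala
// Toy counter accumulator following the contract from the diff:
// isZero / copyAndReset / add / merge.
class SumAccumulator extends Serializable {
  private var sum: Long = 0L

  def isZero: Boolean = sum == 0L                 // 0 is the zero value for a counter
  def copyAndReset(): SumAccumulator =            // fresh copy; isZero must hold on it
    new SumAccumulator
  def add(v: Long): Unit = sum += v               // accumulate one input
  def merge(other: SumAccumulator): Unit =        // combine a partial result
    sum += other.value
  def value: Long = sum
}
```

In Spark's design the driver holds the original and each task works on a reset copy, so `merge` is where per-task partial sums are folded back in.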
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12612#issuecomment-215312526 **[Test build #57220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57220/consoleFull)** for PR 12612 at commit [`124568b`](https://github.com/apache/spark/commit/124568b3eeb7e0a657b2fbe4f54bb85543b7ffa3).
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12259#issuecomment-215311747 **[Test build #57219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57219/consoleFull)** for PR 12259 at commit [`1c230ae`](https://github.com/apache/spark/commit/1c230ae23e17905af9ea9655cbcfb5a948e627a9).
[GitHub] spark pull request: [SPARK-12235][SPARKR] Enhance mutate() to supp...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10220#discussion_r61374122 --- Diff: R/pkg/R/DataFrame.R --- @@ -1451,17 +1451,54 @@ setMethod("mutate", function(.data, ...) { x <- .data cols <- list(...) -stopifnot(length(cols) > 0) -stopifnot(class(cols[[1]]) == "Column") +if (length(cols) <= 0) { + return(x) +} + +lapply(cols, function(col) { + stopifnot(class(col) == "Column") +}) + +# Check if there is any duplicated column name in the DataFrame +dfCols <- columns(x) +if (length(unique(dfCols)) != length(dfCols)) { + stop("Error: found duplicated column name in the DataFrame") +} + +# TODO: simplify the implementation of this method after SPARK-12225 is resolved. + +# For named arguments, use the names for arguments as the column names +# For unnamed arguments, use the argument symbols as the column names +args <- sapply(substitute(list(...))[-1], deparse) --- End diff -- I did try `cols`, but the result was not correct. I have to use `list(...)`.
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/12259#issuecomment-215310697 LGTM pending Jenkins
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/12259#discussion_r61373781 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -117,6 +117,7 @@ class SQLContext private[sql]( * * @since 1.6.0 */ + --- End diff -- minor: remove empty line
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215310647 **[Test build #57218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57218/consoleFull)** for PR 12683 at commit [`6650890`](https://github.com/apache/spark/commit/665089051aef4dd4ac189eed329ee55d3e8df9e3).
[GitHub] spark pull request: [SPARK-14346][SQL] Add PARTITIONED BY and CLUS...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12734#issuecomment-215310529 @jodersky Oh sorry, I pasted the JIRA ticket summary into the PR title but forgot to add the tags. Updated!
[GitHub] spark pull request: [SPARK-14729][Scheduler] Refactored YARN sched...
Github user hbhanawat commented on the pull request: https://github.com/apache/spark/pull/12641#issuecomment-215310532 Hmm. @vanzin I think you have a point. There are a few things that could be done, but I'm not sure they would simplify this without reducing flexibility. I will think more on it and get back.
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/12259#discussion_r61373671 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/MLlibTestSparkContext.scala --- @@ -24,14 +24,18 @@ import org.scalatest.Suite import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.ml.util.TempDirectory -import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.{SQLContext, SQLImplicits} import org.apache.spark.util.Utils trait MLlibTestSparkContext extends TempDirectory { self: Suite => @transient var sc: SparkContext = _ @transient var sqlContext: SQLContext = _ @transient var checkpointDir: String = _ + protected object testImplicits extends SQLImplicits { --- End diff -- Removed it now.
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215310239 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57212/ Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61373605 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. 
+ * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. + * + * Note that all the null values will be imputed as well. --- End diff -- Yes. That's better.
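The `strategy` and `missingValue` parameters quoted above reduce to a simple recipe: compute a replacement statistic over the non-missing values, then substitute it wherever the placeholder occurs. A standalone sketch of mean imputation with a `Double.NaN` placeholder (plain collections, not the `Imputer` API; `imputeMean` is an assumed name):

```scala
// Mean imputation over a plain Seq[Double]: compute the mean of the
// non-missing values, then replace every NaN placeholder with it.
def imputeMean(values: Seq[Double]): Seq[Double] = {
  val present = values.filterNot(_.isNaN)
  require(present.nonEmpty, "cannot impute a column whose values are all missing")
  val mean = present.sum / present.length
  values.map(v => if (v.isNaN) mean else v)
}
```

The "median" strategy would swap the mean for an (approximate) median of `present`; everything else stays the same.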
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61373554 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. 
+ * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. --- End diff -- @sethah, I tried, but I'm afraid it creates more confusion than help. After all, the current behavior is not unexpected and works for most (if not all) users. I'd prefer to skip the detailed explanation in the API documentation.
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215310152 **[Test build #57212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57212/consoleFull)** for PR 12750 at commit [`4bbf429`](https://github.com/apache/spark/commit/4bbf4292802e475d84ec55994a4ebae3ddc2f4da). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215310236 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61373571

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the feature.
+   * If "median", then replace missing values using the approximate median value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " +
+    "If mean, then replace missing values using the mean value of the feature." +
+    "If median, then replace missing values using the median value of the feature.",
+    ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+    "The placeholder for the missing values. All occurrences of missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType = {
+    val inputType = schema($(inputCol)).dataType
+    SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType))
+    require(!schema.fieldNames.contains($(outputCol)),
+      s"Output column ${$(outputCol)} already exists.")
+    SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
--- End diff --

Cool. Thanks.
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12259#issuecomment-215310117 **[Test build #57217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57217/consoleFull)** for PR 12259 at commit [`9ed0f30`](https://github.com/apache/spark/commit/9ed0f304a81a30a4737ca5864fbe4960e7797145).
[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12604#issuecomment-215309644 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57216/ Test PASSed.
[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12604#issuecomment-215309599 **[Test build #57216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57216/consoleFull)** for PR 12604 at commit [`fa8a05c`](https://github.com/apache/spark/commit/fa8a05c6acd955670af64637caa9c1f38c5e9f09). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12604#issuecomment-215309643 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215308986 Looks good overall; I left my last inline comment. After that, it should be ready to go.
[GitHub] spark pull request: [SPARK-12660] [SPARK-14967] [SQL] Implement Ex...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12736#discussion_r61372903 --- Diff: sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java --- @@ -291,7 +291,7 @@ public void testSetOperation() { unioned.collectAsList()); Dataset subtracted = ds.except(ds2); -Assert.assertEquals(Arrays.asList("abc", "abc"), subtracted.collectAsList()); +Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList()); --- End diff -- Did this PR also fix the semantics of `Except`, or did it only add the optimization?
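The test change quoted in this diff reflects SQL's set-based EXCEPT semantics: the result contains each qualifying row of the left side at most once. A small Python sketch of that behaviour (illustrative only; the `except_distinct` name is an assumption, and rows are modeled as hashable values):

```python
def except_distinct(left, right):
    """SQL EXCEPT semantics: rows of `left` that do not appear in `right`,
    with duplicates removed and left-side order preserved."""
    right_set = set(right)
    seen = set()
    result = []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

This is why the expected result in the updated test is `["abc"]` rather than `["abc", "abc"]`: the duplicate left-side row is collapsed.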
[GitHub] spark pull request: [SPARK-14706][ML][PySpark] Python ML persisten...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12604#issuecomment-215308604 **[Test build #57216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57216/consoleFull)** for PR 12604 at commit [`fa8a05c`](https://github.com/apache/spark/commit/fa8a05c6acd955670af64637caa9c1f38c5e9f09).
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215308136 **[Test build #57215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57215/consoleFull)** for PR 12750 at commit [`5d34a64`](https://github.com/apache/spark/commit/5d34a646046727ffaf8f5932b2dfeae3a5c10d32).
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215308071 @GayathriMurali You should modify [```RWrappers.load```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L44) on the Scala side so that it supports loading the GLM wrapper. Thanks.
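The fix being requested here is a dispatch step: the generic loader reads the class name recorded in the saved metadata and routes to the matching wrapper's loader. A hypothetical sketch of that pattern in Python (the `load_wrapper` name, the `"class"` metadata key, and the loader table are all assumptions for illustration, not Spark's `RWrappers.load`):

```python
def load_wrapper(metadata, loaders):
    """Dispatch to a model-specific loader based on the class name
    stored in the saved metadata (illustrative pattern only)."""
    class_name = metadata["class"]
    if class_name not in loaders:
        raise ValueError("unsupported model class: %s" % class_name)
    return loaders[class_name](metadata)
```

Adding support for a new wrapper then amounts to registering one more entry in the loader table, which mirrors adding a case to the Scala match expression.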
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r61372578 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- @BenFradet Sorry for the late response. FYI: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L44
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12259#issuecomment-215307961 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12259#issuecomment-215307962 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57207/ Test PASSed.
[GitHub] spark pull request: [SPARK-14487][SQL] User Defined Type registrat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12259#issuecomment-215307877 **[Test build #57207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57207/consoleFull)** for PR 12259 at commit [`45e87d9`](https://github.com/apache/spark/commit/45e87d9a6d47350b97ce25758f58409b1b56623b). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class MatrixUDTSuite extends SparkFunSuite ` * `class VectorUDTSuite extends SparkFunSuite `
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215307044 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215307046 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57211/ Test FAILed.
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215307026 **[Test build #57211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57211/consoleFull)** for PR 12683 at commit [`55523f7`](https://github.com/apache/spark/commit/55523f7713615292c427c4000cee36b86c4fe7a2). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14961] Build HashedRelation larger than...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12740
[GitHub] spark pull request: [SPARK-14961] Build HashedRelation larger than...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/12740#issuecomment-215306302 When profiling the performance of BytesToBytesMap, it does make a difference.
[GitHub] spark pull request: [SPARK-14961] Build HashedRelation larger than...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/12740#issuecomment-215306341 Merging this into master, thanks!
[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9#issuecomment-215305404 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57210/ Test PASSed.
[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9#issuecomment-215305403 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61371630 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortBasedAggregateExec.scala --- @@ -46,7 +46,7 @@ case class SortBasedAggregateExec( AttributeSet(aggregateBufferAttributes) override private[sql] lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createLongMetric(sparkContext, "number of output rows")) +"numOutputRows" -> SQLMetrics.createSumMetric(sparkContext, "number of output rows")) --- End diff -- maybe just createMetric?
[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9#issuecomment-215305352 **[Test build #57210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57210/consoleFull)** for PR 9 at commit [`c40192b`](https://github.com/apache/spark/commit/c40192b0579080f4af572cf6d12bf37942c03866). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61371318 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortBasedAggregateExec.scala --- @@ -46,7 +46,7 @@ case class SortBasedAggregateExec( AttributeSet(aggregateBufferAttributes) override private[sql] lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createLongMetric(sparkContext, "number of output rows")) +"numOutputRows" -> SQLMetrics.createSumMetric(sparkContext, "number of output rows")) --- End diff -- `SQLMetric`s are all long accumulators, so I chose the name `sum` to describe the behaviour, not the type...
[GitHub] spark pull request: [SPARK-14935][CORE] DistributedSuite "local-cl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12744#issuecomment-215304506 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14935][CORE] DistributedSuite "local-cl...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12744#issuecomment-215304507 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57202/ Test PASSed.
[GitHub] spark pull request: [SPARK-14935][CORE] DistributedSuite "local-cl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12744#issuecomment-215304423 **[Test build #57202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57202/consoleFull)** for PR 12744 at commit [`0f14ad2`](https://github.com/apache/spark/commit/0f14ad2698bb9b2fbf582c44e3bb8f796b873750). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12612#issuecomment-215304306 This looks pretty good to me. We should get it to pass tests and then merge it asap. Some of the comments can be addressed later.
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215304302 **[Test build #57214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57214/consoleFull)** for PR 12724 at commit [`49b7b52`](https://github.com/apache/spark/commit/49b7b528a8928335e16bf6081f7dad1e819c77e2).
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61371102 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/SortBasedAggregateExec.scala --- @@ -46,7 +46,7 @@ case class SortBasedAggregateExec( AttributeSet(aggregateBufferAttributes) override private[sql] lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createLongMetric(sparkContext, "number of output rows")) +"numOutputRows" -> SQLMetrics.createSumMetric(sparkContext, "number of output rows")) --- End diff -- hm I think createLongMetric makes more sense than createSumMetric here ...
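The naming debate in this thread is about behaviour: each task accumulates a long value locally, and the driver combines per-task values by summation. A minimal sketch of that accumulator behaviour (illustrative only; the `SumMetric` class and its `add`/`merge` names are assumptions, not Spark's AccumulatorV2 or SQLMetric API):

```python
class SumMetric:
    """A long-valued metric whose merge behaviour is summation,
    in the spirit of the SQLMetric discussed above."""
    def __init__(self):
        self.value = 0

    def add(self, v):
        # called on the executor side as rows are processed
        self.value += v

    def merge(self, other):
        # called on the driver side to combine per-task metrics
        self.value += other.value

# two per-task metrics merged into a driver-side total
task1 = SumMetric(); task1.add(3)
task2 = SumMetric(); task2.add(4)
total = SumMetric()
total.merge(task1)
total.merge(task2)
```

Under this reading, "sum" names the merge behaviour while "long" names the value type, which is exactly the distinction the two reviewers are drawing.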
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215303991 test this please
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61371062 --- Diff: project/MimaExcludes.scala --- @@ -674,6 +674,19 @@ object MimaExcludes { ) ++ Seq( // [SPARK-4452][Core]Shuffle data structures can starve others on the same thread for memory ProblemFilters.exclude[IncompatibleTemplateDefProblem]("org.apache.spark.util.collection.Spillable") + ) ++ Seq( +// SPARK-14654: New accumulator API + ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulable.zero"), --- End diff -- do we have to remove this?
[GitHub] spark pull request: [SPARK-14970][SQL] Prevent DataSource from enu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12748#issuecomment-215303740 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14346][SQL] Show Create Table (Native)
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/12579#issuecomment-215303758 @yhuai @liancheng, I see PR [#12734](https://github.com/apache/spark/pull/12734) takes care of the PARTITIONED BY and CLUSTERED BY (with SORTED BY) clauses for the CTAS syntax, but not for the non-CTAS syntax. Now I need to change my PR to adapt to this change, which means the generated DDL will be something like `create table t1 (c1 int, ...) using .. options (..) partitioned by (..) clustered by (...) sorted by (...) in ... buckets`. But there won't be a select clause following it, since we do not have the original query, and such a generated query will not run because [#12734](https://github.com/apache/spark/pull/12734) does not support it. Can we add a fake select clause with a warning message? Also, the DataFrameWriter.saveAsTable case is like CTAS; can we then generate the DDL as regular CTAS syntax? This will change my current implementation in this PR. Please advise, thanks a lot!
[GitHub] spark pull request: [SPARK-14970][SQL] Prevent DataSource from enu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12748#issuecomment-215303742 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57205/ Test PASSed.
[GitHub] spark pull request: [SPARK-14970][SQL] Prevent DataSource from enu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12748#issuecomment-215303655 **[Test build #57205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57205/consoleFull)** for PR 12748 at commit [`4e4e8db`](https://github.com/apache/spark/commit/4e4e8dba8209513d105f0e195ca0b06a3eb6c70e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215303593 **[Test build #57213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57213/consoleFull)** for PR 12745 at commit [`06f83cc`](https://github.com/apache/spark/commit/06f83cc1de78aaf56942404fb24aca81d1b66d2e).
[GitHub] spark pull request: [SPARK-12919][SPARKR] Implement dapply() on Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12493#issuecomment-215303408 **[Test build #57201 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57201/consoleFull)** for PR 12493 at commit [`2264b57`](https://github.com/apache/spark/commit/2264b57a2d5f375eae6520b492a2152be259ccaa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12919][SPARKR] Implement dapply() on Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12493#issuecomment-215303502 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12919][SPARKR] Implement dapply() on Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12493#issuecomment-215303503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57201/ Test PASSed.
[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9#issuecomment-215303378 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/9#issuecomment-215303382 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57209/ Test PASSed.
[GitHub] spark pull request: [SPARK-10780][ML][WIP] Add initial model to km...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/9#issuecomment-215303315 **[Test build #57209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57209/consoleFull)** for PR 9 at commit [`914d319`](https://github.com/apache/spark/commit/914d31991e10dd0d77f03e167f19147ff696834c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12750#issuecomment-215303075 **[Test build #57212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57212/consoleFull)** for PR 12750 at commit [`4bbf429`](https://github.com/apache/spark/commit/4bbf4292802e475d84ec55994a4ebae3ddc2f4da).
[GitHub] spark pull request: [SPARK-14654][CORE] New accumulator API
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12612#discussion_r61370631
--- Diff: core/src/main/scala/org/apache/spark/NewAccumulator.scala ---
@@ -0,0 +1,356 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import java.{lang => jl}
+import java.io.ObjectInputStream
+import java.util.concurrent.atomic.AtomicLong
+import javax.annotation.concurrent.GuardedBy
+
+import org.apache.spark.scheduler.AccumulableInfo
+import org.apache.spark.util.Utils
+
+
+private[spark] case class AccumulatorMetadata(
+    id: Long,
+    name: Option[String],
+    countFailedValues: Boolean) extends Serializable
+
+
+/**
+ * The base class for accumulators, that can accumulate inputs of type `IN`, and produce output of
+ * type `OUT`. Implementations must define following methods:
+ *  - isZero: tell if this accumulator is zero value or not. e.g. for a counter accumulator,
+ *    0 is zero value; for a list accumulator, Nil is zero value.
+ *  - copyAndReset: create a new copy of this accumulator, which is zero value. i.e. call `isZero`
+ *    on the copy must return true.
+ *  - add: defines how to accumulate the inputs. e.g. it can be a simple `+=` for counter
+ *    accumulator
+ *  - merge: defines how to merge another accumulator of same type.
+ *  - localValue: defines how to produce the output by the current state of this accumulator.
+ *
+ * The implementations decide how to store intermediate values, e.g. a long field for a counter
+ * accumulator, a double and a long field for a average accumulator(storing the sum and count).
+ */
+abstract class NewAccumulator[IN, OUT] extends Serializable {
+  private[spark] var metadata: AccumulatorMetadata = _
+  private[this] var atDriverSide = true
+
+  private[spark] def register(
+      sc: SparkContext,
+      name: Option[String] = None,
+      countFailedValues: Boolean = false): Unit = {
+    if (this.metadata != null) {
+      throw new IllegalStateException("Cannot register an Accumulator twice.")
+    }
+    this.metadata = AccumulatorMetadata(AccumulatorContext.newId(), name, countFailedValues)
+    AccumulatorContext.register(this)
+    sc.cleaner.foreach(_.registerAccumulatorForCleanup(this))
+  }
+
+  final def isRegistered: Boolean =
+    metadata != null && AccumulatorContext.originals.containsKey(metadata.id)
+
+  private def assertMetadataNotNull(): Unit = {
+    if (metadata == null) {
+      throw new IllegalAccessError("The metadata of this accumulator has not been assigned yet.")
+    }
+  }
+
+  final def id: Long = {
+    assertMetadataNotNull()
+    metadata.id
+  }
+
+  final def name: Option[String] = {
+    assertMetadataNotNull()
+    metadata.name
+  }
+
+  final def countFailedValues: Boolean = {
+    assertMetadataNotNull()
+    metadata.countFailedValues
+  }
+
+  private[spark] def toInfo(update: Option[Any], value: Option[Any]): AccumulableInfo = {
+    val isInternal = name.exists(_.startsWith(InternalAccumulator.METRICS_PREFIX))
+    new AccumulableInfo(id, name, update, value, isInternal, countFailedValues)
+  }
+
+  final private[spark] def isAtDriverSide: Boolean = atDriverSide
+
+  def isZero(): Boolean
+
+  def copyAndReset(): NewAccumulator[IN, OUT]
+
+  def add(v: IN): Unit
+
+  def +=(v: IN): Unit = add(v)
+
+  def merge(other: NewAccumulator[IN, OUT]): Unit
--- End diff --
need to document that this merges in place
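To illustrate the contract the doc comment above describes, here is a minimal sketch of an accumulator implementation. It uses a simplified standalone trait in place of Spark's `NewAccumulator` (which additionally carries registration metadata and driver-side state); the names `SimpleAccumulator` and `CounterAccumulator` are illustrative, not part of Spark's API.

```scala
// Simplified stand-in for the NewAccumulator contract described above.
trait SimpleAccumulator[IN, OUT] extends Serializable {
  def isZero: Boolean                                   // is this accumulator at its zero value?
  def copyAndReset(): SimpleAccumulator[IN, OUT]        // fresh copy at the zero value
  def add(v: IN): Unit                                  // accumulate one input
  def merge(other: SimpleAccumulator[IN, OUT]): Unit    // merges `other` into this, in place
  def localValue: OUT                                   // produce the output from current state
}

// A counter accumulator: state is a single Long field, and 0 is the zero value.
class CounterAccumulator extends SimpleAccumulator[Long, Long] {
  private var count = 0L
  def isZero: Boolean = count == 0L
  def copyAndReset(): CounterAccumulator = new CounterAccumulator
  def add(v: Long): Unit = count += v
  def merge(other: SimpleAccumulator[Long, Long]): Unit = count += other.localValue
  def localValue: Long = count
}
```

As rxin's comment notes, the important subtlety is that `merge` mutates the receiver rather than returning a new accumulator.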
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user jodersky commented on a diff in the pull request: https://github.com/apache/spark/pull/12745#discussion_r61370492 --- Diff: core/src/main/scala/org/apache/spark/util/SignalUtils.scala --- @@ -94,7 +94,7 @@ private[spark] object SignalUtils extends Logging { // run all actions, escalate to parent handler if no action catches the signal // (i.e. all actions return false) - val escalate = actions.asScala.forall { action => !action() } + val escalate = actions.asScala.map(action => action()).forall(_ == false) --- End diff -- good idea
[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/12750 [SPARK-14972] Improve performance of JSON schema inference's compatibleType method This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 2x-4x speedup in local benchmarks running against cached data with a massive nested schema. The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass. This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collects and performs in-place sorting. Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/JoshRosen/spark schema-inference-speedups Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12750 commit 5406092f5c16189f1b6c46c1ed324f13dadd57b1 Author: Josh Rosen Date: 2016-04-28T02:12:29Z Stop using groupByKey to merge struct schemas. commit 5319a6abfb37cb43c500385b21e9d905e831d24d Author: Josh Rosen Date: 2016-04-28T02:38:21Z Take advantage of fact that structs' fields are already sorted by name. commit d9793f71cf5d9e2ff5bf7d62b561e9d6e377d1a1 Author: Josh Rosen Date: 2016-04-28T03:01:47Z Perform in-place sort without additional allocation. commit 4bbf4292802e475d84ec55994a4ebae3ddc2f4da Author: Josh Rosen Date: 2016-04-28T03:04:27Z Replace treeAggregate with fold.
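The key optimization in the PR description, merging two name-sorted field lists in a single pass instead of a `groupBy()`, can be sketched as follows. Fields are modeled as `(name, type)` string pairs rather than Spark's `StructField`, and `mergeType` is a placeholder for the recursive `compatibleType` call; both are assumptions for illustration, not Spark's implementation.

```scala
import scala.collection.mutable.ArrayBuffer

// Single-pass merge of two field lists that are both sorted by field name:
// fields with the same name are merged, unique fields are copied over.
def mergeSortedFields(
    left: Seq[(String, String)],
    right: Seq[(String, String)],
    mergeType: (String, String) => String): Seq[(String, String)] = {
  val merged = new ArrayBuffer[(String, String)](left.size + right.size)
  var i = 0
  var j = 0
  while (i < left.size && j < right.size) {
    val cmp = left(i)._1.compareTo(right(j)._1)
    if (cmp == 0) {        // same name in both structs: merge the two types
      merged += ((left(i)._1, mergeType(left(i)._2, right(j)._2)))
      i += 1; j += 1
    } else if (cmp < 0) {  // field unique to the left struct
      merged += left(i); i += 1
    } else {               // field unique to the right struct
      merged += right(j); j += 1
    }
  }
  // drain whichever side still has fields
  while (i < left.size) { merged += left(i); i += 1 }
  while (j < right.size) { merged += right(j); j += 1 }
  merged
}
```

Because both inputs are already sorted, this runs in O(n + m) and keeps the output sorted, which is what lets the recursion avoid re-sorting at every level.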
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215302424 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215302427 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57193/ Test FAILed.
[GitHub] spark pull request: [SPARK-14783] [SPARK-14786] [BRANCH-1.6] Prese...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12724#issuecomment-215302234 **[Test build #57193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57193/consoleFull)** for PR 12724 at commit [`49b7b52`](https://github.com/apache/spark/commit/49b7b528a8928335e16bf6081f7dad1e819c77e2). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14315][SparkR]Add model persistence to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12683#issuecomment-215302215 **[Test build #57211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57211/consoleFull)** for PR 12683 at commit [`55523f7`](https://github.com/apache/spark/commit/55523f7713615292c427c4000cee36b86c4fe7a2).
[GitHub] spark pull request: [SPARK-10001][Core] Don't short-circuit action...
Github user jodersky commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215302197 removed the label, sorry about that
[GitHub] spark pull request: [SPARK-10001][Core][Hotfix] Don't short-circui...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12745#issuecomment-215302102 **[Test build #2898 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2898/consoleFull)** for PR 12745 at commit [`7fe0e54`](https://github.com/apache/spark/commit/7fe0e54c5e39ff942fd58a16d5e5e309887ee883).
[GitHub] spark pull request: [SPARK-10001][Core][Hotfix] Don't short-circui...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/12745#discussion_r61370370 --- Diff: core/src/main/scala/org/apache/spark/util/SignalUtils.scala --- @@ -94,7 +94,7 @@ private[spark] object SignalUtils extends Logging { // run all actions, escalate to parent handler if no action catches the signal // (i.e. all actions return false) - val escalate = actions.asScala.forall { action => !action() } + val escalate = actions.asScala.map(action => action()).forall(_ == false) --- End diff -- we should probably leave a comment here saying why we are doing map and then forall, to make sure nobody would come in and "optimize" the map.forall into just a forall.
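The distinction rxin is guarding against can be demonstrated in isolation: `forall` short-circuits on the first element that fails the predicate, so `forall { action => !action() }` stops running signal actions as soon as one returns `true`, whereas mapping first forces every action to run before the escalation decision is made. This is a self-contained sketch, not the SignalUtils code itself:

```scala
// Two identical action lists; each action records that it ran, then returns
// true (caught) for the first and false (not caught) for the second.
var ranShort = 0
var ranAll = 0
val actionsShort = Seq(() => { ranShort += 1; true }, () => { ranShort += 1; false })
val actionsAll   = Seq(() => { ranAll   += 1; true }, () => { ranAll   += 1; false })

// forall { !action() } short-circuits: the second action never runs.
val escalateShort = actionsShort.forall { action => !action() }

// map-then-forall runs every action first, then computes the decision.
val escalateAll = actionsAll.map(action => action()).forall(_ == false)
```

Both forms compute the same `escalate` value; the difference is purely in which side effects (the actions themselves) get executed, which is exactly why the "optimization" back to a bare `forall` would be a bug.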