[GitHub] spark issue #19201: [SPARK-21979][SQL]Improve QueryPlanConstraints framework
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19201 **[Test build #81689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81689/testReport)** for PR 19201 at commit [`d456876`](https://github.com/apache/spark/commit/d45687653a35431e88d5fbe1ccbb5684fa2794cc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19208 **[Test build #81686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81686/testReport)** for PR 19208 at commit [`ae13440`](https://github.com/apache/spark/commit/ae13440fd2220e28b58df52836f55fe5ed77c43f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19208 Merged build finished. Test PASSed.
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81686/
[GitHub] spark issue #19202: [SPARK-21980][SQL]References in grouping functions shoul...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19202 **[Test build #81688 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81688/testReport)** for PR 19202 at commit [`b08fd93`](https://github.com/apache/spark/commit/b08fd9301cdbd4c1a29d5eb322eacd1cf2ffc546).
[GitHub] spark pull request #18281: [SPARK-21027][ML][PYTHON] Added tunable paralleli...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18281
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138409856 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -147,6 +147,11 @@ private void throwUnsupportedException(int requiredCapacity, Throwable cause) { public abstract void putShorts(int rowId, int count, short[] src, int srcIndex); /** + * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count]) --- End diff -- @ueshin The comment at line 145, `Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)`, may be mistaken. It should be `Sets values from [src + srcIndex, src + srcIndex + count) to [rowId, rowId + count)`. What do you think? If so, should we update it in this PR, or is it better to create another PR?
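To illustrate the direction kiszk suggests the Javadoc should state, here is a minimal on-heap sketch (a hypothetical stand-in, not Spark's actual `OnHeapColumnVector` code): `putShorts` copies `count` values from the source range `[srcIndex, srcIndex + count)` into the vector rows `[rowId, rowId + count)`.

```java
// Hypothetical sketch of putShorts semantics: source range -> destination rows.
public class ShortColumnSketch {
    private final short[] data;

    public ShortColumnSketch(int capacity) {
        this.data = new short[capacity];
    }

    // Copies src[srcIndex .. srcIndex + count) into rows [rowId .. rowId + count).
    public void putShorts(int rowId, int count, short[] src, int srcIndex) {
        System.arraycopy(src, srcIndex, data, rowId, count);
    }

    public short getShort(int rowId) {
        return data[rowId];
    }

    public static void main(String[] args) {
        ShortColumnSketch col = new ShortColumnSketch(8);
        short[] src = {10, 20, 30, 40};
        col.putShorts(2, 3, src, 1); // copies src[1..4) into rows [2..5)
        System.out.println(col.getShort(2) + "," + col.getShort(3) + "," + col.getShort(4));
    }
}
```

Reading it as "from the destination to the source", as the current comment does, reverses the actual data flow.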
[GitHub] spark pull request #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATT...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16422
[GitHub] spark pull request #19201: [SPARK-21979][SQL]Improve QueryPlanConstraints fr...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/19201#discussion_r138409695 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala --- @@ -106,91 +106,48 @@ trait QueryPlanConstraints { self: LogicalPlan => * Infers an additional set of constraints from a given set of equality constraints. * For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), this returns an * additional constraint of the form `b = 5`. - * - * [SPARK-17733] We explicitly prevent producing recursive constraints of the form `a = f(a, b)` - * as they are often useless and can lead to a non-converging set of constraints. */ private def inferAdditionalConstraints(constraints: Set[Expression]): Set[Expression] = { -val constraintClasses = generateEquivalentConstraintClasses(constraints) - +val aliasedConstraints = eliminateAliasedExpressionInConstraints(constraints) var inferredConstraints = Set.empty[Expression] -constraints.foreach { +aliasedConstraints.foreach { case eq @ EqualTo(l: Attribute, r: Attribute) => -val candidateConstraints = constraints - eq -inferredConstraints ++= candidateConstraints.map(_ transform { - case a: Attribute if a.semanticEquals(l) && -!isRecursiveDeduction(r, constraintClasses) => r -}) -inferredConstraints ++= candidateConstraints.map(_ transform { - case a: Attribute if a.semanticEquals(r) && -!isRecursiveDeduction(l, constraintClasses) => l -}) +val candidateConstraints = aliasedConstraints - eq +inferredConstraints ++= replaceConstraints(candidateConstraints, l, r) +inferredConstraints ++= replaceConstraints(candidateConstraints, r, l) case _ => // No inference } inferredConstraints -- constraints } /** - * Generate a sequence of expression sets from constraints, where each set stores an equivalence - * class of expressions. 
For example, Set(`a = b`, `b = c`, `e = f`) will generate the following - * expression sets: (Set(a, b, c), Set(e, f)). This will be used to search all expressions equal - * to an selected attribute. + * Replace the aliased expression in [[Alias]] with the alias name if both exist in constraints. + * Thus non-converging inference can be prevented. + * E.g. `a = f(a, b)`, `a = f(b, c) && c = g(a, b)`. + * Also, the size of constraints is reduced without losing any information. + * When the inferred filters are pushed down the operators that generate the alias, + * the alias names used in filters are replaced by the aliased expressions. */ - private def generateEquivalentConstraintClasses( - constraints: Set[Expression]): Seq[Set[Expression]] = { -var constraintClasses = Seq.empty[Set[Expression]] -constraints.foreach { - case eq @ EqualTo(l: Attribute, r: Attribute) => -// Transform [[Alias]] to its child. -val left = aliasMap.getOrElse(l, l) -val right = aliasMap.getOrElse(r, r) -// Get the expression set for an equivalence constraint class. -val leftConstraintClass = getConstraintClass(left, constraintClasses) -val rightConstraintClass = getConstraintClass(right, constraintClasses) -if (leftConstraintClass.nonEmpty && rightConstraintClass.nonEmpty) { - // Combine the two sets. - constraintClasses = constraintClasses -.diff(leftConstraintClass :: rightConstraintClass :: Nil) :+ -(leftConstraintClass ++ rightConstraintClass) -} else if (leftConstraintClass.nonEmpty) { // && rightConstraintClass.isEmpty - // Update equivalence class of `left` expression. - constraintClasses = constraintClasses -.diff(leftConstraintClass :: Nil) :+ (leftConstraintClass + right) -} else if (rightConstraintClass.nonEmpty) { // && leftConstraintClass.isEmpty - // Update equivalence class of `right` expression. 
- constraintClasses = constraintClasses -.diff(rightConstraintClass :: Nil) :+ (rightConstraintClass + left) -} else { // leftConstraintClass.isEmpty && rightConstraintClass.isEmpty - // Create new equivalence constraint class since neither expression presents - // in any classes. - constraintClasses = constraintClasses :+ Set(left, right) -} - case _ => // Skip + private def eliminateAliasedExpressionInConstraints(constraints: Set[Expression]) +: Set[Expression] = { +val attributes
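The inference described in the Javadoc above (from `a = 5` and `a = b`, derive `b = 5`) can be sketched with a toy model that uses strings in place of Catalyst expressions. This is an illustrative simplification, not Spark's implementation: for an equality `l = r`, every other constraint is rewritten with `l` replaced by `r` and vice versa, and only genuinely new constraints are kept. (The naive `replace` assumes single-character attribute names.)

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of equality-constraint inference over string constraints.
public class ConstraintInferenceSketch {
    // Given constraints and an equality l = r among them, returns the
    // additional constraints obtained by substituting l <-> r.
    static Set<String> inferAdditional(Set<String> constraints, String l, String r) {
        Set<String> inferred = new HashSet<>();
        for (String c : constraints) {
            if (c.equals(l + " = " + r)) continue; // skip the equality itself
            inferred.add(c.replace(l, r));
            inferred.add(c.replace(r, l));
        }
        inferred.removeAll(constraints); // keep only genuinely new constraints
        return inferred;
    }

    public static void main(String[] args) {
        Set<String> constraints = new HashSet<>(Set.of("a = 5", "a = b"));
        // From (a = 5, a = b) we derive the additional constraint b = 5:
        System.out.println(inferAdditional(constraints, "a", "b"));
    }
}
```

The recursive-deduction problem the PR discusses arises when a substitution reintroduces the attribute being replaced, e.g. rewriting inside `a = f(a, b)`, which is why the old code tracked equivalence classes and the new code eliminates aliased expressions up front.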
[GitHub] spark pull request #19110: [SPARK-21027][ML][PYTHON] Added tunable paralleli...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19110
[GitHub] spark issue #19202: [SPARK-21980][SQL]References in grouping functions shoul...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19202 ok to test
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138409265 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -147,6 +147,11 @@ private void throwUnsupportedException(int requiredCapacity, Throwable cause) { public abstract void putShorts(int rowId, int count, short[] src, int srcIndex); /** + * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count]) --- End diff -- I see.
[GitHub] spark issue #19110: [SPARK-21027][ML][PYTHON] Added tunable parallelism to o...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19110 LGTM. Merging with master. Thanks!
[GitHub] spark pull request #19202: [SPARK-21980][SQL]References in grouping function...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19202#discussion_r138408545 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -314,7 +314,7 @@ class Analyzer( s"grouping columns (${groupByExprs.mkString(",")})") } case e @ Grouping(col: Expression) => - val idx = groupByExprs.indexOf(col) + val idx = groupByExprs.indexWhere(x => resolver(x.toString, col.toString)) --- End diff -- `indexWhere(_.semanticEquals(col))`
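The gist of gatorsmile's suggestion is that `indexOf` relies on plain equality, which misses expressions that are only semantically equal, whereas an `indexWhere`-style search with a custom predicate finds them. A hypothetical Java analogue (here using case-insensitive string comparison as the stand-in for `semanticEquals`):

```java
import java.util.List;
import java.util.function.BiPredicate;

// Sketch: predicate-based index lookup vs equality-based indexOf.
public class IndexWhereSketch {
    // indexWhere analogue: first index whose element satisfies the predicate.
    static <T> int indexWhere(List<T> xs, T target, BiPredicate<T, T> eq) {
        for (int i = 0; i < xs.size(); i++) {
            if (eq.test(xs.get(i), target)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> groupByExprs = List.of("a", "B", "c");
        // Plain equality misses the semantically-equal (case-differing) match:
        System.out.println(groupByExprs.indexOf("b"));
        // A semanticEquals-like predicate finds it at index 1:
        System.out.println(indexWhere(groupByExprs, "b", String::equalsIgnoreCase));
    }
}
```

In Catalyst, `semanticEquals` additionally normalizes cosmetic differences such as expression IDs, which string comparison via `toString` does not reliably capture.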
[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19106 @sethah I haven't heard of anybody hitting this issue in practice, but it did seem best to ensure that valid probability distributions would be produced regardless of input. There was some discussion of this in the JIRA: https://issues.apache.org/jira/browse/SPARK-21770
[GitHub] spark pull request #19201: [SPARK-21979][SQL]Improve QueryPlanConstraints fr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19201#discussion_r138401827 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala --- + * Replace the aliased expression in [[Alias]] with the alias name if both exist in constraints. + * Thus non-converging inference can be prevented. + * E.g. `a = f(a, b)`, `a = f(b, c) && c = g(a, b)`. --- End diff -- This example doesn't even have an alias...
[GitHub] spark pull request #19201: [SPARK-21979][SQL]Improve QueryPlanConstraints fr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19201#discussion_r138384197 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala ---
[GitHub] spark issue #18253: [SPARK-18838][CORE] Introduce multiple queues in LiveLis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/18253 You commented on my code, not on the idea. My code was hacked together quickly; it can be cleaned up a lot. Your comments don't show that separating the refactoring of the listener bus hierarchy from the introduction of queues is impossible or undesirable.
[GitHub] spark issue #18253: [SPARK-18838][CORE] Introduce multiple queues in LiveLis...
Github user bOOm-X commented on the issue: https://github.com/apache/spark/pull/18253 @vanzin I pushed some comments on your code. I think that trying to keep the exact same class hierarchy leads to a very complex code, with many drawbacks.
[GitHub] spark pull request #19141: [SPARK-21384] [YARN] Spark + YARN fails with Loca...
Github user devaraj-kavali commented on a diff in the pull request: https://github.com/apache/spark/pull/19141#discussion_r138402807 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -565,7 +565,6 @@ private[spark] class Client( distribute(jarsArchive.toURI.getPath, resType = LocalResourceType.ARCHIVE, destName = Some(LOCALIZED_LIB_DIR)) - jarsArchive.delete() --- End diff -- Thanks @jerryshao for the comment. > What if your scenario and SPARK-20741's scenario are both encountered? Looks like your approach above cannot be worked. Can you provide some information on why you think it doesn't work? If we delete the spark_libs.zip after the application completes (similar to the staging dir deletion), it would not stack up until process exit, which solves SPARK-20741, and it also remains available during execution for the current issue. > I'm wondering if we can copy or move this spark_libs.zip temp file to another non-temp file and add that file to the dist cache. That non-temp file will not be deleted and can be overwritten during another launching, so we will always have only one copy. If there are multiple jobs submitted/running concurrently, we would overwrite the existing spark_libs.zip with the latest one, which may lead to app failures while the copy is in progress, and it would also be ambiguous which application should delete the file. > Besides, I think we have several workarounds to handle this issue like spark.yarn.jars or spark.yarn.archive, so looks like this corner case is not so necessary to fix (just my thinking, normally people will not use local FS in a real cluster). I agree, this is a corner case and can be handled with a workaround.
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19132 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81678/
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19132 Merged build finished. Test PASSed.
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19132 **[Test build #81678 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81678/testReport)** for PR 19132 at commit [`25fe22c`](https://github.com/apache/spark/commit/25fe22cddde276f846fd4808de1b575a87b1c059). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user pgandhi999 commented on the issue: https://github.com/apache/spark/pull/19207 The error logs for test build #81683 state that the method this(Long,Int,Int,Long,Long,Long,Long,Long,Long)Unit in class org.apache.spark.status.api.v1.ExecutorStageSummary does not have a correspondent in the current version. All I have done is add new fields to the API ExecutorStageSummary; I have not modified any existing ones. It should be fine, but please let me know if it is not.
[GitHub] spark issue #19195: [DOCS] Fix unreachable links in the document
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19195 Merged build finished. Test PASSed.
[GitHub] spark issue #19195: [DOCS] Fix unreachable links in the document
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19195 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81680/
[GitHub] spark issue #19195: [DOCS] Fix unreachable links in the document
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19195 **[Test build #81680 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81680/testReport)** for PR 19195 at commit [`bec41c8`](https://github.com/apache/spark/commit/bec41c8702b11654202b179769e291ac6bfa9894). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19106: [SPARK-21770][ML] ProbabilisticClassificationModel fix c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19106 **[Test build #81687 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81687/testReport)** for PR 19106 at commit [`53891ed`](https://github.com/apache/spark/commit/53891ed5c16daebce40e37cc9109b71299a33aca).
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squared_hin...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17862 Tested with several larger data sets using the hinge loss function, to compare the l-bfgs and owlqn solvers. Runs continued until convergence or until exceeding maxIter (2000).

dataset | numRecords | numFeatures | l-bfgs iterations | owlqn iterations | l-bfgs final loss | owlqn final loss
---|---|---|---|---|---|---
url_combined | 2396130 | 3231961 | 317 (952 sec) | 287 (1661 sec) | 9.71E-5 | 1.64E-4
kdda | 8407752 | 20216830 | 2000+ (29729 sec) | 288 (13664 sec) | 0.0068 | 0.0135
webspam | 35 | 254 | 344 (67 sec) | 1502 (714 sec) | 0.18273 | 0.18273
SUSY | 500 | 18 | 152 (145 sec) | 1242 (3357 sec) | 0.499 | 0.499

l-bfgs does not always take fewer iterations, but it converges to a smaller final loss. Each owlqn iteration takes 2 to 3 times longer than an l-bfgs iteration. Logistic Regression exhibits similar behavior.
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19208 **[Test build #81686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81686/testReport)** for PR 19208 at commit [`ae13440`](https://github.com/apache/spark/commit/ae13440fd2220e28b58df52836f55fe5ed77c43f).
[GitHub] spark issue #19201: [SPARK-21979][SQL]Improve QueryPlanConstraints framework
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19201 LGTM except a minor comment.
[GitHub] spark pull request #19201: [SPARK-21979][SQL]Improve QueryPlanConstraints fr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19201#discussion_r138393814 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala ---
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r138391134 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala --- @@ -150,20 +150,14 @@ private[ml] object ValidatorParams { }.toSeq )) -val validatorSpecificParams = instance match { - case cv: CrossValidatorParams => -List("numFolds" -> parse(cv.numFolds.jsonEncode(cv.getNumFolds))) - case tvs: TrainValidationSplitParams => -List("trainRatio" -> parse(tvs.trainRatio.jsonEncode(tvs.getTrainRatio))) - case _ => -// This should not happen. -throw new NotImplementedError("ValidatorParams.saveImpl does not handle type: " + - instance.getClass.getCanonicalName) -} - -val jsonParams = validatorSpecificParams ++ List( - "estimatorParamMaps" -> parse(estimatorParamMapsJson), - "seed" -> parse(instance.seed.jsonEncode(instance.getSeed))) +val params = instance.extractParamMap().toSeq +val skipParams = List("estimator", "evaluator", "estimatorParamMaps") +val jsonParams = render(params + .filter { case ParamPair(p, v) => !skipParams.contains(p.name)} + .map { case ParamPair(p, v) => +p.name -> parse(p.jsonEncode(v)) + }.toList ++ List("estimatorParamMaps" -> parse(estimatorParamMapsJson)) +) --- End diff -- Improved the code here so that we don't need to add code for each parameter. We now have 3 newly added parameters (parallelism, collectSubModels, persistSubModelPath), all added only in the CV/TVS estimators. The old code was prone to bugs if we forgot to update it when adding new params.
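The pattern WeichenXu123 describes, serializing the whole param map minus a small skip-list instead of hand-listing each param, can be sketched in plain Java (a hypothetical stand-in for the Scala code above, with a `Map` in place of Spark's `ParamMap`):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: save all params except a skip-list of specially-handled ones,
// so newly added params are picked up automatically with no code change.
public class SaveParamsSketch {
    static Map<String, Object> jsonParams(Map<String, Object> params, List<String> skip) {
        Map<String, Object> out = new LinkedHashMap<>();
        params.forEach((name, value) -> {
            if (!skip.contains(name)) out.put(name, value);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("estimator", "<pipeline>"); // handled specially, so skipped here
        params.put("numFolds", 3);
        params.put("parallelism", 4);          // newer param: serialized automatically
        List<String> skip = List.of("estimator", "evaluator", "estimatorParamMaps");
        System.out.println(jsonParams(params, skip));
    }
}
```

The design trade-off is that the skip-list must name every param needing custom serialization; anything not on the list is assumed to round-trip through its default JSON encoding.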
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r138393318 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -212,14 +238,12 @@ object CrossValidator extends MLReadable[CrossValidator] { val (metadata, estimator, evaluator, estimatorParamMaps) = ValidatorParams.loadImpl(path, sc, className) - val numFolds = (metadata.params \ "numFolds").extract[Int] - val seed = (metadata.params \ "seed").extract[Long] - new CrossValidator(metadata.uid) + val cv = new CrossValidator(metadata.uid) .setEstimator(estimator) .setEvaluator(evaluator) .setEstimatorParamMaps(estimatorParamMaps) -.setNumFolds(numFolds) -.setSeed(seed) + DefaultParamsReader.getAndSetParams(cv, metadata, skipParams = List("estimatorParamMaps")) --- End diff -- Use `getAndSetParams` instead of setting all params manually. This simplifies the code and keeps read/write compatibility. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19208#discussion_r138389265 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -261,17 +290,40 @@ class CrossValidatorModel private[ml] ( val copied = new CrossValidatorModel( uid, bestModel.copy(extra).asInstanceOf[Model[_]], - avgMetrics.clone()) + avgMetrics.clone(), + CrossValidatorModel.copySubModels(subModels)) copyValues(copied, extra).setParent(parent) } @Since("1.6.0") override def write: MLWriter = new CrossValidatorModel.CrossValidatorModelWriter(this) + + @Since("2.3.0") + @throws[IOException]("If the input path already exists but overwrite is not enabled.") + def save(path: String, persistSubModels: Boolean): Unit = { +write.asInstanceOf[CrossValidatorModel.CrossValidatorModelWriter] + .persistSubModels(persistSubModels).save(path) + } --- End diff -- I added this method because `CrossValidatorModelWriter` is private, so users cannot use it directly. I don't know whether there is a better solution, though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16422 Thanks! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16422 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19205: [SPARK-21982] Set locale to US
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19205 **[Test build #3918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3918/testReport)** for PR 19205 at commit [`22bbb92`](https://github.com/apache/spark/commit/22bbb924eae20b8d3f899008317f5d623c6a49ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19208 cc @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18313: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/18313 @hhbyyh My apologies — your PR is valuable (in the case where the model list is very big), but it has gone stale, so I integrated it into my new PR #19208. Would you mind taking a look? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19208 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19208 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81685/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19208 **[Test build #81685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81685/testReport)** for PR 19208 at commit [`46d3ab3`](https://github.com/apache/spark/commit/46d3ab3899c196311368b3383338b3d4e6d5aeaa). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16774: [SPARK-19357][ML] Adding parallel model evaluation in ML...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/16774 @BryanCutler @MLnick I found a bug in this PR: after saving an estimator (CV or TVS) and loading it again, the "parallelism" setting is lost. I fixed this in #19208 along the way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19208: [SPARK-21087] [ML] CrossValidator, TrainValidationSplit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19208 **[Test build #81685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81685/testReport)** for PR 19208 at commit [`46d3ab3`](https://github.com/apache/spark/commit/46d3ab3899c196311368b3383338b3d4e6d5aeaa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19208: [SPARK-21087] [ML] CrossValidator, TrainValidatio...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19208 [SPARK-21087] [ML] CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala ## What changes were proposed in this pull request? 1. We add a parameter controlling whether to collect the full model list when CrossValidator/TrainValidationSplit trains (default is off, to avoid the change causing OOM) - Add a method to CrossValidatorModel/TrainValidationSplitModel that allows users to get the model list - CrossValidatorModelWriter adds an "option" that allows users to control whether to persist the model list to disk - Note: when persisting the model list, use indices as the sub-model path 2. We add a parameter indicating whether to persist models to disk during training (default = off). - This will use ML persistence to dump models to a directory so they are available later but do not consume memory. - Note: when persisting the model list, use indices as the sub-model path ## How was this patch tested? Test cases added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark expose-model-list Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19208.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19208 commit 46d3ab3899c196311368b3383338b3d4e6d5aeaa Author: WeichenXu Date: 2017-09-11T13:28:53Z init pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
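The PR description says sub-models are persisted using "indices as the sub-model path". A hypothetical sketch of what such a layout could look like (the `subModels/fold<i>/<j>` naming here is illustrative, not the PR's actual scheme):

```java
import java.nio.file.Paths;

public class SubModelPaths {
    // Hypothetical layout: one directory per fold, one sub-directory per
    // estimator-param-map index, so each sub-model has a deterministic path.
    public static String subModelPath(String root, int foldIdx, int paramMapIdx) {
        return Paths.get(root, "subModels", "fold" + foldIdx,
                String.valueOf(paramMapIdx)).toString();
    }

    public static void main(String[] args) {
        // 3 folds x 2 param maps -> 6 sub-model directories
        for (int fold = 0; fold < 3; fold++) {
            for (int pm = 0; pm < 2; pm++) {
                System.out.println(subModelPath("/tmp/cvModel", fold, pm));
            }
        }
    }
}
```

Index-based paths avoid having to serialize any model identity into the directory name: the reader reconstructs the fold/param-map association purely from position.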
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19175 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19175 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81676/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15544 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19175: [SPARK-21964][SQL]Enable splitting the Aggregate (on Exp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19175 **[Test build #81676 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81676/testReport)** for PR 19175 at commit [`709c2d3`](https://github.com/apache/spark/commit/709c2d3d81e331d6f69d8ed7ecdabe035142d296). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15544 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81677/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15544 **[Test build #81677 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81677/testReport)** for PR 15544 at commit [`cd61382`](https://github.com/apache/spark/commit/cd61382aa7f5ef54059edead709da6b818267801). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19207 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19207 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81683/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19207 **[Test build #81683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81683/testReport)** for PR 19207 at commit [`20e04fa`](https://github.com/apache/spark/commit/20e04fa5e45556b7945203e332e6c4bb2f719e3a). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user goldmedal commented on the issue: https://github.com/apache/spark/pull/18875 @HyukjinKwon @viirya Sorry for updating this PR so late. Please take a look when you are available. Thanks :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81684/testReport)** for PR 18875 at commit [`bddf283`](https://github.com/apache/spark/commit/bddf2838868b2b676ae9eb3c595b53f56de07468). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19207 **[Test build #81683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81683/testReport)** for PR 19207 at commit [`20e04fa`](https://github.com/apache/spark/commit/20e04fa5e45556b7945203e332e6c4bb2f719e3a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18592 **[Test build #81682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81682/testReport)** for PR 18592 at commit [`d2d22d4`](https://github.com/apache/spark/commit/d2d22d4502b8d1bc3ff6c0af207a2b64bc1bb5f6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19203: [BUILD] Close stale PRs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19203 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81675/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19203: [BUILD] Close stale PRs
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19203 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19203: [BUILD] Close stale PRs
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19203 **[Test build #81675 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81675/testReport)** for PR 19203 at commit [`6386e0c`](https://github.com/apache/spark/commit/6386e0c6ef027d2858d0860c6f9dd472e8ede6aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19207 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81681/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19207 **[Test build #81681 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81681/testReport)** for PR 19207 at commit [`d95d69b`](https://github.com/apache/spark/commit/d95d69b110f27e409b9185d694cde13a472762c2). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19207 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19207 **[Test build #81681 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81681/testReport)** for PR 19207 at commit [`d95d69b`](https://github.com/apache/spark/commit/d95d69b110f27e409b9185d694cde13a472762c2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19181: [SPARK-21907][CORE] oom during spill
Github user eyalfa commented on a diff in the pull request: https://github.com/apache/spark/pull/19181#discussion_r138373142 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java --- @@ -170,6 +170,10 @@ public void free() { public void reset() { if (consumer != null) { consumer.freeArray(array); + array = LongArray.empty; --- End diff -- @hvanhovell, I'm starting to have second thoughts about the special `empty` instance here; I'm afraid that the nested call might trigger `freeArray` or something similar on it. Perhaps using null is a better option here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
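The concern above — that a nested call might end up "freeing" the shared `empty` instance — is the classic sentinel-vs-null trade-off. A small illustrative sketch (plain Java, not Spark's actual `LongArray`/consumer API): a shared sentinel keeps readers free of null checks, but every release path must recognize it, exactly as a null-accepting free would.

```java
public class SentinelVsNull {
    // Shared sentinel: must never be handed to a real deallocator.
    static final long[] EMPTY = new long[0];

    static long[] array = EMPTY;

    // The release routine must guard against the sentinel (and null),
    // otherwise a nested/repeated call could release memory that was
    // never allocated for this instance.
    static void freeArray(long[] a) {
        if (a == null || a == EMPTY) {
            return; // nothing to release
        }
        // ... release backing memory here ...
    }

    static void reset() {
        freeArray(array);
        array = EMPTY; // using null instead pushes a null check onto every reader
    }

    public static void main(String[] args) {
        reset();
        reset(); // safe: a double reset does not double-free the sentinel
        System.out.println(array.length); // 0
    }
}
```

Whichever choice is made, the invariant is the same: the "no data" state must be unmistakable to every code path that can free.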
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/19207 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19207: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19207 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19207: [SPARK-21809] : Change Stage Page to use datatabl...
GitHub user pgandhi999 opened a pull request: https://github.com/apache/spark/pull/19207 [SPARK-21809] : Change Stage Page to use datatables to support sorting columns and searching Support column sort and search for Stage Server using jQuery DataTable and REST API. Before this commit, the Stage page was generated as hard-coded HTML and could not support search; also, sorting was disabled if any application had more than one attempt. Supporting search and sort (over all applications rather than the 20 entries in the current page) in any case will greatly improve the user experience. Created the stagespage-template.html for displaying application information in datatables. Added a REST API endpoint and javascript code to fetch data from the endpoint and display it on the data table. ## How was this patch tested? I have attached the screenshots of the Stage Page UI before and after the fix. Before: https://user-images.githubusercontent.com/8190/30331985-d773d9ac-979e-11e7-8920-5d11fdf8766a.png After: https://user-images.githubusercontent.com/8190/30331998-dd22a1d0-979e-11e7-860e-3694e45cd782.png You can merge this pull request into a Git repository by running: $ git pull https://github.com/pgandhi999/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19207.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19207 commit 172fc20898896058b7288360eb5292ed9df9d79c Author: pgandhi Date: 2017-07-21T21:00:22Z [SPARK-21503]: Fixed the issue Added the case ExecutorLostFailure which was previously not there, thus, the default case would be executed in which case, task would be marked as completed. 
commit 81422e0f634c0f06eb2ea29fba4281176a1ab528 Author: pgandhi Date: 2017-07-25T14:54:41Z [SPARK-21503][UI]: Adding changes as per comments commit 55c6c37d09b41ae6914edb5d067e7f2c252ac92a Author: pgandhi999 Date: 2017-07-26T21:26:27Z Merge pull request #1 from apache/master Apache Spark Pull Request - July 26, 2017 commit f454c8933e07967548095e068063bd313ae4845c Author: pgandhi Date: 2017-07-26T21:41:16Z [SPARK-21541]: Spark Logs show incorrect job status for a job that does not create SparkContext Added a flag to check whether user has initialized Spark Context. If it is true, then we let Application Master unregister with Resource Manager else we do not. commit 6b7d5c6e2565c7c4dd97f31fe404c59e73c7474c Author: pgandhi Date: 2017-07-26T21:58:27Z Revert "[SPARK-21541]: Spark Logs show incorrect job status for a job that does not create SparkContext" This reverts commit f454c8933e07967548095e068063bd313ae4845c. "Merged another issue to this one by mistake" commit bc4166490d2ff68898c00fae4c1ca1b8abe1e795 Author: pgandhi999 Date: 2017-07-28T15:24:55Z Merge pull request #2 from apache/master Spark - July 28, 2017 commit e46126fe0f3d8d6f92f7f51c30d8c2154bddc126 Author: pgandhi Date: 2017-07-28T16:08:08Z [SPARK-21503]- Making Changes as per comments [SPARK-21503]- Making Changes as per comments: Removed match case statement and replaced it with an if clause. commit 9b3cebc6b65d2da835f02efaa27015cfd1b0ccae Author: pgandhi999 Date: 2017-08-01T13:58:12Z Merge pull request #4 from apache/master Spark - August 1, 2017 commit 7f03341093c843086920e8218463b5d2ba6e37d2 Author: pgandhi Date: 2017-08-01T15:52:13Z [SPARK-21503]: Reverting Unit Test Code [SPARK-21503]: Reverting Unit Test Code - Not needed. 
commit 2d01cab45ae269db9044815970dd008c851a46cc Author: pgandhi999 Date: 2017-08-24T21:59:52Z Merge pull request #5 from apache/master SPARK - August 24, 2017 commit eaf63e6bd4dddc726cf57fda080b9b5d6341e2f8 Author: pgandhi Date: 2017-08-24T22:03:29Z [SPARK-21798]: No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server Adding new env variable SPARK_DAEMON_CLASSPATH to set classpath for launching daemons. Tested and verified for History Server and Standalone Mode. commit e421a03acbd410a835cf3117fe6592523dc649b5 Author: pgandhi Date: 2017-08-25T16:13:47Z [SPARK-21798]: No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server Reverted the previous code change and added the environment variable SPARK_DAEMON_CLASSPATH only for launching daemon processes. commit 6dfb0b0a0ff571850835386664e43f4788d0f046 Author: Parth Gandhi Date: 2017-09-11T18:15:23Z Merge pull request #6 from apache/master Spark - September 11, 2017 commit d95d69b110f27e409b9185d694cde13a472762c2 Author: pgandhi Date: 2017-09-12T14:06:22Z [SPARK-21809] : Change Stage Page to use
[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r138366512 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -0,0 +1,95 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.sql.Strategy +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.planning.PhysicalOperation +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan} +import org.apache.spark.sql.execution.datasources.DataSourceStrategy +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.sources.v2.reader.downward.{CatalystFilterPushDownSupport, ColumnPruningSupport, FilterPushDownSupport} + +object DataSourceV2Strategy extends Strategy { + // TODO: write path + override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { +case PhysicalOperation(projects, filters, DataSourceV2Relation(output, reader)) => + val attrMap = AttributeMap(output.zip(output)) + + val projectSet = AttributeSet(projects.flatMap(_.references)) + val filterSet = AttributeSet(filters.flatMap(_.references)) + + // Match original case of attributes. + // TODO: nested fields pruning + val requiredColumns = (projectSet ++ filterSet).toSeq.map(attrMap) + reader match { +case r: ColumnPruningSupport => + r.pruneColumns(requiredColumns.toStructType) +case _ => + } + + val stayUpFilters: Seq[Expression] = reader match { +case r: CatalystFilterPushDownSupport => + r.pushCatalystFilters(filters.toArray) + +case r: FilterPushDownSupport => --- End diff -- By doing so, do we still need to match both `CatalystFilterPushDownSupport` and `FilterPushDownSupport` here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18659: [SPARK-21190][PYSPARK][WIP] Simple Python Vectorized UDF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/18659 (I am sorry, I didn't realise this PR was open already ..) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138364852 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala --- @@ -149,4 +153,23 @@ private[columnar] object ColumnAccessor { throw new Exception(s"not support type: $other") } } + + def decompress(columnAccessor: ColumnAccessor, columnVector: WritableColumnVector, numRows: Int): + Unit = { +if (columnAccessor.isInstanceOf[NativeColumnAccessor[_]]) { + val nativeAccessor = columnAccessor.asInstanceOf[NativeColumnAccessor[_]] + nativeAccessor.decompress(columnVector, numRows) +} else { + val dataBuffer = columnAccessor.asInstanceOf[BasicColumnAccessor[_]].getByteBuffer + val nullsBuffer = dataBuffer.duplicate().order(ByteOrder.nativeOrder()) + nullsBuffer.rewind() + + val numNulls = ByteBufferHelper.getInt(nullsBuffer) + for (i <- 0 until numNulls) { +val cordinal = ByteBufferHelper.getInt(nullsBuffer) --- End diff -- typo? `ordinal`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138363787 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -147,6 +147,11 @@ private void throwUnsupportedException(int requiredCapacity, Throwable cause) { public abstract void putShorts(int rowId, int count, short[] src, int srcIndex); /** + * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count]) --- End diff -- This description is a little vague, as the input data is `byte[]`. Can we say more about this? e.g. endianness. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138366156 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala --- @@ -61,6 +63,162 @@ private[columnar] case object PassThrough extends CompressionScheme { } override def hasNext: Boolean = buffer.hasRemaining + +override def decompress(columnVector: WritableColumnVector, capacity: Int): Unit = { + val nullsBuffer = buffer.duplicate().order(ByteOrder.nativeOrder()) + nullsBuffer.rewind() + val nullCount = ByteBufferHelper.getInt(nullsBuffer) + var nextNullIndex = if (nullCount > 0) ByteBufferHelper.getInt(nullsBuffer) else capacity + var pos = 0 + var seenNulls = 0 + val srcArray = buffer.array + var bufferPos = buffer.position + columnType.dataType match { +case _: BooleanType => + val unitSize = 1 + while (pos < capacity) { +if (pos != nextNullIndex) { + val len = nextNullIndex - pos + assert(len * unitSize < Int.MaxValue) + for (i <- 0 until len) { +val value = buffer.get(bufferPos + i) != 0 +columnVector.putBoolean(pos + i, value) + } + bufferPos += len + pos += len +} else { + seenNulls += 1 + nextNullIndex = if (seenNulls < nullCount) { +ByteBufferHelper.getInt(nullsBuffer) + } else { +capacity + } + columnVector.putNull(pos) + pos += 1 +} + } +case _: ByteType => --- End diff -- hmmm, is there any way to reduce the code duplication? maybe codegen? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
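The duplication flagged above comes from repeating the same run-copying loop once per primitive type. Besides codegen, one way to shrink it is to keep a single loop and parameterize only the per-type write; a minimal sketch (hypothetical shapes, not Spark's actual `WritableColumnVector` API):

```java
import java.nio.ByteBuffer;

public class GenericDecompress {
    // The per-type part of the decompress loop is reduced to a write
    // callback; the run-copying control flow is written once.
    interface ElementWriter {
        void write(ByteBuffer src, int destRow);
    }

    // Copy `len` elements starting at destination row `start`, reading
    // sequentially from `src` via the type-specific writer.
    static void copyRun(ByteBuffer src, int start, int len, ElementWriter writer) {
        for (int i = 0; i < len; i++) {
            writer.write(src, start + i);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.putInt(1).putInt(2).putInt(3).flip();
        int[] out = new int[3];
        // The lambda is the only int-specific code; byte/long/etc. would
        // reuse copyRun with a different writer.
        copyRun(buf, 0, 3, (b, row) -> out[row] = b.getInt());
        System.out.println(java.util.Arrays.toString(out)); // [1, 2, 3]
    }
}
```

The trade-off versus codegen or hand-duplicated loops is a virtual call per element, which is why hot columnar paths often accept the duplication.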
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138365192 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnAccessor.scala --- @@ -149,4 +153,23 @@ private[columnar] object ColumnAccessor { throw new Exception(s"not support type: $other") } } + + def decompress(columnAccessor: ColumnAccessor, columnVector: WritableColumnVector, numRows: Int): + Unit = { +if (columnAccessor.isInstanceOf[NativeColumnAccessor[_]]) { + val nativeAccessor = columnAccessor.asInstanceOf[NativeColumnAccessor[_]] + nativeAccessor.decompress(columnVector, numRows) +} else { + val dataBuffer = columnAccessor.asInstanceOf[BasicColumnAccessor[_]].getByteBuffer + val nullsBuffer = dataBuffer.duplicate().order(ByteOrder.nativeOrder()) + nullsBuffer.rewind() + + val numNulls = ByteBufferHelper.getInt(nullsBuffer) + for (i <- 0 until numNulls) { +val cordinal = ByteBufferHelper.getInt(nullsBuffer) +columnVector.putNull(cordinal) + } + throw new RuntimeException("Not support non-primitive type now") --- End diff -- If we need to throw exception at last, why not do it at the beginning? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18704: [SPARK-20783][SQL] Create ColumnVector to abstrac...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18704#discussion_r138363222

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.columnar;
+
+import org.apache.spark.sql.execution.vectorized.Dictionary;
+
+public final class ColumnDictionary implements Dictionary {
+  private Object[] dictionary;
+
+  public ColumnDictionary(Object[] dictionary) {
+    this.dictionary = dictionary;
+  }
+
+  @Override
+  public int decodeToInt(int id) {
+    return (Integer) dictionary[id];
--- End diff --

is it possible to avoid boxing here? e.g. we can have a lot of primitive array members.
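On the boxing question: `(Integer) dictionary[id]` unboxes on every lookup because the values live in an `Object[]`. The usual fix is one primitive array per supported type, selected by constructor overload. A hypothetical sketch (names illustrative, not necessarily the shape the PR ended up with):

```java
public class PrimitiveDictionary {
  // At most one of these is non-null; the read path is a plain array load, no boxing.
  private final int[] intDictionary;
  private final long[] longDictionary;

  public PrimitiveDictionary(int[] dict) {
    this.intDictionary = dict;
    this.longDictionary = null;
  }

  public PrimitiveDictionary(long[] dict) {
    this.intDictionary = null;
    this.longDictionary = dict;
  }

  public int decodeToInt(int id) {
    return intDictionary[id]; // no Integer unboxing
  }

  public long decodeToLong(int id) {
    return longDictionary[id]; // no Long unboxing
  }
}
```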
[GitHub] spark issue #19134: [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19134 Merged build finished. Test PASSed.
[GitHub] spark issue #19134: [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19134 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81674/ Test PASSed.
[GitHub] spark issue #19134: [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19134 **[Test build #81674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81674/testReport)** for PR 19134 at commit [`d888f7b`](https://github.com/apache/spark/commit/d888f7b4b457d537c6875de31cbd77f5460c7d3b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19195: [DOCS] Fix unreachable links in the document
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19195
[GitHub] spark issue #19195: [DOCS] Fix unreachable links in the document
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19195 Merged to master/2.2. Now that I look again, I realize the last tests technically didn't pass. As it's a doc-only change that passed before, I can't see how it would fail, but I will keep an eye out. Oops.
[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r138357726

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/upward/StatisticsSupport.java ---
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.upward;
+
+/**
+ * A mix in interface for `DataSourceV2Reader`. Users can implement this interface to report
+ * statistics to Spark.
+ */
+public interface StatisticsSupport {
+  Statistics getStatistics();
--- End diff --

It should, but we need some refactor on optimizer, see https://github.com/apache/spark/pull/19136#discussion_r137023744
[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r138357442

--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/downward/CatalystFilterPushDownSupport.java ---
@@ -0,0 +1,36 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.reader.downward;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.catalyst.expressions.Expression;
+
+/**
+ * A mix-in interface for `DataSourceV2Reader`. Users can implement this interface to push down
+ * arbitrary expressions as predicates to the data source.
+ */
+@Experimental
+@InterfaceStability.Unstable
+public interface CatalystFilterPushDownSupport {
+
+  /**
+   * Push down filters, returns unsupported filters.
+   */
+  Expression[] pushCatalystFilters(Expression[] filters);
--- End diff --

java list is not friendly to scala implementations :)
[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...
Github user jmchung commented on the issue: https://github.com/apache/spark/pull/19199 Thanks @HyukjinKwon and @viirya :)
[GitHub] spark pull request #19136: [SPARK-15689][SQL] data source v2
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19136#discussion_r138355462

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.spark.sql.Strategy
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}
+import org.apache.spark.sql.execution.datasources.DataSourceStrategy
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.sources.v2.reader.downward.{CatalystFilterPushDownSupport, ColumnPruningSupport, FilterPushDownSupport}
+
+object DataSourceV2Strategy extends Strategy {
+  // TODO: write path
+  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
+    case PhysicalOperation(projects, filters, DataSourceV2Relation(output, reader)) =>
+      val attrMap = AttributeMap(output.zip(output))
+
+      val projectSet = AttributeSet(projects.flatMap(_.references))
+      val filterSet = AttributeSet(filters.flatMap(_.references))
+
+      // Match original case of attributes.
+      // TODO: nested fields pruning
+      val requiredColumns = (projectSet ++ filterSet).toSeq.map(attrMap)
+      reader match {
+        case r: ColumnPruningSupport =>
+          r.pruneColumns(requiredColumns.toStructType)
+        case _ =>
+      }
+
+      val stayUpFilters: Seq[Expression] = reader match {
+        case r: CatalystFilterPushDownSupport =>
+          r.pushCatalystFilters(filters.toArray)
+
+        case r: FilterPushDownSupport =>
--- End diff --

good idea!
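The `reader match { case r: ... }` chains in this strategy are the mix-in capability pattern: the planner probes which optional interfaces the reader implements and prefers the more expressive one. Stripped to its essentials (interface and method names here are illustrative, not the actual data source v2 API):

```java
public class PushDownDispatch {
  interface Reader {}

  // Richer capability: accepts arbitrary expressions.
  interface ExprPushDown extends Reader {
    String pushExpressions(String filters);
  }

  // Simpler capability: accepts only the public Filter subset.
  interface SimplePushDown extends Reader {
    String pushFilters(String filters);
  }

  // Returns the filters the source could NOT handle (they stay in a Filter node).
  static String push(Reader reader, String filters) {
    if (reader instanceof ExprPushDown) {            // prefer the richer interface
      return ((ExprPushDown) reader).pushExpressions(filters);
    } else if (reader instanceof SimplePushDown) {   // fall back to the simple one
      return ((SimplePushDown) reader).pushFilters(filters);
    }
    return filters;                                  // no pushdown: keep everything
  }
}
```

The ordering matters: a reader implementing both interfaces is only asked through the expression-based one, mirroring how the Scala `match` above tries `CatalystFilterPushDownSupport` before `FilterPushDownSupport`.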
[GitHub] spark pull request #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19199
[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19199 Merged to master.
[GitHub] spark issue #19195: [DOCS] Fix unreachable links in the document
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19195 **[Test build #81680 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81680/testReport)** for PR 19195 at commit [`bec41c8`](https://github.com/apache/spark/commit/bec41c8702b11654202b179769e291ac6bfa9894).
[GitHub] spark issue #19206: Client and ApplicationMaster resolvePath is inappropriat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19206 Can one of the admins verify this patch?
[GitHub] spark issue #19195: [DOCS] Fix unreachable links in the document
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19195 LGTM
[GitHub] spark issue #19199: [SPARK-21610][SQL][FOLLOWUP] Corrupt records are not han...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19199 LGTM too
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19181 Merged build finished. Test PASSed.
[GitHub] spark issue #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18592 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81679/ Test FAILed.
[GitHub] spark issue #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18592 **[Test build #81679 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81679/testReport)** for PR 18592 at commit [`06e306f`](https://github.com/apache/spark/commit/06e306fdb4199a8c7850a6a370ce67aeac0cdf8e). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class TPCDSQueryBenchmarkArguments(val args: Array[String]) `
[GitHub] spark pull request #19206: Client and ApplicationMaster resolvePath is inapp...
GitHub user Chaos-Ju opened a pull request: https://github.com/apache/spark/pull/19206

Client and ApplicationMaster resolvePath is inappropriate when use viewfs

## What changes were proposed in this pull request?

When HDFS uses viewfs and Spark constructs the Executor's and ApplicationMaster's localResource map (the list of localized files), it cannot convert the viewfs:// path to the real hdfs:// path. Therefore, when the NodeManager downloads the local resource, it throws java.io.IOException: ViewFs: Cannot initialize: Empty Mount table in config for viewfs://clusterName/

Exception stack:

java.io.IOException: ViewFs: Cannot initialize: Empty Mount table in config for viewfs://ns-view/
    at org.apache.hadoop.fs.viewfs.InodeTree.<init>(InodeTree.java:337)
    at org.apache.hadoop.fs.viewfs.ViewFileSystem$1.<init>(ViewFileSystem.java:167)
    at org.apache.hadoop.fs.viewfs.ViewFileSystem.initialize(ViewFileSystem.java:167)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1700)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Failing this attempt. Failing the application.

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Chaos-Ju/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19206.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19206

commit f1fff009d32b8f7d1d2b24734e4d677c6264ec90
Author: Chaos-Ju
Date: 2017-09-12T12:45:36Z

    fix spark support viewfs
[GitHub] spark issue #18592: [SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18592 Merged build finished. Test FAILed.
[GitHub] spark issue #19190: [SPARK-21976][DOC] Fix wrong documentation for Mean Abso...
Github user FavioVazquez commented on the issue: https://github.com/apache/spark/pull/19190 Thanks to Carlos Munguia, Jared Romero and Christhian Flores :). @montactuaria @jared275 @chris122flores
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19181 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81672/ Test PASSed.
[GitHub] spark issue #19181: [SPARK-21907][CORE] oom during spill
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19181 **[Test build #81672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81672/testReport)** for PR 19181 at commit [`ae7fbc4`](https://github.com/apache/spark/commit/ae7fbc48b349f5608aaef9f66e9e692354b72d18). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.