[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15868 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15868

**[Test build #68641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68641/consoleFull)** for PR 15868 at commit [`3378b5e`](https://github.com/apache/spark/commit/3378b5e040041f1af1159d07e3d3b1ef47c6c8c1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15857#discussion_r87932854

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -428,43 +428,47 @@ object FoldablePropagation extends Rule[LogicalPlan] {
 }
 case _ => Nil
 })
+val replaceFoldable: PartialFunction[Expression, Expression] = {
+ case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+}
 if (foldableMap.isEmpty) {
 plan
 } else {
 var stop = false
 CleanupAliases(plan.transformUp {
-case u: Union =>
- stop = true
- u
-case c: Command =>
- stop = true
- c
-// For outer join, although its output attributes are derived from its children, they are
-// actually different attributes: the output of outer join is not always picked from its
-// children, but can also be null.
+// Allow all leafnodes
--- End diff --

ah i see
[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15857#discussion_r87932789

--- Diff: sql/core/src/test/resources/sql-tests/results/group-by.sql.out ---
@@ -131,3 +131,11 @@ FROM testData
 struct
 -- !query 13 output
 -0.2723801058145729 -1.5069204152249134 1 3 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
+
+
+-- !query 14
+SELECT COUNT(DISTINCT b), COUNT(DISTINCT b, c) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a
--- End diff --

Is it also a regression test? I think you are just fixing `Expand` in this PR.
[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15857#discussion_r87932636

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FoldablePropagationSuite.scala ---
@@ -118,14 +118,30 @@ class FoldablePropagationSuite extends PlanTest {
 Seq(
 testRelation.select(Literal(1).as('x), 'a).select('x + 'a),
 testRelation.select(Literal(2).as('x), 'a).select('x + 'a)))
- .select('x)
 val optimized = Optimize.execute(query.analyze)
 val correctAnswer = Union(
 Seq(
 testRelation.select(Literal(1).as('x), 'a).select((Literal(1).as('x) + 'a).as("(x + a)")),
 testRelation.select(Literal(2).as('x), 'a).select((Literal(2).as('x) + 'a).as("(x + a)"
- .select('x).analyze
--- End diff --

how can this test pass before...
[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15857#discussion_r87932525

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -428,43 +428,47 @@ object FoldablePropagation extends Rule[LogicalPlan] {
 }
 case _ => Nil
 })
+val replaceFoldable: PartialFunction[Expression, Expression] = {
+ case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+}
 if (foldableMap.isEmpty) {
 plan
 } else {
 var stop = false
 CleanupAliases(plan.transformUp {
-case u: Union =>
- stop = true
- u
-case c: Command =>
- stop = true
- c
-// For outer join, although its output attributes are derived from its children, they are
-// actually different attributes: the output of outer join is not always picked from its
-// children, but can also be null.
+// Allow all leafnodes
--- End diff --

LeafNodes should not stop the folding process. That is what I am trying to do.
[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15857#discussion_r87932494

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -428,43 +428,47 @@ object FoldablePropagation extends Rule[LogicalPlan] {
 }
 case _ => Nil
 })
+val replaceFoldable: PartialFunction[Expression, Expression] = {
+ case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+}
 if (foldableMap.isEmpty) {
 plan
 } else {
 var stop = false
 CleanupAliases(plan.transformUp {
-case u: Union =>
- stop = true
- u
-case c: Command =>
- stop = true
- c
-// For outer join, although its output attributes are derived from its children, they are
-// actually different attributes: the output of outer join is not always picked from its
-// children, but can also be null.
+// Allow all leafnodes
+case l: LeafNode =>
+ l
+
+// Whitelist of all nodes we are allowed to apply this rule to.
+case p @ (_: Project | _: Filter | _: SubqueryAlias | _: Aggregate | _: Window |
+ _: Sample | _: GlobalLimit | _: LocalLimit | _: Generate | _: Distinct |
+ _: AppendColumns | _: AppendColumnsWithObject | _: BroadcastHint |
+ _: RedistributeData | _: Repartition | _: Sort | _: TypedFilter) if !stop =>
+ p.transformExpressions(replaceFoldable)
+
+// Allow inner joins. We do not allow outer join, although its output attributes are
+// derived from its children, they are actually different attributes: the output of outer
+// join is not always picked from its children, but can also be null.
 // TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes
 // of outer join.
-case j @ Join(_, _, LeftOuter | RightOuter | FullOuter, _) =>
+case j @ Join(_, _, Inner, _) =>
+ j.transformExpressions(replaceFoldable)
+
+// We can fold the projections an expand holds. However expand changes the output columns
+// and often reuses the underlying attributes; so we cannot assume that a column is still
+// foldable after the expand has been applied.
+case expand: Expand if !stop =>
--- End diff --

Should we add a TODO that `Expand` should always output new attributes?
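For readers following the `replaceFoldable` discussion, here is a minimal sketch of the substitution step on a toy expression tree. It is not Catalyst code: `Expr`, `Attr`, `Lit`, `Add`, and `FoldableSketch` are hypothetical stand-ins invented for illustration, and the real rule additionally has to decide which plan nodes it may be applied to (the whitelist in the diff above).

```scala
// Toy expression tree; hypothetical stand-ins for Catalyst's Expression classes.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr

object FoldableSketch {
  // Rewrite children first, then apply `rule` to the node if it is defined there.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
      case leaf      => leaf
    }
    if (rule.isDefinedAt(withNewChildren)) rule(withNewChildren) else withNewChildren
  }

  // Replace attributes that are known to alias a foldable (constant) expression.
  def replaceFoldable(foldableMap: Map[Attr, Expr]): PartialFunction[Expr, Expr] = {
    case a: Attr if foldableMap.contains(a) => foldableMap(a)
  }
}
```

With `x` known to alias the literal `1`, `FoldableSketch.transform(Add(Attr("x"), Attr("a")))(FoldableSketch.replaceFoldable(Map(Attr("x") -> Lit(1))))` yields `Add(Lit(1), Attr("a"))`. The substitution is only sound while the attribute still carries the constant, which is the concern raised for outer joins (the value can become null) and for `Expand` (attributes are reused with different contents).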
[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15279 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68646/ Test FAILed.
[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15279

**[Test build #68646 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68646/consoleFull)** for PR 15279 at commit [`c566a5b`](https://github.com/apache/spark/commit/c566a5bfe72aa9be10d9b3f90ea18ec0d0382f93).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15279 Merged build finished. Test FAILed.
[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15279 **[Test build #68646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68646/consoleFull)** for PR 15279 at commit [`c566a5b`](https://github.com/apache/spark/commit/c566a5bfe72aa9be10d9b3f90ea18ec0d0382f93).
[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15279 ok to test
[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15279 Can you please change the title to have: "SPARK-12347" -> "[SPARK-12347]"?
[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15880 +1 on the postgres approach
[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15880 OK, we need to make a decision here: either follow Hive and give a warning message, or follow Postgres and cast the string to the type of the other side. Personally I prefer the Postgres way; I think it's always better than blindly casting both sides to double. cc @rxin @marmbrust
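A small sketch may help show what is at stake when a long column is compared with a string. It uses plain Scala conversions rather than Spark's type coercion rules, and `CompareSketch` with its two methods is a hypothetical name made up for this illustration. Returning `None` for an unparsable string is only one possible choice for the sketch, not a statement about what Postgres or Spark does.

```scala
import scala.util.Try

object CompareSketch {
  // Cast both sides to Double before comparing; longs above 2^53 lose precision.
  def equalViaDouble(l: Long, s: String): Boolean = l.toDouble == s.toDouble

  // Cast the string to the type of the other side (Long); None if it is not a valid long.
  def equalViaLong(l: Long, s: String): Option[Boolean] =
    Try(s.trim.toLong).toOption.map(_ == l)
}
```

For example, `CompareSketch.equalViaDouble(9007199254740993L, "9007199254740992")` is `true` because both values round to the same double, while `CompareSketch.equalViaLong` on the same inputs gives `Some(false)`.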
[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15880 **[Test build #68645 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68645/consoleFull)** for PR 15880 at commit [`1506d40`](https://github.com/apache/spark/commit/1506d406b5596a557a5c86f16b180239850901ad).
[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15877#discussion_r87929682 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala --- @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import java.io.{ByteArrayInputStream, ByteArrayOutputStream} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess} +import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription} +import org.apache.spark.sql.catalyst.util.GenericArrayData +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.util.sketch.CountMinSketch + +/** + * This function returns a count-min sketch of a column with the given esp, confidence and seed. + * A count-min sketch is a probabilistic data structure used for summarizing streams of data in + * sub-linear space, which is useful for equality predicates and join size estimation. 
+ *
+ * @param child child expression that can produce column value with `child.eval(inputRow)`
+ * @param epsExpression relative error, must be positive
+ * @param confidenceExpression confidence, must be positive and less than 1.0
+ * @param seedExpression random seed
+ */
+@ExpressionDescription(
+ usage = """
+_FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp,
+ confidence and seed. The result is an array of bytes, which should be deserialized to a
+ `CountMinSketch` before usage. `CountMinSketch` is useful for equality predicates and join
+ size estimation.
+ """)
+case class CountMinSketchAgg(
+child: Expression,
+epsExpression: Expression,
+confidenceExpression: Expression,
+seedExpression: Expression,
+override val mutableAggBufferOffset: Int,
+override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[CountMinSketch] {
+
+ def this(
+ child: Expression,
+ epsExpression: Expression,
+ confidenceExpression: Expression,
+ seedExpression: Expression) = {
+this(child, epsExpression, confidenceExpression, seedExpression, 0, 0)
+ }
+
+ override def checkInputDataTypes(): TypeCheckResult = {
+val defaultCheck = super.checkInputDataTypes()
--- End diff --

That is fair.
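The doc comment in the diff describes a count-min sketch parameterised by eps, confidence and seed. The following is a minimal, self-contained sketch of that data structure, not the `org.apache.spark.util.sketch.CountMinSketch` implementation. `TinyCountMinSketch` is a hypothetical name, items are restricted to strings, and the width/depth sizing uses the textbook formulas, which is an assumption and may differ from what Spark does.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

// Minimal count-min sketch for strings; sizing follows the textbook formulas (an assumption).
final class TinyCountMinSketch(eps: Double, confidence: Double, seed: Int) {
  require(eps > 0.0, "eps must be positive")
  require(confidence > 0.0 && confidence < 1.0, "confidence must be in (0, 1)")

  val width: Int = math.ceil(math.E / eps).toInt
  val depth: Int = math.ceil(math.log(1.0 / (1.0 - confidence))).toInt

  // One hash seed per row, derived deterministically from the user-supplied seed.
  private val rowSeeds: Array[Int] = {
    val rng = new Random(seed)
    Array.fill(depth)(rng.nextInt())
  }
  private val table: Array[Array[Long]] = Array.fill(depth, width)(0L)

  private def bucket(row: Int, item: String): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, rowSeeds(row)), width)

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(row, item)) += count

  // Never underestimates: every row's counter includes the item's true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(row, item))).min
}
```

Because collisions can only add to a counter, taking the minimum across rows gives an estimate that is at least the true count, which is what makes the structure usable for the equality-predicate and join-size estimates mentioned in the doc comment.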
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906133

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -31,13 +31,9 @@ import org.apache.spark.sql.types.StructType
 /**
 * :: Experimental ::
 *
- * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
- * a perfect hash function:
- *`h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_prime^*`
- *
- * Reference:
- * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
+ * Model produced by [[MinHashLSH]], where multiple hash functions are stored. Each hash function is
+ * a perfect hash function for a specific set `S` with cardinality equal to a half of `numEntries`:
--- End diff --

I'm not following exactly why the cardinality of `S` is _half_ of `numEntries`. Actually, why is the threshold for feature dimensionality `prime / 2`?
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906309

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
 override val uid: String,
-@Since("2.1.0") val numEntries: Int,
-@Since("2.1.0") val randCoefficients: Array[Int])
+@Since("2.1.0") private[ml] val numEntries: Int,
+@Since("2.1.0") private[ml] val randCoefficients: Array[Int])
 extends LSHModel[MinHashModel] {
 @Since("2.1.0")
- override protected[ml] val hashFunction: Vector => Vector = {
-elems: Vector =>
+ override protected[ml] val hashFunction: Vector => Array[Vector] = {
+elems: Vector => {
 require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
 val elemsList = elems.toSparse.indices.toList
 val hashValues = randCoefficients.map({ randCoefficient: Int =>
- elemsList.map({elem: Int =>
-(1 + elem) * randCoefficient.toLong % MinHash.prime % numEntries
- }).min.toDouble
+elemsList.map({ elem: Int =>
--- End diff --

Redundant brackets. Just use `elemsList.map { elem: Int =>`
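To make the hash computation in this diff easier to follow, here is a standalone sketch of the same arithmetic outside Spark, using the brace style suggested in the comment. `MinHashSketch` is a hypothetical object, the input is a plain sequence of non-zero indices instead of an ML `Vector`, and `Prime` is an assumed value (the Mersenne prime 2^31 - 1) because the value of `MinHash.prime` is not shown in the thread.

```scala
object MinHashSketch {
  // Assumed prime for illustration only; the actual MinHash.prime is not shown in the diff.
  val Prime: Int = 2147483647

  // One hash value per coefficient: the minimum of the per-index hashes.
  def hash(indices: Seq[Int], randCoefficients: Seq[Int], numEntries: Int): Seq[Long] = {
    require(indices.nonEmpty, "Must have at least 1 non zero entry.")
    randCoefficients.map { randCoefficient =>
      indices.map { elem =>
        (1 + elem) * randCoefficient.toLong % Prime % numEntries
      }.min
    }
  }
}
```

For the set with non-zero indices 2, 3 and 5, coefficients 3 and 7, and `numEntries = 20`, `MinHashSketch.hash(Seq(2, 3, 5), Seq(3, 7), 20)` returns `Seq(9, 1)`: the per-index hashes are 9, 12, 18 for the first coefficient and 1, 8, 2 for the second, and the minimum of each group is kept.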
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844941

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +123,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
 * transformed data when necessary.
 *
 * This method implements two ways of fetching k nearest neighbors:
- * - Single Probing: Fast, return at most k elements (Probing only one buckets)
- * - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+ * - Single-probe: Fast, return at most k elements (Probing only one buckets)
--- End diff --

"Probing only one bucket"
[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/15877#discussion_r87929562 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala --- @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import java.io.{ByteArrayInputStream, ByteArrayOutputStream} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess} +import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription} +import org.apache.spark.sql.catalyst.util.GenericArrayData +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.util.sketch.CountMinSketch + +/** + * This function returns a count-min sketch of a column with the given esp, confidence and seed. + * A count-min sketch is a probabilistic data structure used for summarizing streams of data in + * sub-linear space, which is useful for equality predicates and join size estimation. 
+ *
+ * @param child child expression that can produce column value with `child.eval(inputRow)`
+ * @param epsExpression relative error, must be positive
+ * @param confidenceExpression confidence, must be positive and less than 1.0
+ * @param seedExpression random seed
+ */
+@ExpressionDescription(
+ usage = """
+_FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp,
+ confidence and seed. The result is an array of bytes, which should be deserialized to a
+ `CountMinSketch` before usage. `CountMinSketch` is useful for equality predicates and join
+ size estimation.
+ """)
+case class CountMinSketchAgg(
+child: Expression,
+epsExpression: Expression,
+confidenceExpression: Expression,
+seedExpression: Expression,
+override val mutableAggBufferOffset: Int,
+override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[CountMinSketch] {
+
+ def this(
+ child: Expression,
+ epsExpression: Expression,
+ confidenceExpression: Expression,
+ seedExpression: Expression) = {
+this(child, epsExpression, confidenceExpression, seedExpression, 0, 0)
+ }
+
+ override def checkInputDataTypes(): TypeCheckResult = {
+val defaultCheck = super.checkInputDataTypes()
+if (defaultCheck.isFailure) {
+ defaultCheck
+} else if (!epsExpression.foldable || !confidenceExpression.foldable ||
+ !seedExpression.foldable) {
+ TypeCheckFailure(
+"The eps, confidence or seed provided must be a literal or constant foldable")
+} else if (epsExpression.eval() == null || confidenceExpression.eval() == null ||
+ seedExpression.eval() == null) {
+ TypeCheckFailure("The eps, confidence or seed provided should not be null")
+} else {
+ // parameter validity will be checked in CountMinSketchImpl
+ TypeCheckSuccess
+}
+ }
+
+ override def createAggregationBuffer(): CountMinSketch = {
+val eps: Double = epsExpression.eval().asInstanceOf[Double]
--- End diff --

OK, I'll change them to lazy vals.
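The reply above agrees to turn the evaluated parameters into lazy vals. A minimal sketch of what that buys, presumably the motivation here, is shown below: the constant expression is evaluated once on first access and cached, instead of on every call that needs the value. `ConstExpr` and `SketchParams` are hypothetical stand-ins invented for this illustration, not Catalyst classes.

```scala
// Hypothetical stand-in for a foldable expression; counts how often eval() runs.
final class ConstExpr(value: Any) {
  var evalCount: Int = 0
  def eval(): Any = { evalCount += 1; value }
}

final class SketchParams(epsExpression: ConstExpr, confidenceExpression: ConstExpr) {
  // Evaluated on first access and cached, so later uses do not call eval() again.
  lazy val eps: Double = epsExpression.eval().asInstanceOf[Double]
  lazy val confidence: Double = confidenceExpression.eval().asInstanceOf[Double]
}
```

Constructing `SketchParams` does not evaluate anything; reading `eps` any number of times triggers exactly one `eval()` call on the underlying expression.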
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87906709

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
 override val uid: String,
-@Since("2.1.0") val numEntries: Int,
-@Since("2.1.0") val randCoefficients: Array[Int])
+@Since("2.1.0") private[ml] val numEntries: Int,
--- End diff --

No `@Since` tags for private values.
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87874869

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -66,10 +66,10 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
 self: T =>
 /**
- * The hash function of LSH, mapping a predefined KeyType to a Vector
+ * The hash function of LSH, mapping an input feature to multiple vectors
--- End diff --

"mapping an input feature vector to multiple hash vectors."
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87878252

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
 */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

Also, the comment above says:

 * ... For example,
 *`Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
 * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
 * Also, any input vector must have at least 1 non-zero indices, and all non-zero values are treated
 * as binary "1" values.

Can we change it to:

 * ... For example,
 *`Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))`
 * means there are 10 elements in the space. This set contains non-zero values at indices 2, 3, and
 * 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are
 * treated as binary "1" values.
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87908012 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -125,11 +125,11 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed { @Since("2.1.0") override protected[ml] def createRawLSHModel(inputDim: Int): MinHashModel = { -require(inputDim <= MinHash.prime / 2, - s"The input vector dimension $inputDim exceeds the threshold ${MinHash.prime / 2}.") +require(inputDim <= MinHashLSH.prime / 2, + s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.prime / 2}.") val rand = new Random($(seed)) val numEntry = inputDim * 2 -val randCoofs: Array[Int] = Array.fill($(outputDim))(1 + rand.nextInt(MinHash.prime - 1)) +val randCoofs: Array[Int] = Array.fill($(numHashTables))(1 + rand.nextInt(MinHashLSH.prime - 1)) --- End diff -- `randCoefs`
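For reference, the coefficient generation in the diff above can be sketched outside Spark as follows. This is only an illustration under assumptions: the object name `RandCoefsSketch` and the `seed: Long` parameter are made up here, while the prime value and the `1 + rand.nextInt(prime - 1)` expression are copied from the diff. It uses the corrected spelling `randCoefs`.

```scala
import scala.util.Random

object RandCoefsSketch {
  // Value copied from the diff under review.
  val prime: Int = 2038074743

  // One coefficient per hash table. nextInt(prime - 1) yields a value in
  // [0, prime - 2], so adding 1 keeps every coefficient in [1, prime - 1].
  def randCoefs(numHashTables: Int, seed: Long): Array[Int] = {
    require(numHashTables > 0, "numHashTables must be positive")
    val rand = new Random(seed)
    Array.fill(numHashTables)(1 + rand.nextInt(prime - 1))
  }

  def main(args: Array[String]): Unit = {
    println(randCoefs(5, 12345L).mkString(", "))
  }
}
```

Because the generator is seeded, two calls with the same seed and size return the same coefficients, which is what makes a fitted model reproducible.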
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87922281 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]] inputName: String, explodeCols: Seq[String]): Dataset[_] = { require(explodeCols.size == 2, "explodeCols must be two strings.") -val vectorToMap = udf((x: Vector) => x.asBreeze.iterator.toMap, - MapType(DataTypes.IntegerType, DataTypes.DoubleType)) val modelDataset: DataFrame = if (!dataset.columns.contains($(outputCol))) { transform(dataset) } else { dataset.toDF() } modelDataset.select( - struct(col("*")).as(inputName), - explode(vectorToMap(col($(outputCol.as(explodeCols)) + struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols)) --- End diff -- Well here's a fun one. When I run this test: scala test("memory leak test") { val numDim = 50 val data = { for (i <- 0 until numDim; j <- Seq(-2, -1, 1, 2)) yield Vectors.sparse(numDim, Seq((i, j.toDouble))) } val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") // Project from 100 dimensional Euclidean Space to 10 dimensions val brp = new BucketedRandomProjectionLSH() .setNumHashTables(10) .setInputCol("keys") .setOutputCol("values") .setBucketLength(2.5) .setSeed(12345) val model = brp.fit(df) val joined = model.approxSimilarityJoin(df, df, Double.MaxValue, "distCol") joined.show() } I get the following error: [info] - BucketedRandomProjectionLSH with high dimension data: test of LSH property *** FAILED *** (7 seconds, 568 milliseconds) [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 205, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 33816576 bytes, TID = 205 [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:295) [info] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [info] at java.lang.Thread.run(Thread.java:745) Could you run the same test and see if you get an error?
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87904353 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala --- @@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.Dataset -class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { +class MinHashLSHSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { --- End diff -- Looking at the code for LSH, I see a few requires on input to some of the public methods, but there aren't tests for these edge cases. Specifically we should add **MinHash** * tests for empty vectors (or all zero vectors) * tests for `inputDim > prime / 2` **LSH** * Test for `numNearestNeighbors < 0`
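A rough, Spark-free sketch of what such edge-case tests could look like is below. Everything in `LshInputChecks` is hypothetical: only the `inputDim <= prime / 2` check and its message come from the diff in this thread, while the other two checks, their messages, and the exact bound on `numNearestNeighbors` are assumptions made for illustration. In the real suite these would presumably be written with the test framework's own exception assertions rather than the small `rejects` helper used here.

```scala
import scala.util.Try

object LshInputChecks {
  // Value copied from the diff under review.
  val prime: Int = 2038074743

  // Mirrors the `require` shown in createRawLSHModel.
  def checkInputDim(inputDim: Int): Unit =
    require(inputDim <= prime / 2,
      s"The input vector dimension $inputDim exceeds the threshold ${prime / 2}.")

  // Hypothetical check: a MinHash input needs at least one non-zero index.
  def checkNonEmpty(nonZeroIndices: Seq[Int]): Unit =
    require(nonZeroIndices.nonEmpty, "Input must have at least 1 non-zero index.")

  // Hypothetical check: the exact bound (here: strictly positive) is an assumption.
  def checkNumNearestNeighbors(k: Int): Unit =
    require(k > 0, s"numNearestNeighbors must be positive, got $k.")

  // True when `body` throws IllegalArgumentException, which is what `require` raises.
  def rejects(body: => Unit): Boolean =
    Try(body).failed.toOption.exists(_.isInstanceOf[IllegalArgumentException])

  def main(args: Array[String]): Unit = {
    assert(rejects(checkNonEmpty(Seq.empty)))          // empty / all-zero vector
    assert(rejects(checkInputDim(prime / 2 + 1)))      // inputDim > prime / 2
    assert(rejects(checkNumNearestNeighbors(-1)))      // numNearestNeighbors < 0
    assert(!rejects(checkInputDim(prime / 2)))         // boundary value is accepted
    println("all edge cases rejected as expected")
  }
}
```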
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875688 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType @Since("2.1.0") class MinHashModel private[ml] ( --- End diff -- Not specifically related to this pr: I checked and the default random uids used in ML library never contain spaces. For more complex uids, it seems more common to use camel case, but I do see some with hyphens. Can we make the default uids: `"mh-lsh"` and `"brp-lsh"` or similar?
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87928721 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala --- @@ -89,23 +90,25 @@ class RandomProjectionModel private[ml] ( } @Since("2.1.0") - override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { + override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = { // Since it's generated by hashing, it will be a pair of dense vectors. -x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min +x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min } @Since("2.1.0") override def copy(extra: ParamMap): this.type = defaultCopy(extra) @Since("2.1.0") - override def write: MLWriter = new RandomProjectionModel.RandomProjectionModelWriter(this) + override def write: MLWriter = { +new BucketedRandomProjectionModel.BucketedRandomProjectionModelWriter(this) + } } /** * :: Experimental :: * - * This [[RandomProjection]] implements Locality Sensitive Hashing functions for Euclidean - * distance metrics. + * This [[BucketedRandomProjectionLSH]] implements Locality Sensitive Hashing functions for + * Euclidean distance metrics. * * The input is dense or sparse vectors, each of which represents a point in the Euclidean * distance space. The output will be vectors of configurable dimension. Hash value in the same --- End diff -- "Hash values in the same dimension are calculated"
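To make the new `hashDistance` in the diff above easier to follow, here is a small stand-alone sketch of the same computation: the minimum, over hash tables, of the squared distance between the two hash vectors for that table. It is an approximation for discussion only. Plain `Array[Double]` stands in for Spark's `Vector`, and the local `sqdist` is a hand-written substitute for `Vectors.sqdist`; the object name is hypothetical.

```scala
object HashDistanceSketch {
  // Squared Euclidean distance between two equal-length arrays
  // (a stand-in for Vectors.sqdist in the diff).
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum
  }

  // Each element of x and y is the hash vector produced by one hash table.
  // The distance is the smallest per-table squared distance.
  def hashDistance(x: Seq[Array[Double]], y: Seq[Array[Double]]): Double = {
    require(x.nonEmpty && x.length == y.length, "need the same non-zero number of hash tables")
    x.zip(y).map { case (hx, hy) => sqdist(hx, hy) }.min
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(Array(1.0), Array(4.0))
    val y = Seq(Array(3.0), Array(4.0))
    // The second table agrees exactly, so the minimum is 0.0.
    println(hashDistance(x, y))
  }
}
```

Taking the minimum means a single matching table is enough to report distance zero, which is the OR-amplification behavior this PR's parameter naming refers to.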
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87876322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -102,8 +103,7 @@ class MinHashModel private[ml] ( */ @Experimental @Since("2.1.0") -class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed { - +class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed { --- End diff -- change the model names to reflect the new estimator names.
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87871105 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -106,22 +106,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]] * transformed data when necessary. * * This method implements two ways of fetching k nearest neighbors: - * - Single Probing: Fast, return at most k elements (Probing only one buckets) - * - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key) + * - Single-probe: Fast, return at most k elements (Probing only one buckets) + * - Multi-probe: Slow, return exact k elements (Probing multiple buckets close to the key) + * + * Currently it is made private since more discussion is needed for Multi-probe --- End diff -- I don't understand the point here. Are you trying to make the `approxNearestNeighbors` method completely private? There is still a public overload of this method - which now shows up as the only method in the docs and just says "overloaded method for approxNearestNeighbors". This doc above does not show up. As a general rule, we should always generate and closely inspect the docs to make sure that they are what we intend and that they make sense from an end user's perspective.
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87874663 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -35,26 +35,26 @@ private[ml] trait LSHParams extends HasInputCol with HasOutputCol { /** * Param for the dimension of LSH OR-amplification. * - * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The - * higher the dimension is, the lower the false negative rate. + * LSH OR-amplification can be used to reduce the false negative rate. The higher the dimension --- End diff -- We are still using the word "dimension" here. It might also be useful to add that reducing false negatives comes at the cost of added computation. How does this sound? * Param for the number of hash tables used in LSH OR-amplification. * * LSH OR-amplification can be used to reduce the false negative rate. Higher values for this * param lead to a reduced false negative rate, at the expense of added computational complexity.
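The trade-off described in the proposed doc can be illustrated with a short calculation. The sketch below assumes the hash tables are independent and that a pair of points collides in any single table with probability `p`; under that assumption the pair becomes a candidate with probability `1 - (1 - p)^numHashTables`, so the false negative rate `(1 - p)^numHashTables` shrinks as tables are added while the hashing work grows linearly. The object and method names are hypothetical.

```scala
object OrAmplificationSketch {
  // Probability that two points share a bucket in at least one of
  // `numHashTables` independent tables, given per-table collision probability p.
  def candidateProbability(p: Double, numHashTables: Int): Double = {
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    require(numHashTables > 0, "numHashTables must be positive")
    1.0 - math.pow(1.0 - p, numHashTables)
  }

  def main(args: Array[String]): Unit = {
    // Print how the false negative rate falls as more tables are used.
    for (n <- Seq(1, 2, 5, 10)) {
      val hit = candidateProbability(0.3, n)
      println(f"numHashTables=$n%2d  P(candidate)=$hit%.4f  falseNegative=${1.0 - hit}%.4f")
    }
  }
}
```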
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87910679 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml] ( } @Since("2.1.0") - override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { + override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = { // Since it's generated by hashing, it will be a pair of dense vectors. -x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min +// TODO: This hashDistance function is controversial. Requires more discussion. +x.zip(y).map(vectorPair => --- End diff -- At this point, I'm quite unsure, but this does not look to me like what was discussed [here](https://github.com/apache/spark/pull/15800#event-857283655). @jkbradley Can you confirm this is what you wanted?
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87875995 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -74,9 +72,12 @@ class MinHashModel private[ml] ( } @Since("2.1.0") - override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { + override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = { // Since it's generated by hashing, it will be a pair of dense vectors. -x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min +// TODO: This hashDistance function is controversial. Requires more discussion. --- End diff -- This is likely to confuse future developers. Let's just link it to a JIRA and note that it may be changed.
[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15874#discussion_r87844308 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala --- @@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed { } @Since("2.1.0") -object MinHash extends DefaultParamsReadable[MinHash] { +object MinHashLSH extends DefaultParamsReadable[MinHashLSH] { // A large prime smaller than sqrt(2^63 − 1) private[ml] val prime = 2038074743 --- End diff -- We typically use all caps for constants like these. I prefer `MinHashLSH.HASH_PRIME` or `MinHashLSH.PRIME_MODULUS`.
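As a side note on the code comment in the diff ("A large prime smaller than sqrt(2^63 − 1)"), the sketch below shows one reading of why that bound matters: the product of two residues below the prime then fits in a signed 64-bit `Long`, so a modular multiply needs no wider type. This is an interpretation, not something stated in the PR. The constant uses the suggested `HASH_PRIME` name, and `mulMod` is a hypothetical helper.

```scala
object HashPrimeSketch {
  // Value copied from the diff; name follows the all-caps suggestion.
  val HASH_PRIME: Int = 2038074743

  // Multiplies two residues modulo HASH_PRIME. Both inputs are below the
  // prime, and the prime is below sqrt(2^63 - 1), so the Long product
  // cannot overflow.
  def mulMod(a: Int, b: Int): Int = {
    require(a >= 0 && a < HASH_PRIME && b >= 0 && b < HASH_PRIME, "inputs must be residues")
    ((a.toLong * b.toLong) % HASH_PRIME).toInt
  }

  def main(args: Array[String]): Unit = {
    // Largest possible product still fits in a Long.
    assert(BigInt(HASH_PRIME).pow(2) < BigInt(Long.MaxValue))
    // (p - 1) * (p - 1) is congruent to 1 modulo p.
    println(mulMod(HASH_PRIME - 1, HASH_PRIME - 1))
  }
}
```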
[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15852 **[Test build #68644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68644/consoleFull)** for PR 15852 at commit [`24e3617`](https://github.com/apache/spark/commit/24e36177e1eb24e7b250cb5356b47c0507e96d68).
[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/15877#discussion_r87928329 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala --- @@ -0,0 +1,131 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import java.io.{ByteArrayInputStream, ByteArrayOutputStream} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult +import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess} +import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription} +import org.apache.spark.sql.catalyst.util.GenericArrayData +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.util.sketch.CountMinSketch + +/** + * This function returns a count-min sketch of a column with the given esp, confidence and seed. + * A count-min sketch is a probabilistic data structure used for summarizing streams of data in + * sub-linear space, which is useful for equality predicates and join size estimation. 
+ * + * @param child child expression that can produce column value with `child.eval(inputRow)` + * @param epsExpression relative error, must be positive + * @param confidenceExpression confidence, must be positive and less than 1.0 + * @param seedExpression random seed + */ +@ExpressionDescription( + usage = """ +_FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp, + confidence and seed. The result is an array of bytes, which should be deserialized to a + `CountMinSketch` before usage. `CountMinSketch` is useful for equality predicates and join + size estimation. + """) +case class CountMinSketchAgg( +child: Expression, +epsExpression: Expression, +confidenceExpression: Expression, +seedExpression: Expression, +override val mutableAggBufferOffset: Int, +override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[CountMinSketch] { + + def this( + child: Expression, + epsExpression: Expression, + confidenceExpression: Expression, + seedExpression: Expression) = { +this(child, epsExpression, confidenceExpression, seedExpression, 0, 0) + } + + override def checkInputDataTypes(): TypeCheckResult = { +val defaultCheck = super.checkInputDataTypes() +if (defaultCheck.isFailure) { + defaultCheck +} else if (!epsExpression.foldable || !confidenceExpression.foldable || + !seedExpression.foldable) { + TypeCheckFailure( +"The eps, confidence or seed provided must be a literal or constant foldable") +} else if (epsExpression.eval() == null || confidenceExpression.eval() == null || + seedExpression.eval() == null) { + TypeCheckFailure("The eps, confidence or seed provided should not be null") +} else { + // parameter validity will be checked in CountMinSketchImpl + TypeCheckSuccess +} + } + + override def createAggregationBuffer(): CountMinSketch = { +val eps: Double = epsExpression.eval().asInstanceOf[Double] +val confidence: Double = confidenceExpression.eval().asInstanceOf[Double] +val seed: Int = 
seedExpression.eval().asInstanceOf[Int] +CountMinSketch.create(eps, confidence, seed) + } + + override def update(buffer: CountMinSketch, input: InternalRow): Unit = { +val value = child.eval(input) +// ignore empty rows +if (value != null) { + // UTF8String is a spark sql type, while CountMinSketch accepts String type + buffer.add(if (value.isInstanceOf[UTF8String]) value.toString else value) +} + } + + override
[GitHub] spark issue #15885: [SPARK-18440][Structured Streaming] Pass correct query e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15885 **[Test build #68643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68643/consoleFull)** for PR 15885 at commit [`337ef01`](https://github.com/apache/spark/commit/337ef01d06237b613d04011795b73c564b4b3e54).
[GitHub] spark issue #15885: [SPARK-18440][Structured Streaming] Pass correct query e...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/15885 @marmbrus @rxin Can you take a look?
[GitHub] spark pull request #15885: [SPARK-18440][Structured Streaming] Pass correct ...
GitHub user tdas opened a pull request: https://github.com/apache/spark/pull/15885 [SPARK-18440][Structured Streaming] Pass correct query execution to FileFormatWriter ## What changes were proposed in this pull request? SPARK-18012 refactored the file write path in FileStreamSink using FileFormatWriter, which always uses the default non-streaming QueryExecution to perform the writes. This is wrong for FileStreamSink, because the streaming QueryExecution (i.e. IncrementalExecution) should be used for correctly incrementalizing aggregation. With the addition of watermarks in SPARK-18124, the file stream sink should logically support aggregation + watermark + append mode. But it actually fails with ``` 16:23:07.389 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark timestamp#7: timestamp, interval 10 seconds +- LocalRelation [timestamp#7] at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) at
scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) ``` This PR fixes it by passing the correct query execution. ## How was this patch tested? New unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/tdas/spark SPARK-18440 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15885.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15885 commit 337ef01d06237b613d04011795b73c564b4b3e54 Author: Tathagata Das Date: 2016-11-15T00:48:47Z Pass correct query execution to FileFormatWriter
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15659 Merged build finished. Test PASSed.
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15659 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68638/ Test PASSed.
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15659 **[Test build #68638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68638/consoleFull)** for PR 15659 at commit [`d753d80`](https://github.com/apache/spark/commit/d753d8094e5483e0da7577a85c0c2ed182de3e34). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15702
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user tdas commented on the issue: https://github.com/apache/spark/pull/15702 I am merging this to master and 2.1
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15702 Merged build finished. Test PASSed.
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15702 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68637/ Test PASSed.
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15702

**[Test build #68637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68637/consoleFull)** for PR 15702 at commit [`87d8618`](https://github.com/apache/spark/commit/87d8618234a86d666a711a97080e2b014214b84a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15884: [WIP][SPARK-18433][SQL] Improve DataSource option keys t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15884 **[Test build #68642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68642/consoleFull)** for PR 15884 at commit [`30eff08`](https://github.com/apache/spark/commit/30eff086159dabc8db7a46f6d4021c187d7fa4ed).
[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15883#discussion_r87924839

    --- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
    @@ -395,46 +396,56 @@ test_that("spark.mlp", {
       model2 <- read.ml(modelPath)
       summary2 <- summary(model2)
    -  expect_equal(summary2$labelCount, 3)
    +  expect_equal(summary2$numOfInputs, 4)
    +  expect_equal(summary2$numOfOutputs, 3)
       expect_equal(summary2$layers, c(4, 5, 4, 3))
       expect_equal(length(summary2$weights), 64)
       unlink(modelPath)
       # Test default parameter
    -  model <- spark.mlp(df, layers = c(4, 5, 4, 3))
    +  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3))
       mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
    -  expect_equal(head(mlpPredictions$prediction, 10), c(1, 1, 1, 1, 0, 1, 2, 2, 1, 0))
    +  expect_equal(head(mlpPredictions$prediction, 10),
    +               c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0"))
       # Test illegal parameter
    -  expect_error(spark.mlp(df, layers = NULL), "layers must be a integer vector with length > 1.")
    -  expect_error(spark.mlp(df, layers = c()), "layers must be a integer vector with length > 1.")
    -  expect_error(spark.mlp(df, layers = c(3)), "layers must be a integer vector with length > 1.")
    +  expect_error(spark.mlp(df, label ~ features, layers = NULL),
    +               "layers must be a integer vector with length > 1.")
    +  expect_error(spark.mlp(df, label ~ features, layers = c()),
    +               "layers must be a integer vector with length > 1.")
    +  expect_error(spark.mlp(df, label ~ features, layers = c(3)),
    --- End diff --

Is there a case for formula != `label ~ features`? Link to my comment above: https://github.com/apache/spark/pull/15883/files#r87923913
[GitHub] spark pull request #15884: [WIP][SPARK-18433][SQL] Improve DataSource option...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/15884

[WIP][SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive

## What changes were proposed in this pull request?

This PR aims to improve DataSource option keys to be more case-insensitive. DataSource only partially uses CaseInsensitiveMap in its code path. For example, the following fails to find the url.

```scala
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
  .option("URL", url1)
  .option("dbtable", "TEST.SAVETEST")
  .options(properties.asScala)
  .save()
```

This PR makes DataSource options use CaseInsensitiveMap internally, and also makes DataSource use CaseInsensitiveMap generally, except for `InMemoryFileIndex` and `InsertIntoHadoopFsRelationCommand`. We cannot pass them a CaseInsensitiveMap because they create new case-sensitive HadoopConfs by calling newHadoopConfWithOptions(options) inside.

## How was this patch tested?

Pass the Jenkins test with newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-18433

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15884.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15884

commit 30eff086159dabc8db7a46f6d4021c187d7fa4ed
Author: Dongjoon Hyun
Date: 2016-11-14T08:59:23Z

    [SPARK-18433][SQL] Improve DataSource option keys to be more case-insensitive
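The case-insensitive lookup this pull request describes can be sketched in plain Scala. This is only an illustration of the idea, not Spark's `CaseInsensitiveMap`: the class name `CaseInsensitiveOptions`, the demo object, and the sample JDBC URL are all assumptions made up for the example.

```scala
import java.util.Locale

// Hypothetical stand-in for illustration; not Spark's CaseInsensitiveMap.
final class CaseInsensitiveOptions(original: Map[String, String]) {
  // Normalize every key once, so lookups ignore the caller's casing.
  private val lowered: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    lowered.get(key.toLowerCase(Locale.ROOT))
}

object CaseInsensitiveOptionsDemo {
  // Sample options; the URL value is made up for this sketch.
  val options = new CaseInsensitiveOptions(
    Map("URL" -> "jdbc:h2:mem:testdb", "dbtable" -> "TEST.SAVETEST"))

  def main(args: Array[String]): Unit = {
    println(options.get("url"))     // the key was supplied as "URL"
    println(options.get("DBTABLE")) // the key was supplied as "dbtable"
  }
}
```

One consequence of normalizing keys this way is that two options differing only in case collapse into a single entry, so a wrapper like this is probably unsuitable wherever the original casing has to be passed on unchanged.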
[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15883#discussion_r87923954

    --- Diff: R/pkg/R/mllib.R ---
    @@ -896,9 +898,10 @@ setMethod("summary", signature(object = "LogisticRegressionModel"),
     #' summary(savedModel)
     #' }
     #' @note spark.mlp since 2.1.0
    --- End diff --

We are targeting 2.1.0 for this change, yes? Otherwise it is a breaking signature change.
[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15883#discussion_r87923913

    --- Diff: R/pkg/R/mllib.R ---
    @@ -896,9 +898,10 @@ setMethod("summary", signature(object = "LogisticRegressionModel"),
     #' summary(savedModel)
     #' }
     #' @note spark.mlp since 2.1.0
    -setMethod("spark.mlp", signature(data = "SparkDataFrame"),
    -          function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
    +setMethod("spark.mlp", signature(data = "SparkDataFrame", formula = "formula"),
    +          function(data, formula, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
    --- End diff --

If calling this without `formula` worked before, should `formula` be optional then? With this change it becomes required.
[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15883#discussion_r87923822

    --- Diff: R/pkg/R/mllib.R ---
    @@ -896,9 +898,10 @@ setMethod("summary", signature(object = "LogisticRegressionModel"),
     #' summary(savedModel)
     #' }
     #' @note spark.mlp since 2.1.0
    -setMethod("spark.mlp", signature(data = "SparkDataFrame"),
    -          function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
    +setMethod("spark.mlp", signature(data = "SparkDataFrame", formula = "formula"),
    +          function(data, formula, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
                       tol = 1E-6, stepSize = 0.03, seed = NULL, initialWeights = NULL) {
    +            formula <- paste(deparse(formula), collapse = "")
    --- End diff --

Should this use `paste0`? `paste0(deparse(formula), collapse = "")`; otherwise you get one space between each term back.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/15868 @gatorsmile , I addressed all comments.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15868 **[Test build #68641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68641/consoleFull)** for PR 15868 at commit [`3378b5e`](https://github.com/apache/spark/commit/3378b5e040041f1af1159d07e3d3b1ef47c6c8c1).
[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15852 Merged build finished. Test FAILed.
[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15852 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68639/ Test FAILed.
[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15852

**[Test build #68639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68639/consoleFull)** for PR 15852 at commit [`efa7022`](https://github.com/apache/spark/commit/efa7022bcc2e8b169c7dd109d878439ac9f058a9).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15878: [SPARK-18430] [SQL] Fixed Exception Messages when Hittin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15878 Merged build finished. Test PASSed.
[GitHub] spark issue #15878: [SPARK-18430] [SQL] Fixed Exception Messages when Hittin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15878 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68636/ Test PASSed.
[GitHub] spark issue #15878: [SPARK-18430] [SQL] Fixed Exception Messages when Hittin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15878

**[Test build #68636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68636/consoleFull)** for PR 15878 at commit [`918aa25`](https://github.com/apache/spark/commit/918aa2551300b2c5e1e29feb8a8c3315c623a146).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/15704 Thank you, @hvanhovell !
[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15704 LGTM - pending jenkins
[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15704 **[Test build #68640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68640/consoleFull)** for PR 15704 at commit [`fab5682`](https://github.com/apache/spark/commit/fab5682ab4c78fc23f0d2db40ae6338e2d5dbab3).
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/15868 Thank you, @gatorsmile ! I'll update this soon.
[GitHub] spark issue #15874: [Spark-18408] API Improvements for LSH
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15874 Can you please add "[ML]" to the PR title?
[GitHub] spark issue #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspark for ...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/15817 Can you please add "[ML]" to the PR description? Thanks!
[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15868#discussion_r87915156

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -667,7 +667,14 @@ object JdbcUtils extends Logging {
         val getConnection: () => Connection = createConnectionFactory(options)
         val batchSize = options.batchSize
         val isolationLevel = options.isolationLevel
    -    df.foreachPartition(iterator => savePartition(
    +    val numPartitions = options.numPartitions
    +    val repartitionedDF =
    +      if (numPartitions != null && numPartitions.toInt != df.rdd.getNumPartitions) {
    --- End diff --

Increasing the number of partitions can improve the insert performance in some scenarios, I think. However, `repartition` is not cheap.
[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15704#discussion_r87914133

    --- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
    @@ -243,7 +243,7 @@ partitionSpec
         ;
     partitionVal
    -    : identifier (EQ constant)?
    +    : expression
    --- End diff --

It's removed now.
[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15868#discussion_r87913599

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -667,7 +667,14 @@ object JdbcUtils extends Logging {
         val getConnection: () => Connection = createConnectionFactory(options)
         val batchSize = options.batchSize
         val isolationLevel = options.isolationLevel
    -    df.foreachPartition(iterator => savePartition(
    +    val numPartitions = options.numPartitions
    +    val repartitionedDF =
    +      if (numPartitions != null && numPartitions.toInt != df.rdd.getNumPartitions) {
    +        df.repartition(numPartitions.toInt)
    --- End diff --

Is it OK to use `coalesce` here?
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15868 Merged build finished. Test PASSed.
[GitHub] spark issue #15682: [SPARK-18169][SQL] Suppress warnings when dropping views...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15682 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68635/ Test PASSed.
[GitHub] spark issue #15682: [SPARK-18169][SQL] Suppress warnings when dropping views...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15682 Merged build finished. Test PASSed.
[GitHub] spark issue #15682: [SPARK-18169][SQL] Suppress warnings when dropping views...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15682

**[Test build #68635 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68635/consoleFull)** for PR 15682 at commit [`fef9981`](https://github.com/apache/spark/commit/fef9981ac140112c05f40c093b2174d1584caaf9).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15868

**[Test build #68634 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68634/consoleFull)** for PR 15868 at commit [`93916b1`](https://github.com/apache/spark/commit/93916b13b902292c09a5bbe67ed083e3e891f4b0).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15868 `numPartitions` might not be a good name for this purpose. How about `maxConnections`?
[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15704#discussion_r87910722

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
    @@ -418,27 +419,58 @@ case class AlterTableRenamePartitionCommand(
      */
     case class AlterTableDropPartitionCommand(
         tableName: TableIdentifier,
    -    specs: Seq[TablePartitionSpec],
    +    specs: Seq[Expression],
         ifExists: Boolean,
         purge: Boolean)
    -  extends RunnableCommand {
    +  extends RunnableCommand with PredicateHelper {
    +
    +  private def isRangeComparison(expr: Expression): Boolean = {
    +    expr.find(e => e.isInstanceOf[BinaryComparison] && !e.isInstanceOf[EqualTo]).isDefined
    +  }
       override def run(sparkSession: SparkSession): Seq[Row] = {
         val catalog = sparkSession.sessionState.catalog
         val table = catalog.getTableMetadata(tableName)
    +    val resolver = sparkSession.sessionState.conf.resolver
         DDLUtils.verifyAlterTableType(catalog, table, isView = false)
         DDLUtils.verifyPartitionProviderIsHive(sparkSession, table, "ALTER TABLE DROP PARTITION")
    -    val normalizedSpecs = specs.map { spec =>
    -      PartitioningUtils.normalizePartitionSpec(
    -        spec,
    -        table.partitionColumnNames,
    -        table.identifier.quotedString,
    -        sparkSession.sessionState.conf.resolver)
    +    specs.foreach { expr =>
    +      expr.references.foreach { attr =>
    +        if (!table.partitionColumnNames.exists(resolver(_, attr.name))) {
    +          throw new AnalysisException(s"${attr.name} is not a valid partition column " +
    +            s"in table ${table.identifier.quotedString}.")
    +        }
    +      }
         }
    -    catalog.dropPartitions(
    -      table.identifier, normalizedSpecs, ignoreIfNotExists = ifExists, purge = purge)
    +    if (specs.exists(isRangeComparison)) {
    +      val partitionSet = scala.collection.mutable.Set.empty[CatalogTablePartition]
    +      specs.foreach { spec =>
    +        val partitions = catalog.listPartitionsByFilter(table.identifier, Seq(spec))
    +        if (partitions.nonEmpty) {
    +          partitionSet ++= partitions
    +        } else if (!ifExists) {
    +          throw new AnalysisException(s"There is no partition for ${spec.sql}")
    +        }
    +      }
    +      catalog.dropPartitions(table.identifier, partitionSet.map(_.spec).toSeq,
    +        ignoreIfNotExists = ifExists, purge = purge)
    +    } else {
    +      val normalizedSpecs = specs.map { expr =>
    +        val spec = splitConjunctivePredicates(expr).map {
    +          case BinaryComparison(left, right) =>
    --- End diff --

Thanks.
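The range check in the diff above decides between two drop paths: exact partition specs when every comparison is an equality, and a filter-based partition listing otherwise. A minimal sketch of that check follows, using a hypothetical, simplified expression tree instead of Catalyst's classes. All type names here are invented for the example, and unlike `expr.find` in the diff, this version only descends through `And` nodes.

```scala
object RangeComparisonSketch {
  // Simplified, hypothetical expression tree; Catalyst's real classes are richer.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  sealed trait Comparison extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Comparison
  final case class LessThan(left: Expr, right: Expr) extends Comparison
  final case class GreaterThan(left: Expr, right: Expr) extends Comparison

  // True when any comparison in the conjunction is something other than equality.
  def isRangeComparison(expr: Expr): Boolean = expr match {
    case And(l, r)     => isRangeComparison(l) || isRangeComparison(r)
    case _: EqualTo    => false
    case _: Comparison => true
    case _             => false
  }

  def main(args: Array[String]): Unit = {
    val exact = And(EqualTo(Attr("country"), Lit("KR")), EqualTo(Attr("quarter"), Lit("4")))
    val range = And(EqualTo(Attr("country"), Lit("KR")), LessThan(Attr("quarter"), Lit("3")))
    println(isRangeComparison(exact)) // equality only
    println(isRangeComparison(range)) // contains a range predicate
  }
}
```

Matching `EqualTo` before the general `Comparison` case mirrors the `isInstanceOf[BinaryComparison] && !isInstanceOf[EqualTo]` test in the patch.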
[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/15704#discussion_r87910682 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala --- @@ -215,8 +215,14 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] { if (overwrite.enabled) { val deletedPartitions = initialMatchingPartitions.toSet -- updatedPartitions if (deletedPartitions.nonEmpty) { + import org.apache.spark.sql.catalyst.expressions._ + val expressions = deletedPartitions.map { specs => +specs.map { case (key, value) => + EqualTo(AttributeReference(key, StringType)(), Literal.create(value, StringType)) +}.reduceLeft(org.apache.spark.sql.catalyst.expressions.And) --- End diff -- Yep.
[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15868#discussion_r87910285 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -667,7 +667,14 @@ object JdbcUtils extends Logging { val getConnection: () => Connection = createConnectionFactory(options) val batchSize = options.batchSize val isolationLevel = options.isolationLevel -df.foreachPartition(iterator => savePartition( +val numPartitions = options.numPartitions +val repartitionedDF = + if (numPartitions != null && numPartitions.toInt != df.rdd.getNumPartitions) { --- End diff -- Based on my understanding, users normally care only about the maximum number of connections. Thus, there is no need to repartition when `numPartitions.toInt >= df.rdd.getNumPartitions`, right?
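The suggestion above can be sketched without Spark as a small decision function. This is only an illustration under assumptions: `PartitionCap` and `effectivePartitions` are hypothetical names that do not appear in the patch, and the user-supplied value is modelled as an `Option[Int]` rather than the nullable string in the diff.

```scala
// Hypothetical helper, not part of the patch: decide how many partitions
// (and therefore concurrent JDBC connections) a write should use.
object PartitionCap {
  // requested: the user-supplied upper bound, if any.
  // current:   the partition count the data already has.
  def effectivePartitions(requested: Option[Int], current: Int): Int = requested match {
    case Some(n) if n <= 0 =>
      throw new IllegalArgumentException(s"Invalid partition cap: $n (must be positive)")
    case Some(n) if n < current => n // shrink only when the cap is actually exceeded
    case _ => current                // no cap, or cap already satisfied: leave as-is
  }
}
```

With this shape, a cap of 4 on 10 partitions yields 4, while a cap of 10 on 4 partitions leaves the 4 partitions untouched, so no repartitioning is triggered when the limit is already met.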
[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/15704 Thank you for the review, again. I'll fix them soon.
[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15868#discussion_r87909790 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -70,6 +70,9 @@ class JDBCOptions( } } + // the number of partitions --- End diff -- This is not clear. The document needs an update. http://spark.apache.org/docs/latest/sql-programming-guide.html
[GitHub] spark pull request #14720: SPARK-12868: Allow Add jar to add jars from hdfs/...
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/14720#discussion_r87908473 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala --- @@ -856,6 +856,17 @@ class HiveQuerySuite extends HiveComparisonTest with BeforeAndAfter { sql("DROP TABLE alter1") } + test("SPARK-12868 ADD JAR FROM HDFS") { +val testJar = "hdfs://nn:8020/foo.jar" +// This should fail with unknown host, as its just testing the URL parsing +// before SPARK-12868 it was failing with Malformed URI +val e = intercept[RuntimeException] { --- End diff -- I think this test should be improved before merging this. Looking for a RuntimeException to validate that the Jar was registered is brittle and can easily pass when the registration doesn't actually work.
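One way to make the check less brittle, sketched here as an assumption rather than as what the PR ended up doing, is to assert on the parsed URL directly instead of on a generic exception type. The example uses only `java.net.URI` from the JDK; `JarUriCheck` and `describe` are hypothetical names, and the sketch covers URL parsing only, not jar registration in Hive.

```scala
import java.net.URI

// Hypothetical helper: break a jar location into the parts a test can assert on.
object JarUriCheck {
  // Returns (scheme, host, port, path) for a jar location string.
  // new URI(...) throws java.net.URISyntaxException on malformed input,
  // so a parsing regression fails the test with a specific cause.
  def describe(location: String): (String, String, Int, String) = {
    val uri = new URI(location)
    (uri.getScheme, uri.getHost, uri.getPort, uri.getPath)
  }
}
```

Asserting that `describe("hdfs://nn:8020/foo.jar")` yields the `hdfs` scheme, host `nn`, port `8020`, and path `/foo.jar` only passes when the parsing really worked, whereas intercepting a `RuntimeException` passes for any failure.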
[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15852 **[Test build #68639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68639/consoleFull)** for PR 15852 at commit [`efa7022`](https://github.com/apache/spark/commit/efa7022bcc2e8b169c7dd109d878439ac9f058a9).
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14638 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68633/ Test PASSed.
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15659 **[Test build #68638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68638/consoleFull)** for PR 15659 at commit [`d753d80`](https://github.com/apache/spark/commit/d753d8094e5483e0da7577a85c0c2ed182de3e34).
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14638 Merged build finished. Test PASSed.
[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14638 **[Test build #68633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68633/consoleFull)** for PR 14638 at commit [`3c06aa6`](https://github.com/apache/spark/commit/3c06aa6679700b4d770889aa2f766a01f851ec43). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15868 Merged build finished. Test PASSed.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68630/ Test PASSed.
[GitHub] spark issue #15840: [SPARK-18398][SQL] Fix nullabilities of MapObjects and o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15840 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68629/ Test PASSed.
[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15868 **[Test build #68630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68630/consoleFull)** for PR 15868 at commit [`c926012`](https://github.com/apache/spark/commit/c9260122ce47d90267e434dfbef75ee66f345547). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15840: [SPARK-18398][SQL] Fix nullabilities of MapObjects and o...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15840 **[Test build #68629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68629/consoleFull)** for PR 15840 at commit [`ec0c55c`](https://github.com/apache/spark/commit/ec0c55c73c080f887c0914de7601698dc1c82c57). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15702 **[Test build #68637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68637/consoleFull)** for PR 15702 at commit [`87d8618`](https://github.com/apache/spark/commit/87d8618234a86d666a711a97080e2b014214b84a).
[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15880 The Oracle documentation has a section on `Implicit Data Conversion`: https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements002.htm It clearly documents that implicit conversion behavior may change and encourages users to do explicit casting. > Algorithms for implicit conversion are subject to change across software releases and among Oracle products. Behavior of explicit conversions is more predictable.
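A small plain-Scala illustration of why the direction of a conversion matters when a long is compared with a string. It does not reproduce Spark's or Oracle's coercion rules; `ConversionDemo` and its two methods are hypothetical names used only to show that comparing the same values as strings and as numbers can give opposite answers, which is why an explicit cast is the more predictable choice.

```scala
// Hypothetical demo: the same two values compared under two conversion choices.
object ConversionDemo {
  // Compare as strings: lexicographic order, character by character.
  def lessThanAsStrings(a: Long, b: String): Boolean = a.toString < b

  // Compare as numbers: the string side is explicitly cast to Long first.
  // toLong throws NumberFormatException if b is not a valid integer literal.
  def lessThanAsLongs(a: Long, b: String): Boolean = a < b.toLong
}
```

For the pair `10L` and `"9"`, the string comparison says 10 is smaller (because `'1'` sorts before `'9'`), while the numeric comparison says it is not. An engine that silently picks one of the two can change query results if that choice ever changes between releases.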
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15702 jenkins test this please
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15702 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68631/ Test FAILed.
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15702 Merged build finished. Test FAILed.
[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15702 **[Test build #68631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68631/consoleFull)** for PR 15702 at commit [`87d8618`](https://github.com/apache/spark/commit/87d8618234a86d666a711a97080e2b014214b84a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/15704#discussion_r87892226 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala --- @@ -215,8 +215,14 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] { if (overwrite.enabled) { val deletedPartitions = initialMatchingPartitions.toSet -- updatedPartitions if (deletedPartitions.nonEmpty) { + import org.apache.spark.sql.catalyst.expressions._ + val expressions = deletedPartitions.map { specs => +specs.map { case (key, value) => + EqualTo(AttributeReference(key, StringType)(), Literal.create(value, StringType)) +}.reduceLeft(org.apache.spark.sql.catalyst.expressions.And) --- End diff -- just `And`?