[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15868
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15868
  
**[Test build #68641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68641/consoleFull)** for PR 15868 at commit [`3378b5e`](https://github.com/apache/spark/commit/3378b5e040041f1af1159d07e3d3b1ef47c6c8c1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...

2016-11-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15857#discussion_r87932854
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -428,43 +428,47 @@ object FoldablePropagation extends Rule[LogicalPlan] {
   }
   case _ => Nil
 })
+val replaceFoldable: PartialFunction[Expression, Expression] = {
+  case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+}
 
 if (foldableMap.isEmpty) {
   plan
 } else {
   var stop = false
   CleanupAliases(plan.transformUp {
-case u: Union =>
-  stop = true
-  u
-case c: Command =>
-  stop = true
-  c
-// For outer join, although its output attributes are derived from its children, they are
-// actually different attributes: the output of outer join is not always picked from its
-// children, but can also be null.
+// Allow all leafnodes
--- End diff --

ah i see





[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...

2016-11-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15857#discussion_r87932789
  
--- Diff: sql/core/src/test/resources/sql-tests/results/group-by.sql.out ---
@@ -131,3 +131,11 @@ FROM testData
 struct
 -- !query 13 output
-0.2723801058145729  -1.5069204152249134  1  3  2.142857142857143  0.8095238095238094  0.8997354108424372  15  7
+
+
+-- !query 14
+SELECT COUNT(DISTINCT b), COUNT(DISTINCT b, c) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a
--- End diff --

Is it also a regression test? I think you are just fixing `Expand` in this PR.





[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...

2016-11-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15857#discussion_r87932636
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FoldablePropagationSuite.scala ---
@@ -118,14 +118,30 @@ class FoldablePropagationSuite extends PlanTest {
   Seq(
 testRelation.select(Literal(1).as('x), 'a).select('x + 'a),
 testRelation.select(Literal(2).as('x), 'a).select('x + 'a)))
-  .select('x)
 val optimized = Optimize.execute(query.analyze)
 val correctAnswer = Union(
   Seq(
 testRelation.select(Literal(1).as('x), 'a).select((Literal(1).as('x) + 'a).as("(x + a)")),
 testRelation.select(Literal(2).as('x), 'a).select((Literal(2).as('x) + 'a).as("(x + a)"
-  .select('x).analyze
--- End diff --

how can this test pass before...





[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...

2016-11-14 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15857#discussion_r87932525
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -428,43 +428,47 @@ object FoldablePropagation extends Rule[LogicalPlan] {
   }
   case _ => Nil
 })
+val replaceFoldable: PartialFunction[Expression, Expression] = {
+  case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+}
 
 if (foldableMap.isEmpty) {
   plan
 } else {
   var stop = false
   CleanupAliases(plan.transformUp {
-case u: Union =>
-  stop = true
-  u
-case c: Command =>
-  stop = true
-  c
-// For outer join, although its output attributes are derived from its children, they are
-// actually different attributes: the output of outer join is not always picked from its
-// children, but can also be null.
+// Allow all leafnodes
--- End diff --

LeafNodes should not stop the folding process. That is what I am trying to say.





[GitHub] spark pull request #15857: [SPARK-18300][SQL] Do not apply foldable propagat...

2016-11-14 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15857#discussion_r87932494
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -428,43 +428,47 @@ object FoldablePropagation extends Rule[LogicalPlan] {
   }
   case _ => Nil
 })
+val replaceFoldable: PartialFunction[Expression, Expression] = {
+  case a: AttributeReference if foldableMap.contains(a) => foldableMap(a)
+}
 
 if (foldableMap.isEmpty) {
   plan
 } else {
   var stop = false
   CleanupAliases(plan.transformUp {
-case u: Union =>
-  stop = true
-  u
-case c: Command =>
-  stop = true
-  c
-// For outer join, although its output attributes are derived from its children, they are
-// actually different attributes: the output of outer join is not always picked from its
-// children, but can also be null.
+// Allow all leafnodes
+case l: LeafNode =>
+  l
+
+// Whitelist of all nodes we are allowed to apply this rule to.
+case p @ (_: Project | _: Filter | _: SubqueryAlias | _: Aggregate | _: Window |
+  _: Sample | _: GlobalLimit | _: LocalLimit | _: Generate | _: Distinct |
+  _: AppendColumns | _: AppendColumnsWithObject | _: BroadcastHint |
+  _: RedistributeData | _: Repartition | _: Sort | _: TypedFilter) if !stop =>
+  p.transformExpressions(replaceFoldable)
+
+// Allow inner joins. We do not allow outer join, although its output attributes are
+// derived from its children, they are actually different attributes: the output of outer
+// join is not always picked from its children, but can also be null.
 // TODO(cloud-fan): It seems more reasonable to use new attributes as the output attributes
 // of outer join.
-case j @ Join(_, _, LeftOuter | RightOuter | FullOuter, _) =>
+case j @ Join(_, _, Inner, _) =>
+  j.transformExpressions(replaceFoldable)
+
+// We can fold the projections an expand holds. However expand changes the output columns
+// and often reuses the underlying attributes; so we cannot assume that a column is still
+// foldable after the expand has been applied.
+case expand: Expand if !stop =>
--- End diff --

should we add a TODO that `Expand` should always output new attributes?
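
For illustration, a hypothetical standalone sketch (not from this PR) of the outer-join hazard discussed in this thread: a column that is a foldable literal on one side of a join can come out NULL on the other side, so substituting the literal would change results. All names below are invented.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object OuterJoinFoldingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // `x` is the foldable literal 1 everywhere inside `left`...
    val left  = spark.range(3).select($"id", lit(1).as("x"))
    val right = spark.range(5).select($"id".as("rid"))

    // ...but after a RIGHT OUTER join, rows rid = 3 and 4 have no match, so
    // `x` is NULL there. Folding coalesce(x, 0) into coalesce(1, 0) would
    // wrongly return 1 for those rows instead of 0.
    left.join(right, $"id" === $"rid", "right_outer")
      .select($"rid", coalesce($"x", lit(0)).as("x_or_zero"))
      .show()

    spark.stop()
  }
}
```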





[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15279
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68646/
Test FAILed.





[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15279
  
**[Test build #68646 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68646/consoleFull)** for PR 15279 at commit [`c566a5b`](https://github.com/apache/spark/commit/c566a5bfe72aa9be10d9b3f90ea18ec0d0382f93).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15279
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15279
  
**[Test build #68646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68646/consoleFull)** for PR 15279 at commit [`c566a5b`](https://github.com/apache/spark/commit/c566a5bfe72aa9be10d9b3f90ea18ec0d0382f93).





[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15279
  
ok to test





[GitHub] spark issue #15279: SPARK-12347 [ML][WIP] Add a script to test Spark ML exam...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15279
  
Can you please change the title to have: "SPARK-12347" -> "[SPARK-12347]"?





[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...

2016-11-14 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15880
  
+1 on the postgres approach





[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...

2016-11-14 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15880
  
Ok, we need to make a decision here: follow Hive and give a warning message, or follow Postgres and cast the string to the type of the other side.

Personally I prefer the Postgres way; I think it's always better than blindly casting both sides to double.

cc @rxin @marmbrus
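
For concreteness, a small standalone Scala sketch (not from this PR) of why casting both sides to double is lossy: longs above 2^53 cannot all be represented exactly as doubles, so distinct values can collapse to the same double and compare equal.

```scala
// Plain Scala, no Spark required.
val a = 9223372036854775807L      // Long.MaxValue
val b = 9223372036854775806L
println(a == b)                   // false: the longs are distinct
println(a.toDouble == b.toDouble) // true: both round to the same double,
                                  // which is the hazard of double coercion
```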





[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15880
  
**[Test build #68645 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68645/consoleFull)** for PR 15880 at commit [`1506d40`](https://github.com/apache/spark/commit/1506d406b5596a557a5c86f16b180239850901ad).





[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...

2016-11-14 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15877#discussion_r87929682
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription}
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.sketch.CountMinSketch
+
+/**
+ * This function returns a count-min sketch of a column with the given eps, confidence and seed.
+ * A count-min sketch is a probabilistic data structure used for summarizing streams of data in
+ * sub-linear space, which is useful for equality predicates and join size estimation.
+ *
+ * @param child child expression that can produce column value with `child.eval(inputRow)`
+ * @param epsExpression relative error, must be positive
+ * @param confidenceExpression confidence, must be positive and less than 1.0
+ * @param seedExpression random seed
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps,
+      confidence and seed. The result is an array of bytes, which should be deserialized to a
+      `CountMinSketch` before usage. `CountMinSketch` is useful for equality predicates and join
+      size estimation.
+  """)
+case class CountMinSketchAgg(
+    child: Expression,
+    epsExpression: Expression,
+    confidenceExpression: Expression,
+    seedExpression: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[CountMinSketch] {
+
+  def this(
+      child: Expression,
+      epsExpression: Expression,
+      confidenceExpression: Expression,
+      seedExpression: Expression) = {
+    this(child, epsExpression, confidenceExpression, seedExpression, 0, 0)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
--- End diff --

That is fair.
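
For background on the structure the quoted aggregate builds, a minimal standalone usage sketch of Spark's `CountMinSketch` utility (the sketch type itself, not the new aggregate function):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.util.sketch.CountMinSketch

// eps (relative error), confidence, seed -- the same three parameters the
// aggregate takes as foldable expressions.
val cms = CountMinSketch.create(0.01, 0.95, 42)
Seq("a", "a", "b").foreach(item => cms.add(item))
println(cms.estimateCount("a")) // at least 2: a CMS can over-count, never under-count

// Round-trip through bytes, mirroring the "deserialize the result before
// usage" step in the quoted usage string.
val out = new ByteArrayOutputStream()
cms.serializeTo(out)
val restored = CountMinSketch.readFrom(new ByteArrayInputStream(out.toByteArray))
println(restored.estimateCount("a"))
```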





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87906133
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -31,13 +31,9 @@ import org.apache.spark.sql.types.StructType
 /**
  * :: Experimental ::
  *
- * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
- * a perfect hash function:
- *    `h_i(x) = (x * k_i mod prime) mod numEntries`
- * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_prime^*`
- *
- * Reference:
- * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]]
+ * Model produced by [[MinHashLSH]], where multiple hash functions are stored. Each hash function is
+ * a perfect hash function for a specific set `S` with cardinality equal to a half of `numEntries`:
--- End diff --

I'm not following exactly why the cardinality of `S` is _half_ of `numEntries`. Actually, why is the threshold for feature dimensionality `prime / 2`?
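
For reference, a standalone sketch of the hash family described in the quoted doc, `h_i(x) = ((1 + x) * k_i mod prime) mod numEntries` minimized over the set's indices, matching the implementation quoted elsewhere in this thread; the function and argument names here are illustrative, not from the PR:

```scala
// The constant quoted elsewhere in this thread: a large prime < sqrt(2^63 - 1).
val prime = 2038074743L

// One min-hash value for one random coefficient k_i (illustrative names).
def minHash(indices: Seq[Int], coeff: Int, numEntries: Int): Double =
  indices.map(elem => (1L + elem) * coeff % prime % numEntries).min.toDouble

println(minHash(Seq(2, 3, 5), coeff = 12345, numEntries = 100))
```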





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87906309
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
     override val uid: String,
-    @Since("2.1.0") val numEntries: Int,
-    @Since("2.1.0") val randCoefficients: Array[Int])
+    @Since("2.1.0") private[ml] val numEntries: Int,
+    @Since("2.1.0") private[ml] val randCoefficients: Array[Int])
   extends LSHModel[MinHashModel] {
 
   @Since("2.1.0")
-  override protected[ml] val hashFunction: Vector => Vector = {
-    elems: Vector =>
+  override protected[ml] val hashFunction: Vector => Array[Vector] = {
+    elems: Vector => {
       require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
       val elemsList = elems.toSparse.indices.toList
       val hashValues = randCoefficients.map({ randCoefficient: Int =>
-      elemsList.map({elem: Int =>
-        (1 + elem) * randCoefficient.toLong % MinHash.prime % numEntries
-      }).min.toDouble
+        elemsList.map({ elem: Int =>
--- End diff --

redundant brackets. Just use `elemsList.map { elem: Int =>`





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87844941
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +123,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
* transformed data when necessary.
*
* This method implements two ways of fetching k nearest neighbors:
-   *  - Single Probing: Fast, return at most k elements (Probing only one buckets)
-   *  - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *  - Single-probe: Fast, return at most k elements (Probing only one buckets)
--- End diff --

"Probing only one bucket"





[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...

2016-11-14 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/15877#discussion_r87929562
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription}
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.sketch.CountMinSketch
+
+/**
+ * This function returns a count-min sketch of a column with the given eps, confidence and seed.
+ * A count-min sketch is a probabilistic data structure used for summarizing streams of data in
+ * sub-linear space, which is useful for equality predicates and join size estimation.
+ *
+ * @param child child expression that can produce column value with `child.eval(inputRow)`
+ * @param epsExpression relative error, must be positive
+ * @param confidenceExpression confidence, must be positive and less than 1.0
+ * @param seedExpression random seed
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps,
+      confidence and seed. The result is an array of bytes, which should be deserialized to a
+      `CountMinSketch` before usage. `CountMinSketch` is useful for equality predicates and join
+      size estimation.
+  """)
+case class CountMinSketchAgg(
+    child: Expression,
+    epsExpression: Expression,
+    confidenceExpression: Expression,
+    seedExpression: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[CountMinSketch] {
+
+  def this(
+      child: Expression,
+      epsExpression: Expression,
+      confidenceExpression: Expression,
+      seedExpression: Expression) = {
+    this(child, epsExpression, confidenceExpression, seedExpression, 0, 0)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!epsExpression.foldable || !confidenceExpression.foldable ||
+      !seedExpression.foldable) {
+      TypeCheckFailure(
+        "The eps, confidence or seed provided must be a literal or constant foldable")
+    } else if (epsExpression.eval() == null || confidenceExpression.eval() == null ||
+      seedExpression.eval() == null) {
+      TypeCheckFailure("The eps, confidence or seed provided should not be null")
+    } else {
+      // parameter validity will be checked in CountMinSketchImpl
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): CountMinSketch = {
+    val eps: Double = epsExpression.eval().asInstanceOf[Double]
--- End diff --

Ok, I'll change them to lazy vals.
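
A class-body fragment sketching what that change could look like (an assumption about the eventual shape, not the merged code):

```scala
// Hypothetical lazy-val shape: evaluate the foldable argument expressions
// once, on first use, instead of on every createAggregationBuffer() call.
private lazy val eps: Double = epsExpression.eval().asInstanceOf[Double]
private lazy val confidence: Double = confidenceExpression.eval().asInstanceOf[Double]
private lazy val seed: Int = seedExpression.eval().asInstanceOf[Int]

override def createAggregationBuffer(): CountMinSketch =
  CountMinSketch.create(eps, confidence, seed)
```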





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87906709
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
     override val uid: String,
-    @Since("2.1.0") val numEntries: Int,
-    @Since("2.1.0") val randCoefficients: Array[Int])
+    @Since("2.1.0") private[ml] val numEntries: Int,
--- End diff --

no since tags for private values.





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87874869
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -66,10 +66,10 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   self: T =>
 
   /**
-   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * The hash function of LSH, mapping an input feature to multiple vectors
--- End diff --

"mapping an input feature vector to multiple hash vectors."





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87878252
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
  */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

Also, the comment above says:

 * ... For example,
 *    `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
 * means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
 * Also, any input vector must have at least 1 non-zero indices, and all non-zero values are treated
 * as binary "1" values.

Can we change it to:

 * ... For example,
 *    `Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0)))`
 * means there are 10 elements in the space. This set contains non-zero values at indices 2, 3, and
 * 5. Also, any input vector must have at least 1 non-zero index, and all non-zero values are
 * treated as binary "1" values.






[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87908012
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -125,11 +125,11 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
 
   @Since("2.1.0")
   override protected[ml] def createRawLSHModel(inputDim: Int): MinHashModel = {
-    require(inputDim <= MinHash.prime / 2,
-      s"The input vector dimension $inputDim exceeds the threshold ${MinHash.prime / 2}.")
+    require(inputDim <= MinHashLSH.prime / 2,
+      s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.prime / 2}.")
     val rand = new Random($(seed))
     val numEntry = inputDim * 2
-    val randCoofs: Array[Int] = Array.fill($(outputDim))(1 + rand.nextInt(MinHash.prime - 1))
+    val randCoofs: Array[Int] = Array.fill($(numHashTables))(1 + rand.nextInt(MinHashLSH.prime - 1))
--- End diff --

`randCoefs`





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87922281
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -179,16 +211,13 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
       inputName: String,
       explodeCols: Seq[String]): Dataset[_] = {
     require(explodeCols.size == 2, "explodeCols must be two strings.")
-    val vectorToMap = udf((x: Vector) => x.asBreeze.iterator.toMap,
-      MapType(DataTypes.IntegerType, DataTypes.DoubleType))
     val modelDataset: DataFrame = if (!dataset.columns.contains($(outputCol))) {
       transform(dataset)
     } else {
       dataset.toDF()
     }
     modelDataset.select(
-      struct(col("*")).as(inputName),
-      explode(vectorToMap(col($(outputCol.as(explodeCols))
+      struct(col("*")).as(inputName), posexplode(col($(outputCol))).as(explodeCols))
--- End diff --

Well here's a fun one. When I run this test:

```scala
  test("memory leak test") {
    val numDim = 50
    val data = {
      for (i <- 0 until numDim; j <- Seq(-2, -1, 1, 2))
        yield Vectors.sparse(numDim, Seq((i, j.toDouble)))
    }
    val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys")

    // Project from 100 dimensional Euclidean Space to 10 dimensions
    val brp = new BucketedRandomProjectionLSH()
      .setNumHashTables(10)
      .setInputCol("keys")
      .setOutputCol("values")
      .setBucketLength(2.5)
      .setSeed(12345)
    val model = brp.fit(df)
    val joined = model.approxSimilarityJoin(df, df, Double.MaxValue, "distCol")
    joined.show()
  }
```

I get the following error:

```
[info] - BucketedRandomProjectionLSH with high dimension data: test of LSH property *** FAILED *** (7 seconds, 568 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 205, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 33816576 bytes, TID = 205
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:295)
[info]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]  at java.lang.Thread.run(Thread.java:745)
```

Could you run the same test and see if you get an error?





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87904353
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala ---
@@ -24,7 +24,7 @@ import org.apache.spark.ml.util.DefaultReadWriteTest
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.Dataset
 
-class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
+class MinHashLSHSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
--- End diff --

Looking at the code for LSH, I see a few requires on input to some of the public methods, but there aren't tests for these edge cases. Specifically we should add:

**MinHash**
* tests for empty vectors (or all zero vectors)
* tests for `inputDim > prime / 2`

**LSH**
* Test for `numNearestNeighbors < 0`





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87875688
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala 
---
@@ -46,21 +42,23 @@ import org.apache.spark.sql.types.StructType
 @Since("2.1.0")
 class MinHashModel private[ml] (
--- End diff --

Not specifically related to this PR: I checked, and the default random uids used in the ML library never contain spaces. For more complex uids, it seems more common to use camel case, but I do see some with hyphens. Can we make the default uids `"mh-lsh"` and `"brp-lsh"` or similar?





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87928721
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala ---
@@ -89,23 +90,25 @@ class RandomProjectionModel private[ml] (
   }
 
   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    x.zip(y).map(vectorPair => Vectors.sqdist(vectorPair._1, vectorPair._2)).min
   }
 
   @Since("2.1.0")
   override def copy(extra: ParamMap): this.type = defaultCopy(extra)
 
   @Since("2.1.0")
-  override def write: MLWriter = new RandomProjectionModel.RandomProjectionModelWriter(this)
+  override def write: MLWriter = {
+    new BucketedRandomProjectionModel.BucketedRandomProjectionModelWriter(this)
+  }
 }
 
 /**
  * :: Experimental ::
  *
- * This [[RandomProjection]] implements Locality Sensitive Hashing functions for Euclidean
- * distance metrics.
+ * This [[BucketedRandomProjectionLSH]] implements Locality Sensitive Hashing functions for
+ * Euclidean distance metrics.
  *
  * The input is dense or sparse vectors, each of which represents a point in the Euclidean
  * distance space. The output will be vectors of configurable dimension. Hash value in the same
--- End diff --

"Hash values in the same dimension are calculated"





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87876322
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -102,8 +103,7 @@ class MinHashModel private[ml] (
  */
 @Experimental
 @Since("2.1.0")
-class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
-
+class MinHashLSH(override val uid: String) extends LSH[MinHashModel] with HasSeed {
--- End diff --

change the model names to reflect the new estimator names.





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87871105
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -106,22 +106,24 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
* transformed data when necessary.
*
* This method implements two ways of fetching k nearest neighbors:
-   *  - Single Probing: Fast, return at most k elements (Probing only one buckets)
-   *  - Multiple Probing: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *  - Single-probe: Fast, return at most k elements (Probing only one buckets)
+   *  - Multi-probe: Slow, return exact k elements (Probing multiple buckets close to the key)
+   *
+   * Currently it is made private since more discussion is needed for Multi-probe
--- End diff --

I don't understand the point here. Are you trying to make the `approxNearestNeighbors` method completely private? There is still a public overload of this method - which now shows up as the only method in the docs and just says "overloaded method for approxNearestNeighbors". This doc above does not show up.

As a general rule, we should always generate and closely inspect the docs to make sure that they are what we intend and that they make sense from an end user's perspective.
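
For reference, the overload that stays public takes only the dataset, a key vector, and k; a hedged usage sketch (assuming `model` is a fitted LSH model and `df` contains the configured input column):

```scala
import org.apache.spark.ml.linalg.Vectors

// Assumed: `model` is fitted and `df` has the input column the model expects.
val key = Vectors.sparse(50, Seq((1, 1.0), (3, 1.0)))
val neighbors = model.approxNearestNeighbors(df, key, 3)
neighbors.show()
```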





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87874663
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -35,26 +35,26 @@ private[ml] trait LSHParams extends HasInputCol with 
HasOutputCol {
   /**
* Param for the dimension of LSH OR-amplification.
*
-   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
-   * higher the dimension is, the lower the false negative rate.
+   * LSH OR-amplification can be used to reduce the false negative rate. The higher the dimension
--- End diff --

We are still using the word "dimension" here. It might also be useful to add that reducing false negatives comes at the cost of added computation. How does this sound?

   * Param for the number of hash tables used in LSH OR-amplification.
   *
   * LSH OR-amplification can be used to reduce the false negative rate. Higher values for this
   * param lead to a reduced false negative rate, at the expense of added computational complexity.
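
A sketch of that trade-off in use, with hypothetical settings and assuming the `numHashTables` param name this PR introduces: more hash tables lower the false negative rate, but every extra table means another hash evaluation per row, so transform and join cost grow with the setting.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val cheap = new BucketedRandomProjectionLSH()
  .setInputCol("keys").setOutputCol("hashes").setBucketLength(2.5)
  .setNumHashTables(1)   // fast, but more false negatives
val thorough = new BucketedRandomProjectionLSH()
  .setInputCol("keys").setOutputCol("hashes").setBucketLength(2.5)
  .setNumHashTables(10)  // slower, fewer false negatives
```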






[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87910679
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -74,9 +72,12 @@ class MinHashModel private[ml] (
   }
 
   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    // TODO: This hashDistance function is controversial. Requires more discussion.
+    x.zip(y).map(vectorPair =>
--- End diff --

At this point, I'm quite unsure, but this does not look to me like what was discussed [here](https://github.com/apache/spark/pull/15800#event-857283655). @jkbradley Can you confirm this is what you wanted?





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87875995
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -74,9 +72,12 @@ class MinHashModel private[ml] (
   }
 
   @Since("2.1.0")
-  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
     // Since it's generated by hashing, it will be a pair of dense vectors.
-    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+    // TODO: This hashDistance function is controversial. Requires more discussion.
--- End diff --

This is likely to confuse future developers. Let's just link it to a JIRA and note that it may be changed.





[GitHub] spark pull request #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15874#discussion_r87844308
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala ---
@@ -144,12 +152,12 @@ class MinHash(override val uid: String) extends LSH[MinHashModel] with HasSeed {
 }
 
 @Since("2.1.0")
-object MinHash extends DefaultParamsReadable[MinHash] {
+object MinHashLSH extends DefaultParamsReadable[MinHashLSH] {
   // A large prime smaller than sqrt(2^63 − 1)
   private[ml] val prime = 2038074743
--- End diff --

We typically use all caps for constants like these. I prefer `MinHashLSH.HASH_PRIME` or `MinHashLSH.PRIME_MODULUS`.





[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15852
  
**[Test build #68644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68644/consoleFull)** for PR 15852 at commit [`24e3617`](https://github.com/apache/spark/commit/24e36177e1eb24e7b250cb5356b47c0507e96d68).





[GitHub] spark pull request #15877: [SPARK-18429] [SQL] implement a new Aggregate for...

2016-11-14 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/15877#discussion_r87928329
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala ---
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription}
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.sketch.CountMinSketch
+
+/**
+ * This function returns a count-min sketch of a column with the given eps, confidence and seed.
+ * A count-min sketch is a probabilistic data structure used for summarizing streams of data in
+ * sub-linear space, which is useful for equality predicates and join size estimation.
+ *
+ * @param child child expression that can produce column value with `child.eval(inputRow)`
+ * @param epsExpression relative error, must be positive
+ * @param confidenceExpression confidence, must be positive and less than 1.0
+ * @param seedExpression random seed
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given eps,
+      confidence and seed. The result is an array of bytes, which should be deserialized to a
+      `CountMinSketch` before usage. `CountMinSketch` is useful for equality predicates and join
+      size estimation.
+  """)
+case class CountMinSketchAgg(
+    child: Expression,
+    epsExpression: Expression,
+    confidenceExpression: Expression,
+    seedExpression: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int) extends TypedImperativeAggregate[CountMinSketch] {
+
+  def this(
+      child: Expression,
+      epsExpression: Expression,
+      confidenceExpression: Expression,
+      seedExpression: Expression) = {
+    this(child, epsExpression, confidenceExpression, seedExpression, 0, 0)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!epsExpression.foldable || !confidenceExpression.foldable ||
+      !seedExpression.foldable) {
+      TypeCheckFailure(
+        "The eps, confidence or seed provided must be a literal or constant foldable")
+    } else if (epsExpression.eval() == null || confidenceExpression.eval() == null ||
+      seedExpression.eval() == null) {
+      TypeCheckFailure("The eps, confidence or seed provided should not be null")
+    } else {
+      // parameter validity will be checked in CountMinSketchImpl
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): CountMinSketch = {
+    val eps: Double = epsExpression.eval().asInstanceOf[Double]
+    val confidence: Double = confidenceExpression.eval().asInstanceOf[Double]
+    val seed: Int = seedExpression.eval().asInstanceOf[Int]
+    CountMinSketch.create(eps, confidence, seed)
+  }
+
+  override def update(buffer: CountMinSketch, input: InternalRow): Unit = {
+    val value = child.eval(input)
+    // ignore empty rows
+    if (value != null) {
+      // UTF8String is a spark sql type, while CountMinSketch accepts String type
+      buffer.add(if (value.isInstanceOf[UTF8String]) value.toString else value)
+    }
+  }
+
+  override 
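
For context, a minimal sketch of the round trip a caller of this aggregate would perform, assuming only the public `org.apache.spark.util.sketch.CountMinSketch` API (`create`, `add`, `writeTo`, `readFrom`, `estimateCount`); the values here are hypothetical:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.util.sketch.CountMinSketch

// Build a sketch the way createAggregationBuffer/update above do.
val sketch = CountMinSketch.create(0.01, 0.99, 42)  // eps, confidence, seed
Seq("a", "b", "a").foreach(x => sketch.add(x))

// Serialize to bytes, as the aggregate's result would be returned.
val out = new ByteArrayOutputStream()
sketch.writeTo(out)

// A consumer deserializes before querying; estimates may overcount, never undercount.
val restored = CountMinSketch.readFrom(new ByteArrayInputStream(out.toByteArray))
assert(restored.estimateCount("a") >= 2L)
```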

[GitHub] spark issue #15885: [SPARK-18440][Structured Streaming] Pass correct query e...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15885
  
**[Test build #68643 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68643/consoleFull)**
 for PR 15885 at commit 
[`337ef01`](https://github.com/apache/spark/commit/337ef01d06237b613d04011795b73c564b4b3e54).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15885: [SPARK-18440][Structured Streaming] Pass correct query e...

2016-11-14 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/15885
  
@marmbrus @rxin Can you take a look?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15885: [SPARK-18440][Structured Streaming] Pass correct ...

2016-11-14 Thread tdas
GitHub user tdas opened a pull request:

https://github.com/apache/spark/pull/15885

[SPARK-18440][Structured Streaming] Pass correct query execution to 
FileFormatWriter

## What changes were proposed in this pull request?

SPARK-18012 refactored the file write path in FileStreamSink using 
FileFormatWriter which always uses the default non-streaming QueryExecution to 
perform the writes. This is wrong for FileStreamSink, because the streaming 
QueryExecution (i.e. IncrementalExecution) should be used for correctly 
incrementalizing aggregation. With the addition of watermarks in SPARK-18124, the file
stream sink should logically support aggregation + watermark + append mode.
But it actually fails with
```
16:23:07.389 ERROR 
org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 
terminated with error
java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
timestamp#7: timestamp, interval 10 seconds
+- LocalRelation [timestamp#7]

at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
```

This PR fixes it by passing the correct query execution.

## How was this patch tested?
New unit test
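
For illustration, a minimal sketch (not the PR's actual test) of the aggregation + watermark + append query shape that hits the assertion above; `events` and the output paths are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.window

// `events` is assumed to be a streaming DataFrame with an event-time column "timestamp".
def startFileSinkQuery(events: DataFrame): Unit = {
  val counts = events
    .withWatermark("timestamp", "10 seconds")          // watermark from SPARK-18124
    .groupBy(window(events("timestamp"), "5 seconds")) // windowed aggregation
    .count()

  // Append mode to the file sink goes through FileFormatWriter, which must be
  // handed the streaming (incremental) query execution rather than the default one.
  counts.writeStream
    .format("parquet")
    .option("path", "/tmp/out")                 // hypothetical paths
    .option("checkpointLocation", "/tmp/chk")
    .outputMode("append")
    .start()
}
```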


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tdas/spark SPARK-18440

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15885.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15885


commit 337ef01d06237b613d04011795b73c564b4b3e54
Author: Tathagata Das 
Date:   2016-11-15T00:48:47Z

Pass correct query execution to FileFormatWriter




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15659
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15659
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68638/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15659
  
**[Test build #68638 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68638/consoleFull)**
 for PR 15659 at commit 
[`d753d80`](https://github.com/apache/spark/commit/d753d8094e5483e0da7577a85c0c2ed182de3e34).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...

2016-11-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15702


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/15702
  
I am merging this to master and 2.1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15702
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15702
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68637/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15702
  
**[Test build #68637 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68637/consoleFull)**
 for PR 15702 at commit 
[`87d8618`](https://github.com/apache/spark/commit/87d8618234a86d666a711a97080e2b014214b84a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15884: [WIP][SPARK-18433][SQL] Improve DataSource option keys t...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15884
  
**[Test build #68642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68642/consoleFull)**
 for PR 15884 at commit 
[`30eff08`](https://github.com/apache/spark/commit/30eff086159dabc8db7a46f6d4021c187d7fa4ed).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...

2016-11-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15883#discussion_r87924839
  
--- Diff: R/pkg/inst/tests/testthat/test_mllib.R ---
@@ -395,46 +396,56 @@ test_that("spark.mlp", {
   model2 <- read.ml(modelPath)
   summary2 <- summary(model2)
 
-  expect_equal(summary2$labelCount, 3)
+  expect_equal(summary2$numOfInputs, 4)
+  expect_equal(summary2$numOfOutputs, 3)
   expect_equal(summary2$layers, c(4, 5, 4, 3))
   expect_equal(length(summary2$weights), 64)
 
   unlink(modelPath)
 
   # Test default parameter
-  model <- spark.mlp(df, layers = c(4, 5, 4, 3))
+  model <- spark.mlp(df, label ~ features, layers = c(4, 5, 4, 3))
   mlpPredictions <- collect(select(predict(model, mlpTestDF), "prediction"))
-  expect_equal(head(mlpPredictions$prediction, 10), c(1, 1, 1, 1, 0, 1, 2, 2, 1, 0))
+  expect_equal(head(mlpPredictions$prediction, 10),
+               c("1.0", "1.0", "1.0", "1.0", "0.0", "1.0", "2.0", "2.0", "1.0", "0.0"))
 
   # Test illegal parameter
-  expect_error(spark.mlp(df, layers = NULL), "layers must be a integer vector with length > 1.")
-  expect_error(spark.mlp(df, layers = c()), "layers must be a integer vector with length > 1.")
-  expect_error(spark.mlp(df, layers = c(3)), "layers must be a integer vector with length > 1.")
+  expect_error(spark.mlp(df, label ~ features, layers = NULL),
+               "layers must be a integer vector with length > 1.")
+  expect_error(spark.mlp(df, label ~ features, layers = c()),
+               "layers must be a integer vector with length > 1.")
+  expect_error(spark.mlp(df, label ~ features, layers = c(3)),
--- End diff --

is there a case for formula != `label ~ features`?
link to my comment above 
https://github.com/apache/spark/pull/15883/files#r87923913


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15884: [WIP][SPARK-18433][SQL] Improve DataSource option...

2016-11-14 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/15884

[WIP][SPARK-18433][SQL] Improve DataSource option keys to be more 
case-insensitive

## What changes were proposed in this pull request?

This PR aims to make DataSource option keys more case-insensitive.

DataSource uses CaseInsensitiveMap in only part of its code path. For example, the
following fails to find `url`.

```scala
val df = spark.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
df.write.format("jdbc")
  .option("URL", url1)
  .option("dbtable", "TEST.SAVETEST")
  .options(properties.asScala)
  .save()
```

This PR makes DataSource options use CaseInsensitiveMap internally, and also makes
DataSource use CaseInsensitiveMap generally, except for `InMemoryFileIndex` and
`InsertIntoHadoopFsRelationCommand`. We cannot pass them a CaseInsensitiveMap because
they create new case-sensitive HadoopConfs by calling `newHadoopConfWithOptions(options)`
internally.
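
For illustration, a minimal sketch of the lookup behavior the change targets; this is a simplified stand-in, not Spark's actual `CaseInsensitiveMap`:

```scala
// Normalize keys to lower case on insertion and on lookup, so "URL", "url"
// and "Url" all resolve to the same option value.
class CaseInsensitiveOptions(options: Map[String, String]) {
  private val normalized = options.map { case (k, v) => k.toLowerCase -> v }
  def get(key: String): Option[String] = normalized.get(key.toLowerCase)
}

val opts = new CaseInsensitiveOptions(Map("URL" -> "jdbc:h2:mem:testdb"))
assert(opts.get("url").contains("jdbc:h2:mem:testdb"))
```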

## How was this patch tested?

Pass the Jenkins test with newly added test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-18433

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15884.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15884


commit 30eff086159dabc8db7a46f6d4021c187d7fa4ed
Author: Dongjoon Hyun 
Date:   2016-11-14T08:59:23Z

[SPARK-18433][SQL] Improve DataSource option keys to be more 
case-insensitive




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...

2016-11-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15883#discussion_r87923954
  
--- Diff: R/pkg/R/mllib.R ---
@@ -896,9 +898,10 @@ setMethod("summary", signature(object = "LogisticRegressionModel"),
 #' summary(savedModel)
 #' }
 #' @note spark.mlp since 2.1.0
--- End diff --

we are targeting 2.1.0 for this change, yes? Otherwise it is a breaking signature change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...

2016-11-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15883#discussion_r87923913
  
--- Diff: R/pkg/R/mllib.R ---
@@ -896,9 +898,10 @@ setMethod("summary", signature(object = "LogisticRegressionModel"),
 #' summary(savedModel)
 #' }
 #' @note spark.mlp since 2.1.0
-setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
+setMethod("spark.mlp", signature(data = "SparkDataFrame", formula = "formula"),
+          function(data, formula, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
--- End diff --

if this worked without `formula` before, should `formula` be optional then?
With this change it will be required.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15883: [SPARK-18438][SPARKR][ML] spark.mlp should suppor...

2016-11-14 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15883#discussion_r87923822
  
--- Diff: R/pkg/R/mllib.R ---
@@ -896,9 +898,10 @@ setMethod("summary", signature(object = "LogisticRegressionModel"),
 #' summary(savedModel)
 #' }
 #' @note spark.mlp since 2.1.0
-setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
+setMethod("spark.mlp", signature(data = "SparkDataFrame", formula = "formula"),
+          function(data, formula, layers, blockSize = 128, solver = "l-bfgs", maxIter = 100,
                    tol = 1E-6, stepSize = 0.03, seed = NULL, initialWeights = NULL) {
+            formula <- paste(deparse(formula), collapse = "")
--- End diff --

should this use paste0?
`paste0(deparse(formula), collapse = "")`
Otherwise you get one space between each term back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15868
  
@gatorsmile , I addressed all comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Add `maxConnections` JDBCOption

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15868
  
**[Test build #68641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68641/consoleFull)**
 for PR 15868 at commit 
[`3378b5e`](https://github.com/apache/spark/commit/3378b5e040041f1af1159d07e3d3b1ef47c6c8c1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15852
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15852
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68639/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15852
  
**[Test build #68639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68639/consoleFull)**
 for PR 15852 at commit 
[`efa7022`](https://github.com/apache/spark/commit/efa7022bcc2e8b169c7dd109d878439ac9f058a9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15878: [SPARK-18430] [SQL] Fixed Exception Messages when Hittin...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15878
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15878: [SPARK-18430] [SQL] Fixed Exception Messages when Hittin...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15878
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68636/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15878: [SPARK-18430] [SQL] Fixed Exception Messages when Hittin...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15878
  
**[Test build #68636 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68636/consoleFull)**
 for PR 15878 at commit 
[`918aa25`](https://github.com/apache/spark/commit/918aa2551300b2c5e1e29feb8a8c3315c623a146).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15704
  
Thank you, @hvanhovell !


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...

2016-11-14 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15704
  
LGTM - pending jenkins


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15704
  
**[Test build #68640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68640/consoleFull)**
 for PR 15704 at commit 
[`fab5682`](https://github.com/apache/spark/commit/fab5682ab4c78fc23f0d2db40ae6338e2d5dbab3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15868
  
Thank you, @gatorsmile ! I'll update this soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15874: [Spark-18408] API Improvements for LSH

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15874
  
Can you please add "[ML]" to the PR title?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15817: [SPARK-18366][PYSPARK] Add handleInvalid to Pyspark for ...

2016-11-14 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/15817
  
Can you please add "[ML]" to the PR description?  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...

2016-11-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15868#discussion_r87915156
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -667,7 +667,14 @@ object JdbcUtils extends Logging {
     val getConnection: () => Connection = createConnectionFactory(options)
     val batchSize = options.batchSize
     val isolationLevel = options.isolationLevel
-    df.foreachPartition(iterator => savePartition(
+    val numPartitions = options.numPartitions
+    val repartitionedDF =
+      if (numPartitions != null && numPartitions.toInt != df.rdd.getNumPartitions) {
--- End diff --

Increasing the number of partitions can improve the insert performance in 
some scenarios, I think. However, `repartition` is not cheap.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r87914133
  
--- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -243,7 +243,7 @@ partitionSpec
     ;
 
 partitionVal
-    : identifier (EQ constant)?
+    : expression
--- End diff --

It's removed now.
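
For reference, a sketch of the kind of statement the relaxed `partitionVal : expression` rule admits (hypothetical table and partition column):

```scala
// Comparators in the partition spec now parse, so a range of partitions
// can be dropped in one statement instead of enumerating exact values.
spark.sql("ALTER TABLE logs DROP IF EXISTS PARTITION (dt < '2016-11-01')")
```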


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...

2016-11-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15868#discussion_r87913599
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -667,7 +667,14 @@ object JdbcUtils extends Logging {
     val getConnection: () => Connection = createConnectionFactory(options)
     val batchSize = options.batchSize
     val isolationLevel = options.isolationLevel
-    df.foreachPartition(iterator => savePartition(
+    val numPartitions = options.numPartitions
+    val repartitionedDF =
+      if (numPartitions != null && numPartitions.toInt != df.rdd.getNumPartitions) {
+        df.repartition(numPartitions.toInt)
--- End diff --

Is it OK to use `coalesce` here?
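
A minimal sketch of the trade-off (hypothetical helper, not the PR's code): `coalesce(n)` only merges existing partitions and avoids a shuffle, so it is the cheaper way to cap the connection count, while `repartition(n)` always shuffles but can also increase the partition count.

```scala
import org.apache.spark.sql.DataFrame

// Cap the number of concurrent JDBC connections without paying for a shuffle.
def capPartitions(df: DataFrame, maxConnections: Int): DataFrame = {
  if (df.rdd.getNumPartitions > maxConnections) df.coalesce(maxConnections)
  else df  // already at or under the cap: leave the partitioning alone
}
```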


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15868
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15682: [SPARK-18169][SQL] Suppress warnings when dropping views...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15682
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68635/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15682: [SPARK-18169][SQL] Suppress warnings when dropping views...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15682
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15682: [SPARK-18169][SQL] Suppress warnings when dropping views...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15682
  
**[Test build #68635 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68635/consoleFull)**
 for PR 15682 at commit 
[`fef9981`](https://github.com/apache/spark/commit/fef9981ac140112c05f40c093b2174d1584caaf9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15868
  
**[Test build #68634 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68634/consoleFull)**
 for PR 15868 at commit 
[`93916b1`](https://github.com/apache/spark/commit/93916b13b902292c09a5bbe67ed083e3e891f4b0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15868
  
`numPartitions` might not be a good name for this purpose. How about `maxConnections`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r87910722
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -418,27 +419,58 @@ case class AlterTableRenamePartitionCommand(
  */
 case class AlterTableDropPartitionCommand(
     tableName: TableIdentifier,
-    specs: Seq[TablePartitionSpec],
+    specs: Seq[Expression],
     ifExists: Boolean,
     purge: Boolean)
-  extends RunnableCommand {
+  extends RunnableCommand with PredicateHelper {
+
+  private def isRangeComparison(expr: Expression): Boolean = {
+    expr.find(e => e.isInstanceOf[BinaryComparison] && !e.isInstanceOf[EqualTo]).isDefined
+  }
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
     val catalog = sparkSession.sessionState.catalog
     val table = catalog.getTableMetadata(tableName)
+    val resolver = sparkSession.sessionState.conf.resolver
     DDLUtils.verifyAlterTableType(catalog, table, isView = false)
     DDLUtils.verifyPartitionProviderIsHive(sparkSession, table, "ALTER TABLE DROP PARTITION")
 
-    val normalizedSpecs = specs.map { spec =>
-      PartitioningUtils.normalizePartitionSpec(
-        spec,
-        table.partitionColumnNames,
-        table.identifier.quotedString,
-        sparkSession.sessionState.conf.resolver)
+    specs.foreach { expr =>
+      expr.references.foreach { attr =>
+        if (!table.partitionColumnNames.exists(resolver(_, attr.name))) {
+          throw new AnalysisException(s"${attr.name} is not a valid partition column " +
+            s"in table ${table.identifier.quotedString}.")
+        }
+      }
     }
 
-    catalog.dropPartitions(
-      table.identifier, normalizedSpecs, ignoreIfNotExists = ifExists, purge = purge)
+    if (specs.exists(isRangeComparison)) {
+      val partitionSet = scala.collection.mutable.Set.empty[CatalogTablePartition]
+      specs.foreach { spec =>
+        val partitions = catalog.listPartitionsByFilter(table.identifier, Seq(spec))
+        if (partitions.nonEmpty) {
+          partitionSet ++= partitions
+        } else if (!ifExists) {
+          throw new AnalysisException(s"There is no partition for ${spec.sql}")
+        }
+      }
+      catalog.dropPartitions(table.identifier, partitionSet.map(_.spec).toSeq,
+        ignoreIfNotExists = ifExists, purge = purge)
+    } else {
+      val normalizedSpecs = specs.map { expr =>
+        val spec = splitConjunctivePredicates(expr).map {
+          case BinaryComparison(left, right) =>
--- End diff --

Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r87910682
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -215,8 +215,14 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] {
       if (overwrite.enabled) {
         val deletedPartitions = initialMatchingPartitions.toSet -- updatedPartitions
         if (deletedPartitions.nonEmpty) {
+          import org.apache.spark.sql.catalyst.expressions._
+          val expressions = deletedPartitions.map { specs =>
+            specs.map { case (key, value) =>
+              EqualTo(AttributeReference(key, StringType)(), Literal.create(value, StringType))
+            }.reduceLeft(org.apache.spark.sql.catalyst.expressions.And)
--- End diff --

Yep.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...

2016-11-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15868#discussion_r87910285
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
@@ -667,7 +667,14 @@ object JdbcUtils extends Logging {
     val getConnection: () => Connection = createConnectionFactory(options)
     val batchSize = options.batchSize
     val isolationLevel = options.isolationLevel
-    df.foreachPartition(iterator => savePartition(
+    val numPartitions = options.numPartitions
+    val repartitionedDF =
+      if (numPartitions != null && numPartitions.toInt != df.rdd.getNumPartitions) {
--- End diff --

Normally, based on my understanding, users only care about the maximum number of
connections. Thus, there is no need to repartition when `numPartitions.toInt >=
df.rdd.getNumPartitions`, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION should sup...

2016-11-14 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15704
  
Thank you for the review, again. I'll fix them soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15868: [SPARK-18413][SQL] Control the number of JDBC con...

2016-11-14 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15868#discussion_r87909790
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---
@@ -70,6 +70,9 @@ class JDBCOptions(
 }
   }
 
+  // the number of partitions
--- End diff --

This is not clear. The documentation needs an update, too:

http://spark.apache.org/docs/latest/sql-programming-guide.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14720: SPARK-12868: Allow Add jar to add jars from hdfs/...

2016-11-14 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/14720#discussion_r87908473
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala ---
@@ -856,6 +856,17 @@ class HiveQuerySuite extends HiveComparisonTest with BeforeAndAfter {
     sql("DROP TABLE alter1")
   }
 
+  test("SPARK-12868 ADD JAR FROM HDFS") {
+    val testJar = "hdfs://nn:8020/foo.jar"
+    // This should fail with unknown host, as it's just testing the URL parsing
+    // before SPARK-12868 it was failing with Malformed URI
+    val e = intercept[RuntimeException] {
--- End diff --

I think this test should be improved before merging this. Looking for a 
RuntimeException to validate that the Jar was registered is brittle and can 
easily pass when the registration doesn't actually work.
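
A sketch of a tighter assertion, assuming (hypothetically) that the URL-parsing failure surfaces as an unknown-host error; the exact message would need to be confirmed against an actual run:

```scala
val e = intercept[RuntimeException] {
  sql("ADD JAR hdfs://nn:8020/foo.jar")
}
// Pin the failure to host resolution instead of accepting any RuntimeException.
assert(e.getMessage.toLowerCase.contains("unknown host"))
```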


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15852: Spark-18187 [SQL] CompactibleFileStreamLog should not us...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15852
  
**[Test build #68639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68639/consoleFull)**
 for PR 15852 at commit 
[`efa7022`](https://github.com/apache/spark/commit/efa7022bcc2e8b169c7dd109d878439ac9f058a9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14638
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68633/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15659
  
**[Test build #68638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68638/consoleFull)**
 for PR 15659 at commit 
[`d753d80`](https://github.com/apache/spark/commit/d753d8094e5483e0da7577a85c0c2ed182de3e34).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14638
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14638: [SPARK-11374][SQL] Support `skip.header.line.count` opti...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14638
  
**[Test build #68633 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68633/consoleFull)**
 for PR 14638 at commit 
[`3c06aa6`](https://github.com/apache/spark/commit/3c06aa6679700b4d770889aa2f766a01f851ec43).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15868
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15868
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68630/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15840: [SPARK-18398][SQL] Fix nullabilities of MapObjects and o...

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15840
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68629/
Test PASSed.





[GitHub] spark issue #15868: [SPARK-18413][SQL] Control the number of JDBC connection...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15868
  
**[Test build #68630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68630/consoleFull)** for PR 15868 at commit [`c926012`](https://github.com/apache/spark/commit/c9260122ce47d90267e434dfbef75ee66f345547).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15840: [SPARK-18398][SQL] Fix nullabilities of MapObjects and o...

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15840
  
**[Test build #68629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68629/consoleFull)** for PR 15840 at commit [`ec0c55c`](https://github.com/apache/spark/commit/ec0c55c73c080f887c0914de7601698dc1c82c57).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15702
  
**[Test build #68637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68637/consoleFull)** for PR 15702 at commit [`87d8618`](https://github.com/apache/spark/commit/87d8618234a86d666a711a97080e2b014214b84a).





[GitHub] spark issue #15880: [SPARK-17913][SQL] compare long and string type column m...

2016-11-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15880
  
Below is a link to the `Implicit Data Conversion` section of the Oracle documentation: 
https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements002.htm

It clearly documents that implicit conversion behavior may change across releases and encourages users to use explicit casting instead.
> Algorithms for implicit conversion are subject to change across software releases and among Oracle products. Behavior of explicit conversions is more predictable.
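
To make the hazard concrete, here is a minimal sketch (not from this PR; the table name and values are invented, and the exact coercion rule depends on the Spark version): comparing a `LONG` column to a string literal may promote both sides to `DOUBLE`, so distinct longs above 2^53 can match the same literal, while an explicit cast keeps the comparison exact.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming a local SparkSession; all names are illustrative.
val spark = SparkSession.builder().master("local[*]").appName("implicit-cast-demo").getOrCreate()
import spark.implicits._

// 2^53 and 2^53 + 1: distinct as LONG, identical once coerced to DOUBLE.
Seq(9007199254740992L, 9007199254740993L).toDF("id").createOrReplaceTempView("t")

// Implicit conversion may promote both sides to DOUBLE and match BOTH rows.
spark.sql("SELECT id FROM t WHERE id = '9007199254740992'").show()

// Explicit casting keeps the comparison in LONG and matches exactly one row.
spark.sql("SELECT id FROM t WHERE id = CAST('9007199254740992' AS BIGINT)").show()
```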





[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/15702
  
jenkins test this please





[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15702
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68631/
Test FAILed.





[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15702
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15702
  
**[Test build #68631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68631/consoleFull)** for PR 15702 at commit [`87d8618`](https://github.com/apache/spark/commit/87d8618234a86d666a711a97080e2b014214b84a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-14 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r87892226
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala ---
@@ -215,8 +215,14 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] {
       if (overwrite.enabled) {
         val deletedPartitions = initialMatchingPartitions.toSet -- updatedPartitions
         if (deletedPartitions.nonEmpty) {
+          import org.apache.spark.sql.catalyst.expressions._
+          val expressions = deletedPartitions.map { specs =>
+            specs.map { case (key, value) =>
+              EqualTo(AttributeReference(key, StringType)(), Literal.create(value, StringType))
+            }.reduceLeft(org.apache.spark.sql.catalyst.expressions.And)
--- End diff --

just `And`?
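
For reference, a self-contained sketch of the pattern being reviewed (the partition spec is invented for illustration): one `EqualTo` per partition column, folded into a conjunction; with `expressions._` already imported, the bare `And` is enough as the fold function.

```scala
import org.apache.spark.sql.catalyst.expressions.{And, AttributeReference, EqualTo, Expression, Literal}
import org.apache.spark.sql.types.StringType

// Hypothetical partition spec, e.g. from ALTER TABLE ... PARTITION (year='2016', month='11').
val spec: Map[String, String] = Map("year" -> "2016", "month" -> "11")

// One `key = value` equality per partition column...
val equalities: Seq[Expression] = spec.toSeq.map { case (key, value) =>
  EqualTo(AttributeReference(key, StringType)(), Literal.create(value, StringType))
}

// ...folded into a single conjunction. Since the wildcard import is in scope,
// reduceLeft(And) works in place of the fully qualified name.
val partitionPredicate: Expression = equalities.reduceLeft(And)
```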




