[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97682532

Updated.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29407019

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---

```
@@ -29,12 +29,23 @@ package object analysis {
   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantics.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
```

--- End diff --

/cc @rxin, who may have concerns about this.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29407007

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq + +def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f) + +def freqItems: Seq[Any] = baseMap.keys.toSeq + } + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Seq[String], + support: Double): DataFrame = { +if (support < 1e-6) {

--- End diff --

I talked more with @mengxr. Let's do a default of 1e-2, and bound it at 1e-4 initially.
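To make the agreed bound concrete, here is a tiny standalone sketch; `checkSupport` is a hypothetical helper (not part of the patch) illustrating a default of 1e-2 with a hard lower bound of 1e-4:

```scala
// Hypothetical helper illustrating the discussed bounds: default 1e-2, minimum 1e-4.
def checkSupport(support: Double = 1e-2): Double = {
  require(support >= 1e-4, s"support ($support) must be greater than or equal to 1e-4.")
  support
}

checkSupport()         // OK: falls back to the 1e-2 default
checkSupport(5e-3)     // OK: above the lower bound
// checkSupport(1e-5)  // would throw IllegalArgumentException
```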
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97681270 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31391/ Test FAILed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406990

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq + +def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f) + +def freqItems: Seq[Any] = baseMap.keys.toSeq + } + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Seq[String], + support: Double): DataFrame = { +if (support < 1e-6) {

--- End diff --

Might as well do 1e-5, since 1e-6 can be super large and not very practical.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97681263 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406968

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq + +def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f) + +def freqItems: Seq[Any] = baseMap.keys.toSeq + } + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Seq[String], + support: Double): DataFrame = { +if (support < 1e-6) {

--- End diff --

```scala
require(support >= 1e-6, s"support ($support) must be greater than 1e-6.")
```
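For reference, the `add` method quoted above implements the Karp, Schenker, and Papadimitriou counting scheme. A self-contained, comment-annotated sketch of that logic, restated from the quoted diff:

```scala
import scala.collection.mutable.{Map => MutableMap}

// Tracks at most `size` candidate items; roughly size ~ 1 / support.
class FreqItemCounter(size: Int) extends Serializable {
  val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]

  def add(key: Any, count: Long): this.type = {
    if (baseMap.contains(key)) {
      baseMap(key) += count              // already a candidate: bump its count
    } else if (baseMap.size < size) {
      baseMap += key -> count            // room left: start tracking the new key
    } else {
      // No room: decrement every candidate by `count`, evicting any whose
      // count would drop to zero or below. Truly frequent items survive.
      baseMap.retain((k, v) => v > count)
      baseMap.transform((k, v) => v - count)
    }
    this
  }
}
```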
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406960

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---

```
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
```

--- End diff --

Yes, I think for the case-insensitive case we should normalize the table name and the attribute name.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406952

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---

```
@@ -29,12 +29,23 @@ package object analysis {
   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantics.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
```

--- End diff --

I'd like to keep the first `apply` as it was, because I don't want to impact a lot of existing code. I agree we should rename the second `apply` => `normalize`.
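A minimal sketch of the shape suggested here, keeping the two-argument `apply` as-is and renaming the one-argument method to `normalize` (the method bodies follow the quoted diff):

```scala
trait Resolver {
  def apply(a: String, b: String): Boolean  // equality check, unchanged
  def normalize(a: String): String          // renamed from the second apply
}

val caseInsensitiveResolution: Resolver = new Resolver {
  override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
  override def normalize(a: String): String = a.toLowerCase // as Hive does
}
```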
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406904

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---

@@ -0,0 +1,55 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.execution.stat.FrequentItems + +/** + * :: Experimental :: + * Statistic functions for [[DataFrame]]s. + */ +@Experimental +final class DataFrameStatFunctions private[sql](df: DataFrame) { + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + def freqItems(cols: Seq[String], support: Double): DataFrame = {

--- End diff --

Also make sure you add a test to the JavaDataFrameSuite.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406864

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---

@@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.types._ +import org.scalatest.FunSuite + +class DataFrameStatSuite extends FunSuite { + + val sqlCtx = TestSQLContext + + test("Frequent Items") { +def toLetter(i: Int): String = (i + 96).toChar.toString +val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i))) +val rowRdd = sqlCtx.sparkContext.parallelize(rows.map(v => Row(v._1, v._2)))

--- End diff --

This can be a lot simpler:

```scala
val df = sqlCtx.sparkContext.parallelize(rows.map(v => (v._1, v._2))).toDF("numbers", "letters")
```

and remove the struct type and stuff.
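Spelled out, the simplified setup could look like the following sketch; it assumes the `toDF` implicit from `sqlCtx.implicits`, and since `rows` already holds tuples, the intermediate `map` can be dropped as well:

```scala
import sqlCtx.implicits._

def toLetter(i: Int): String = (i + 96).toChar.toString
val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i)))

// One step from tuples to a named-column DataFrame; no Row, StructType, or schema needed.
val df = sqlCtx.sparkContext.parallelize(rows).toDF("numbers", "letters")
```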
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406886

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---

@@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.types._ +import org.scalatest.FunSuite + +class DataFrameStatSuite extends FunSuite { + + val sqlCtx = TestSQLContext + + test("Frequent Items") { +def toLetter(i: Int): String = (i + 96).toChar.toString +val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i))) +val rowRdd = sqlCtx.sparkContext.parallelize(rows.map(v => Row(v._1, v._2))) +val schema = StructType(StructField("numbers", IntegerType, false) :: +StructField("letters", StringType, false) :: Nil) +val df = sqlCtx.createDataFrame(rowRdd, schema) + +val results = df.stat.freqItems(Array("numbers", "letters"), 0.1) +val items = results.collect().head +assert(items.getSeq(0).contains(1),

--- End diff --

Use the `should` matchers from ScalaTest so it prints a better error message (you don't need your custom error message anymore).
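A small self-contained sketch of the suggestion, assuming the suite mixes in `Matchers`; the `should contain` form reports the offending collection automatically, so the hand-written message goes away:

```scala
import org.scalatest.{FunSuite, Matchers}

class ContainMatcherExample extends FunSuite with Matchers {
  test("should-style assertion") {
    val items: Seq[Any] = Seq(1, 2, 3)   // stand-in for items.getSeq(0)
    // Instead of: assert(items.contains(1), "custom error message")
    items should contain (1)             // failure output is generated by ScalaTest
  }
}
```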
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406841

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---

```
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
```

--- End diff --

In Hive:

```
hive> create table ddDD as select Key, valUe from src;
hive> desc extended ;
OK
key     string
value   string

Detailed Table Information  Table(tableName:, dbName:default, owner:hcheng, createTime:1430368423, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, comment:null), FieldSchema(name:value, type:string, comment:null)], location:file:/home/hcheng/warehouse/, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1430368423, numRows=0, totalSize=5824, rawDataSize=0}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.111 seconds, Fetched: 4 row(s)
```

You will see that both the table name and the column names are normalized (to lower case), so I think preserving the case is probably not necessary (the normalized name is what we want, isn't it?)
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680177 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/ Test PASSed.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680167

[Test build #31385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/consoleFull) for PR 5776 at commit [`553005a`](https://github.com/apache/spark/commit/553005a4e9aebcbb42c712efd833118235d205dc).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680176 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406784

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq

--- End diff --

You don't need this, do you? You can just operate on the map directly. I'm asking because I'm not sure whether `baseMap.toSeq` materializes a whole seq, which might be unnecessary.
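One way to act on this, as a sketch dropped into the `FreqItemCounter` class from the quoted diff: iterate `baseMap` directly so no intermediate `Seq` is built.

```scala
// Merge without going through toSeq: foreach on the map visits (key, count)
// pairs directly, so nothing extra is materialized.
def merge(other: FreqItemCounter): this.type = {
  other.baseMap.foreach { case (k, v) => add(k, v) }
  this
}
```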
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406755

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---

@@ -0,0 +1,55 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.execution.stat.FrequentItems + +/** + * :: Experimental :: + * Statistic functions for [[DataFrame]]s. + */ +@Experimental +final class DataFrameStatFunctions private[sql](df: DataFrame) { + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + *

--- End diff --

Make sure you document the range of `support` allowed.
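One possible wording for the scaladoc; the concrete interval is an assumption based on the 1e-4 lower bound and 1e-2 default discussed elsewhere in this thread:

```scala
/**
 * @param support The minimum frequency for an item to be considered `frequent`.
 *                Must be in the range [1e-4, 1]; the default is 1e-2.
 */
```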
[GitHub] spark pull request: SPARK-5112. Expose SizeEstimator as a develope...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3913#issuecomment-97679846

Pending an update, LGTM.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679720 [Test build #31392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/consoleFull) for PR 5799 at commit [`482e741`](https://github.com/apache/spark/commit/482e74180445d30d0b5a769cd5f9bd0e94abfd17).
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406677

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---

@@ -0,0 +1,55 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.execution.stat.FrequentItems + +/** + * :: Experimental :: + * Statistic functions for [[DataFrame]]s. + */ +@Experimental +final class DataFrameStatFunctions private[sql](df: DataFrame) { + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + def freqItems(cols: Seq[String], support: Double): DataFrame = {

--- End diff --

Don't forget to add the `java.util.List` ones.
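A sketch of what a `java.util.List` overload could look like on `DataFrameStatFunctions`, simply delegating to the Seq-based method (the conversion via `JavaConverters` is an assumption):

```scala
import scala.collection.JavaConverters._

def freqItems(cols: java.util.List[String], support: Double): DataFrame = {
  freqItems(cols.asScala.toSeq, support)  // delegate to the Scala Seq overload
}
```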
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679663 Merged build started.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679655 Merged build triggered.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679301 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31383/ Test PASSed.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679293

[Test build #31383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31383/consoleFull) for PR 5798 at commit [`c00f1ad`](https://github.com/apache/spark/commit/c00f1adda402f139b41d63fbe20dd9f1b4d6677e).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait Resolver`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679300 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679097 Merged build triggered.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679106 Merged build started.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406469

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---

```
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
```

--- End diff --

Supporting normalization is good. However, when the case is explicitly specified in the query, shouldn't we preserve the case of the query instead of normalizing it like this?
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5744#discussion_r29406389

--- Diff: python/pyspark/sql/dataframe.py ---

```
@@ -1166,7 +1166,7 @@ def __init__(self, jc):
     # container operators
     __contains__ = _bin_op("contains")
-    __getitem__ = _bin_op("getItem")
+    __getitem__ = _bin_op("apply")
```

--- End diff --

Can we add a unit test? You can add it in tests.py.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97678166 [Test build #31390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31390/consoleFull) for PR 5744 at commit [`51719b7`](https://github.com/apache/spark/commit/51719b7f612859219ba31658da4e9582c6ef2856).
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677743 Merged build started.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677728

@rxin, already done :)
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677730 Merged build triggered.
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user sven0726 opened a pull request: https://github.com/apache/spark/pull/5800

Merge pull request #1 from apache/master

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sven0726/spark-1 master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5800.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5800

commit 5da8c543ff542951c3fefe6e123b891f66edf4b6
Author: sven0726
Date: 2015-04-27T08:21:55Z

Merge pull request #1 from apache/master (2015-04-27, first merge)
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user sven0726 closed the pull request at: https://github.com/apache/spark/pull/5800
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677458

I will let @marmbrus take a look at this tomorrow. In the meantime, can you add the `apply` method and the Python `__getitem__` method? Thanks.
[GitHub] spark pull request: [SPARK-1406] Mllib pmml model export
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3062#issuecomment-97677239

LGTM. Merged into master. Thanks! @selvinsource I created SPARK-7272 for the user guide and assigned it to you. Let me know if you don't have time to do it. Btw, please submit your PMML evaluator code as a Spark package at spark-packages.org. Due to license issues, we cannot include them as example code in the Spark codebase. But PMML model scoring is quite important.
[GitHub] spark pull request: SPARK-5112. Expose SizeEstimator as a develope...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3913#issuecomment-97676897

Looking good, though this needs a rebase now.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97676693

[Test build #31380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31380/consoleFull) for PR 5744 at commit [`727fbc1`](https://github.com/apache/spark/commit/727fbc1a5891c481bd382dd30fee4452c5d341c8).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedGetField(child: Expression, fieldExpr: Expression) extends UnaryExpression`
  * `trait GetField extends UnaryExpression`
  * `case class SimpleStructGetField(child: Expression, field: StructField, ordinal: Int)`
  * `case class ArrayStructGetField(`
  * `abstract class OrdinalGetField extends GetField`
  * `case class ArrayOrdinalGetField(child: Expression, ordinal: Expression)`
  * `case class MapOrdinalGetField(child: Expression, ordinal: Expression)`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97676722 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97676725 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31380/ Test PASSed.
[GitHub] spark pull request: SPARK-6846 [WEBUI] Stage kill URL easy to acci...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5528#issuecomment-97676514 That's fine, in the sense that the endpoint returns no data. OK, so it works except for this proxying. Hm, surely the YARN proxy can pass on a POST. We'll have to look into this. Any wisdom from YARN folks about where to look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
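For anyone digging into this, one low-tech way to check whether the proxy forwards the method is to send a POST at the proxied UI and compare against hitting the UI port directly. A minimal probe sketch, assuming a standard ResourceManager on port 8088; the host, application id, and kill-endpoint path below are placeholders, not taken from the PR:

~~~scala
import java.net.{HttpURLConnection, URL}

// Hypothetical probe: send a bodyless POST through the YARN web proxy and
// report the status code. Substitute a real RM host, app id, and UI path.
object ProxyPostProbe {
  def main(args: Array[String]): Unit = {
    val url = new URL("http://resourcemanager:8088/proxy/application_1430000000000_0001" +
      "/stages/stage/kill/?id=0&terminate=true")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setInstanceFollowRedirects(false)
    // If the proxy forwards the POST, this prints the UI's own response code;
    // a 405 or a proxy error page suggests the method was rewritten or dropped.
    println(s"HTTP ${conn.getResponseCode} ${conn.getResponseMessage}")
    conn.disconnect()
  }
}
~~~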
[GitHub] spark pull request: [SPARK-1406] Mllib pmml model export
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3062 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675656 [Test build #31382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31382/consoleFull) for PR 5647 at commit [`0319821`](https://github.com/apache/spark/commit/0319821db7406f3cca359af5bc021d2f3fd92a17). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31382/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675700 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7120][SPARK-7121] Closure cleaner nesti...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5685#issuecomment-97672308 Hey Andrew, This is looking good. The code is quite dense, but as far as I can tell it is correct. I left some more surface-level comments; if you can get to those, I think it will be good to go. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Window Function Support...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/5604#issuecomment-97671446 I have been working on it for a few days. I believe that my update will cover most of your comments. Please hold your comments until my update. Thanks :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97671182 [Test build #31384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31384/consoleFull) for PR 5609 at commit [`8d3fc16`](https://github.com/apache/spark/commit/8d3fc16dd22c87fbf768951b64dabe7d121731ec). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97671186 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405622 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( --- End diff -- I think the implementation could be cleaner if we wrap `MutableMap[A, Long]` with a utility class:
~~~scala
class FreqItemCounter(size: Int) {
  def add(any: Any, count: Long = 1L): this.type
  def merge(other: FreqItemCounter): this.type = {
    other.toSeq.foreach { case (k, c) => add(k, c) }
    this
  }
  def items: Array[Any]
  def toSeq: Seq[(Any, Long)]
}
~~~
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
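For concreteness, a runnable version of that sketch, assuming the same add-or-decrement eviction rule the PR's seqOp already uses; member names follow the sketch, everything else is illustrative rather than the final API:

~~~scala
import scala.collection.mutable.{Map => MutableMap}

// Minimal runnable take on the suggested wrapper around MutableMap[Any, Long].
// `size` bounds the number of tracked keys; when the map is full, every count
// is decremented and keys that drop to zero or below are evicted.
class FreqItemCounter(size: Int) extends Serializable {
  val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]

  def add(key: Any, count: Long = 1L): this.type = {
    if (baseMap.contains(key)) {
      baseMap(key) += count
    } else if (baseMap.size < size) {
      baseMap += key -> count
    } else {
      // No room for a new key: decrement all counts instead of inserting.
      baseMap.retain((_, v) => v > count)
      baseMap.transform((_, v) => v - count)
    }
    this
  }

  def merge(other: FreqItemCounter): this.type = {
    other.toSeq.foreach { case (k, c) => add(k, c) }
    this
  }

  def items: Array[Any] = baseMap.keys.toArray

  def toSeq: Seq[(Any, Long)] = baseMap.toSeq
}
~~~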
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97671187 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31384/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Window Function Support...
Github user guowei2 commented on the pull request: https://github.com/apache/spark/pull/5604#issuecomment-97671084 @scwf I think it is a good choice, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670982 [Test build #31389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31389/consoleFull) for PR 5609 at commit [`8d3fc16`](https://github.com/apache/spark/commit/8d3fc16dd22c87fbf768951b64dabe7d121731ec). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670934 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670920 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405486 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,25 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { --- End diff -- aha! I like it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405470 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. 
+ */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt +val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long]) +val originalSchema = df.schema +val colInfo = cols.map { name => + val index = originalSchema.fieldIndex(name) + val dataType = originalSchema.fields(index) + (index, dataType.dataType) +} +val colIndices = colInfo.map(_._1) + +val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)( + seqOp = (counts, row) => { +var i = 0 +colIndices.foreach { index => + val thisMap = counts(i) + val key = row.get(index) + if (thisMap.contains(key)) { +thisMap(key) += 1 + } else { +if (thisMap.size < sizeOfMap) { + thisMap += key -> 1 +} else { + // TODO: Make this more efficient... A flatMap? + thisMap.retain((k, v) => v > 1) + thisMap.transform((k, v) => v - 1) +} + } + i += 1 +} +counts + }, + combOp = (baseCounts, counts) => { +var i = 0 +while (i < numCols) { + mergeCounts(baseCounts(i), counts(i), sizeOfMap) + i += 1 +} +baseCounts + } +) +// +val justItems = freqItems.map(m => m.keys.toSeq) +val resultRow = Row(justItems:_*) +// appe
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670648 Looks like the failed test case is flaky. Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97670499 [Test build #31388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31388/consoleFull) for PR 5392 at commit [`72304f0`](https://github.com/apache/spark/commit/72304f0150e74eb6432fc3141d3d5bc71bb93d61). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97670459 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97670448 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7120][SPARK-7121] Closure cleaner nesti...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5685#discussion_r29405393 --- Diff: core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite2.scala --- @@ -0,0 +1,562 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import java.io.NotSerializableException + +import scala.collection.mutable + +import org.scalatest.{BeforeAndAfterAll, FunSuite, PrivateMethodTester} + +import org.apache.spark.{SparkContext, SparkException} +import org.apache.spark.serializer.SerializerInstance + +/** + * Another test suite for the closure cleaner that is finer-grained. + * For tests involving end-to-end Spark jobs, see {{ClosureCleanerSuite}}. + */ +class ClosureCleanerSuite2 extends FunSuite with BeforeAndAfterAll with PrivateMethodTester { + + // Start a SparkContext so that the closure serializer is accessible + // We do not actually use this explicitly otherwise + private var sc: SparkContext = null + private var closureSerializer: SerializerInstance = null + + override def beforeAll(): Unit = { +sc = new SparkContext("local", "test") +closureSerializer = sc.env.closureSerializer.newInstance() + } + + override def afterAll(): Unit = { +sc.stop() +sc = null +closureSerializer = null + } + + // Some fields and methods to reference in inner closures later + private val someSerializableValue = 1 + private val someNonSerializableValue = new NonSerializable + private def someSerializableMethod() = 1 + private def someNonSerializableMethod() = new NonSerializable + + /** Assert that the given closure is serializable (or not). */ + private def assertSerializable(closure: AnyRef, serializable: Boolean): Unit = { +if (serializable) { + closureSerializer.serialize(closure) +} else { + intercept[NotSerializableException] { +closureSerializer.serialize(closure) + } +} + } + + /** + * Helper method for testing whether closure cleaning works as expected. + * This cleans the given closure twice, with and without transitive cleaning. + */ + private def testClean( + closure: AnyRef, + serializableBefore: Boolean, + serializableAfter: Boolean): Unit = { +testClean(closure, serializableBefore, serializableAfter, transitive = true) +testClean(closure, serializableBefore, serializableAfter, transitive = false) + } + + /** Helper method for testing whether closure cleaning works as expected. 
*/ + private def testClean( + closure: AnyRef, + serializableBefore: Boolean, + serializableAfter: Boolean, + transitive: Boolean): Unit = { +assertSerializable(closure, serializableBefore) +// If the resulting closure is not serializable even after +// cleaning, we expect ClosureCleaner to throw a SparkException +if (serializableAfter) { + ClosureCleaner.clean(closure, checkSerializable = true, transitive) +} else { + intercept[SparkException] { +ClosureCleaner.clean(closure, checkSerializable = true, transitive) + } +} +assertSerializable(closure, serializableAfter) + } + + /** + * Return the fields accessed by the given closure by class. + * This also optionally finds the fields transitively referenced through methods invocations. + */ + private def findAccessedFields( + closure: AnyRef, + outerClasses: Seq[Class[_]], + findTransitively: Boolean): Map[Class[_], Set[String]] = { +val fields = new mutable.HashMap[Class[_], mutable.Set[String]] +outerClasses.foreach { c => fields(c) = new mutable.HashSet[String] } +ClosureCleaner.getClassReader(closure.getClass) + .accept(new FieldAccessFinder(fields, findTransitively), 0) +fields.mapV
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405344 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. 
+ */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt +val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long]) +val originalSchema = df.schema +val colInfo = cols.map { name => + val index = originalSchema.fieldIndex(name) + val dataType = originalSchema.fields(index) + (index, dataType.dataType) +} +val colIndices = colInfo.map(_._1) + +val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)( + seqOp = (counts, row) => { +var i = 0 +colIndices.foreach { index => + val thisMap = counts(i) + val key = row.get(index) + if (thisMap.contains(key)) { +thisMap(key) += 1 --- End diff -- `1` -> `1L` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
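A tiny standalone illustration of the point: with a `MutableMap[Any, Long]`, the explicit `Long` literal keeps the arithmetic in `Long` rather than leaning on `Int`-to-`Long` widening (a minimal sketch, outside of Spark):

~~~scala
import scala.collection.mutable.{Map => MutableMap}

// Sketch of the suggested literal fix.
val counts: MutableMap[Any, Long] = MutableMap.empty
val key: Any = "item"
if (counts.contains(key)) counts(key) += 1L else counts += key -> 1L
~~~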
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670189 new screenshot ![g8](https://cloud.githubusercontent.com/assets/1000778/7407142/57adb2a0-eec3-11e4-8492-7304da7baaa3.png) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670140 [Test build #31387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31387/consoleFull) for PR 5533 at commit [`a2972e9`](https://github.com/apache/spark/commit/a2972e988053168b7299f40d16cd2896a6eb8110). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405307 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. 
+ */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt +val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long]) +val originalSchema = df.schema +val colInfo = cols.map { name => + val index = originalSchema.fieldIndex(name) + val dataType = originalSchema.fields(index) + (index, dataType.dataType) +} +val colIndices = colInfo.map(_._1) + +val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)( --- End diff -- `df.select(cols).rdd.aggregate` (then you don't need to skip elements) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
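A hedged sketch of what that suggestion looks like end to end, reusing the `FreqItemCounter` wrapper suggested earlier in this thread; names and signatures are illustrative, not the final code. Projecting the requested columns first means field `i` of each incoming `Row` is column `i`, so the loop needs no index bookkeeping:

~~~scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Illustrative only: select the requested columns up front, then aggregate.
def freqItemCounters(df: DataFrame, cols: Seq[String], sizeOfMap: Int): Array[FreqItemCounter] = {
  val numCols = cols.length
  val zero = Array.fill(numCols)(new FreqItemCounter(sizeOfMap))
  df.select(cols.map(col): _*).rdd.aggregate(zero)(
    seqOp = (counts, row) => {
      var i = 0
      while (i < numCols) {
        counts(i).add(row.get(i), 1L) // field i of the projected row is column i
        i += 1
      }
      counts
    },
    combOp = (base, other) => {
      var i = 0
      while (i < numCols) {
        base(i).merge(other(i))
        i += 1
      }
      base
    })
}
~~~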
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670038 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670024 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405254 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length --- End diff -- Check the range of `support`. Warn if it is too small (e.g., 1e-6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
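A sketch of the kind of guard being asked for, placed at the top of `singlePassFreqItems`; the `1e-6` threshold comes straight from the comment, while the `(0, 1)` bound and the wording are assumptions (`FrequentItems` already extends `Logging`, so `logWarning` is in scope):

~~~scala
// Assumed guard: reject impossible values, warn on extreme ones.
require(support > 0.0 && support < 1.0,
  s"support must be in (0, 1), but got $support")
if (support < 1e-6) {
  logWarning(s"support = $support is very small; tracking on the order of " +
    s"${(1.0 / support).toLong} items per column may use a lot of memory.")
}
~~~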
[GitHub] spark pull request: [SPARK-7267][SQL]Push down Project when it's c...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/5797#issuecomment-97669998 Retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405251 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], --- End diff -- If multiple columns are provided, shall we search the combination of them instead of each individually? For example, if I call
~~~scala
freqItems(Array("gender", "title"), 0.01)
~~~
I'm expecting the frequent combinations instead of each of them. The current implementation is more flexible because users can create a struct from multiple columns, and this allows finding frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call `freqItems(Array("gender", "title"))` @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
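A hedged illustration of the struct workaround mentioned above, using the `gender`/`title` columns from the example and some DataFrame `df` that has them; it assumes a `struct(...)` function like the one being added in SPARK-7133 elsewhere in this digest, and the `stat.freqItems` shape currently under review:

~~~scala
import org.apache.spark.sql.functions.struct

// Count frequent *combinations* by first packing the columns into one struct
// column, then running freqItems on that single column.
val withPair = df.withColumn("gender_title", struct(df("gender"), df("title")))
val freqPairs = withPair.stat.freqItems(Array("gender_title"), 0.01)
~~~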
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405257 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt --- End diff -- `math.floor` is not necessary: `(1.0 / support).toInt` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405043 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,37 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { + // scalastyle:on + +/** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. --- End diff -- Use DOI link: http://dx.doi.org/10.1145/762471.762473 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405009 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,25 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { --- End diff -- take a look at how we implemented na. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404981 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,25 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { --- End diff -- I think it looks like `df.stat$.MODULE$.freqItems()`. I don't know how we can otherwise make it `df.stat.freqItems` in scala. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
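The `na` pattern rxin points to above resolves exactly this: expose the group through a def that returns a small holder class rather than a nested object, so Scala still reads `df.stat.freqItems(...)` while Java callers get an ordinary `stat()` method instead of the generated `stat$`/`MODULE$` machinery. A toy, self-contained sketch with illustrative names:

~~~scala
// Toy analogue of the def-plus-holder-class shape (all names illustrative).
class StatFunctions(frame: Frame) {
  def freqItems(cols: Seq[String], support: Double): String =
    s"freqItems(${cols.mkString(", ")}) at support $support on $frame" // placeholder
}

class Frame {
  // Same Scala call site as a nested object, but an ordinary method for Java.
  def stat: StatFunctions = new StatFunctions(this)
  override def toString: String = "Frame"
}

object HolderDemo extends App {
  println(new Frame().stat.freqItems(Seq("gender", "title"), 0.01))
}
~~~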
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5770 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667923 I'm going to let @mengxr comment on the actual algorithm implementation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404857 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ml/FrequentItemsSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.ml + +import org.apache.spark.sql.Row +import org.apache.spark.sql.types._ +import org.scalatest.FunSuite + +import org.apache.spark.sql.test.TestSQLContext + +class FrequentItemsSuite extends FunSuite { --- End diff -- move this to .sql package, and call it DataFrameStatSuite? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404815 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,37 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { + // scalastyle:on + +/** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ +def freqItems(cols: Array[String], support: Double): DataFrame = { --- End diff -- in df we usually support List[String] and Seq[String]. This is one reason why we are using a separate namespace. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
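A hedged sketch of the overload set this implies, written as methods of the stat holder; the Scala `Seq` and Java `java.util.List` variants both funnel into the `Array`-based implementation (signatures assumed, not final):

~~~scala
import scala.collection.JavaConverters._

// Illustrative overloads for Scala and Java callers.
def freqItems(cols: Array[String], support: Double): DataFrame =
  FrequentItems.singlePassFreqItems(df, cols, support)

def freqItems(cols: Seq[String], support: Double): DataFrame =
  freqItems(cols.toArray, support)

def freqItems(cols: java.util.List[String], support: Double): DataFrame =
  freqItems(cols.asScala.toArray, support)
~~~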
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404786

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -1414,4 +1415,37 @@ class DataFrame private[sql](
     val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
     SerDeUtil.javaToPython(jrdd)
   }
+
+  /////////////////////////////////////////////////////////////////////////////
+  // Statistic functions
+  /////////////////////////////////////////////////////////////////////////////
+
+  // scalastyle:off
+  object stat {
+  // scalastyle:on
+
+    /**
+     * Finding frequent items for columns, possibly with false positives. Using the algorithm
+     * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
--- End diff --

We should give a proper citation rather than just a URL, since the URL might disappear.
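For example, a Scaladoc citation along these lines keeps the pointer useful even if the PDF link rots (the publication details here are my reading of the paper and should be double-checked):

```
/**
 * Finding frequent items for columns, possibly with false positives, using the
 * frequent element count algorithm of R. M. Karp, S. Shenker, and
 * C. H. Papadimitriou, "A Simple Algorithm for Finding Frequent Elements in
 * Streams and Bags", ACM Transactions on Database Systems 28(1), 2003.
 * DOI: http://dx.doi.org/10.1145/762471.762473
 */
```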
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667662

[Test build #31386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/consoleFull) for PR 5799 at commit [`8279d4d`](https://github.com/apache/spark/commit/8279d4d4cb09f78e2f8f83f9a3738101b940ed40).
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404754

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.ml
--- End diff --

Let's put this in execution.stat? It's annoying to add a top-level package, because we have rules that specifically exclude existing packages.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667532

Merged build started.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667521

Merged build triggered.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404717

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -1414,4 +1415,25 @@ class DataFrame private[sql](
     val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
     SerDeUtil.javaToPython(jrdd)
   }
+
+  /////////////////////////////////////////////////////////////////////////////
+  // Statistic functions
+  /////////////////////////////////////////////////////////////////////////////
+
+  // scalastyle:off
+  object stat {
--- End diff --

Does this work in Java?
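For the record, an inner `object` on a class is reachable from Java (it compiles to an accessor returning the module instance), but the module's mangled type name makes it awkward to use there. A `def` returning a small holder class is the more Java-friendly shape; a minimal sketch, with `DataFrameStatFunctions` as a hypothetical name:

```
// Sketch: expose the statistics API as a method instead of an inner object,
// so Java callers see an ordinary df.stat() returning a normal class.
class DataFrame private[sql]( /* ... */ ) {
  def stat: DataFrameStatFunctions = new DataFrameStatFunctions(this)
}
```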
[GitHub] spark pull request: [SPARK-7120][SPARK-7121] Closure cleaner nesti...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5685#discussion_r29404655

--- Diff: core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite2.scala ---
@@ -0,0 +1,562 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.io.NotSerializableException
+
+import scala.collection.mutable
+
+import org.scalatest.{BeforeAndAfterAll, FunSuite, PrivateMethodTester}
+
+import org.apache.spark.{SparkContext, SparkException}
+import org.apache.spark.serializer.SerializerInstance
+
+/**
+ * Another test suite for the closure cleaner that is finer-grained.
+ * For tests involving end-to-end Spark jobs, see {{ClosureCleanerSuite}}.
+ */
+class ClosureCleanerSuite2 extends FunSuite with BeforeAndAfterAll with PrivateMethodTester {
+
+  // Start a SparkContext so that the closure serializer is accessible
+  // We do not actually use this explicitly otherwise
+  private var sc: SparkContext = null
+  private var closureSerializer: SerializerInstance = null
+
+  override def beforeAll(): Unit = {
+    sc = new SparkContext("local", "test")
+    closureSerializer = sc.env.closureSerializer.newInstance()
+  }
+
+  override def afterAll(): Unit = {
+    sc.stop()
+    sc = null
+    closureSerializer = null
+  }
+
+  // Some fields and methods to reference in inner closures later
+  private val someSerializableValue = 1
+  private val someNonSerializableValue = new NonSerializable
+  private def someSerializableMethod() = 1
+  private def someNonSerializableMethod() = new NonSerializable
+
+  /** Assert that the given closure is serializable (or not). */
+  private def assertSerializable(closure: AnyRef, serializable: Boolean): Unit = {
+    if (serializable) {
+      closureSerializer.serialize(closure)
+    } else {
+      intercept[NotSerializableException] {
+        closureSerializer.serialize(closure)
+      }
+    }
+  }
+
+  /**
+   * Helper method for testing whether closure cleaning works as expected.
+   * This cleans the given closure twice, with and without transitive cleaning.
+   */
+  private def testClean(
--- End diff --

Maybe these should be called `verifyCleaning` or something, because they are really verifying behavior according to what you pass as `serializableBefore`/`serializableAfter`. Also, it would be good to document those parameters; it's not super clear what they mean.
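Something like the following documented signature would address both points; the doc text is a sketch of my reading of the two flags, not the PR's wording:

```
/**
 * Clean the given closure twice, with and without transitive cleaning, and
 * check serializability before and after.
 *
 * @param closure the closure to clean and verify
 * @param serializableBefore whether the closure should serialize successfully
 *                           even before cleaning; if false, serializing the
 *                           raw closure is expected to throw
 * @param serializableAfter whether the closure should serialize successfully
 *                          once it has been cleaned
 */
private def verifyCleaning(
    closure: AnyRef,
    serializableBefore: Boolean,
    serializableAfter: Boolean): Unit = {
  // clean with and without transitive cleaning, asserting at each step
}
```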
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
GitHub user brkyvz opened a pull request:

https://github.com/apache/spark/pull/5799

[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames

Finding frequent items, possibly with false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. Public API under:

```
df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
```

The output is a local DataFrame whose column names are the input column names with `-freqItems` appended. This is a single-pass algorithm that may return false positives, but no false negatives.

cc @mengxr @rxin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark freq-items

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5799.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5799

commit 3d82168544a29e7e1ae1326ab933db7e78a72dcc
Author: Burak Yavuz
Date: 2015-04-29T23:07:48Z

    made base implementation
    implemented frequent items
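To make the proposed API concrete, a minimal usage sketch against this signature (the toy data and the 0.4 threshold are made up; `TestSQLContext` is the helper already used by this PR's tests):

```
import org.apache.spark.sql.test.TestSQLContext
import TestSQLContext.implicits._

// Column "a" is 1 in three quarters of the rows, so 1 is a frequent item there.
val df = Seq.tabulate(100)(i => if (i % 4 != 0) (1, i) else (i, i)).toDF("a", "b")

// Search each column for items occurring in at least 40% of rows.
val freq = df.stat.freqItems(Array("a", "b"), support = 0.4)
freq.show()  // one row: an array of frequent items per input column
```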
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665645

Thanks. Merging in master.
[GitHub] spark pull request: [Build] Enable MiMa checks for SQL
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5727#issuecomment-97665454

LGTM.
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665312

[Test build #31377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31377/consoleFull) for PR 5770 at commit [`c68eaa7`](https://github.com/apache/spark/commit/c68eaa76eccdeb35d99dcbbedb30a9d6bed8eafd).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665315

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665316

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31377/
[GitHub] spark pull request: [SPARK-5521] PCA wrapper for easy transform ve...
Github user catap commented on a diff in the pull request: https://github.com/apache/spark/pull/4304#discussion_r29404295

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * A feature transformer that projects vectors to a low-dimensional space using PCA.
+ *
+ * @param k number of principal components
+ */
+class PCA(val k: Int) {
+  require(k >= 1)
+
+  /**
+   * Computes a [[PCAModel]] that contains the principal components of the input vectors.
+   *
+   * @param sources source vectors
+   */
+  def fit(sources: RDD[Vector]): PCAModel = {
+    require(k <= sources.first().size)
+
+    val mat = new RowMatrix(sources)
+    val pc = mat.computePrincipalComponents(k) match {
+      case dm: DenseMatrix =>
+        dm
+      /*
+       * Convert a sparse matrix to dense.
+       *
+       * RowMatrix.computePrincipalComponents always returns a dense matrix.
+       * The following code is a safeguard.
+       */
+      case sm: SparseMatrix =>
+        sm.toDense
+      case _ =>
+        throw new IllegalArgumentException("Unsupported matrix format. Expected " +
+          s"SparseMatrix or DenseMatrix. Instead got: ${mat.getClass}")
+    }
+    new PCAModel(k, pc)
+  }
+
+  /**
+   * Computes a [[PCAModel]] that contains the principal components of the input vectors.
+   *
+   * @param sources source vectors
+   */
+  def fit(sources: JavaRDD[Vector]): PCAModel =
+    fit(sources.rdd)
+}
+
+/**
+ * Model fitted by [[PCA]] that can project vectors to a low-dimensional space using PCA.
+ *
+ * @param k number of principal components
+ * @param pc a matrix of principal components
+ */
+class PCAModel private[mllib](val k: Int, val pc: DenseMatrix) extends VectorTransformer {
+  /**
+   * Transform a vector by the computed principal components.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector =
+    vector match {
+      case dv: DenseVector =>
+        pc.transpose.multiply(dv)
+      case sv: SparseVector =>
+        /* SparseVector -> single row SparseMatrix */
+        val rowIndices = new Array[Int](sv.indices.length)
--- End diff --

Thanks! Here I did the matrix transpose by hand :)
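For readers following along, the hand-rolled transpose amounts to repacking the `SparseVector` as a one-row `SparseMatrix` in CSC form; a sketch of the idea under that assumption (not the PR's exact code):

```
import org.apache.spark.mllib.linalg.{SparseMatrix, SparseVector}

// A 1 x n sparse matrix in CSC form: every row index is 0, and each nonzero
// column contributes exactly one entry between colPtrs(j) and colPtrs(j + 1).
def toSingleRowMatrix(sv: SparseVector): SparseMatrix = {
  val colPtrs = new Array[Int](sv.size + 1)
  var k = 0  // position within sv.indices (assumed sorted)
  for (j <- 0 until sv.size) {
    if (k < sv.indices.length && sv.indices(k) == j) k += 1
    colPtrs(j + 1) = k
  }
  val rowIndices = Array.fill(sv.values.length)(0)
  new SparseMatrix(1, sv.size, colPtrs, rowIndices, sv.values)
}
```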
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29403917

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")

     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
--- End diff --

So the output schema should be lowercase?
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97663859

[Test build #31385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/consoleFull) for PR 5776 at commit [`553005a`](https://github.com/apache/spark/commit/553005a4e9aebcbb42c712efd833118235d205dc).
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29403863

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---
@@ -29,12 +29,23 @@ package object analysis {

   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantic.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
--- End diff --

How about renaming the first `apply` to `resolve` and the second to `normalize`?
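Concretely, the suggestion would read as follows (a sketch of the rename only, based on the diff above):

```
trait Resolver {
  /** True if `a` refers to the same entity as `b`. */
  def resolve(a: String, b: String): Boolean
  /** Normalize `a` according to this resolver's semantics. */
  def normalize(a: String): String
}

val caseInsensitiveResolution = new Resolver {
  override def resolve(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
  override def normalize(a: String): String = a.toLowerCase // as Hive does
}
```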
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97663305

Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97663531

Merged build started.