[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97682532

Updated.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29407019

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---

```
@@ -29,12 +29,23 @@ package object analysis {
   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantics.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
```

--- End diff --

/cc @rxin, who may have concerns about this.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29407007

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq + +def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f) + +def freqItems: Seq[Any] = baseMap.keys.toSeq + } + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Seq[String], + support: Double): DataFrame = { +if (support < 1e-6) {

--- End diff --

I talked more with @mengxr. Let's do a default of 1e-2, and bound it at 1e-4 initially.
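To make the agreed bound concrete, here is a tiny standalone sketch; `checkSupport` is a hypothetical helper (not part of the patch) illustrating a default of 1e-2 with a hard lower bound of 1e-4:

```scala
// Hypothetical helper illustrating the discussed bounds: default 1e-2, minimum 1e-4.
def checkSupport(support: Double = 1e-2): Double = {
  require(support >= 1e-4, s"support ($support) must be greater than or equal to 1e-4.")
  support
}

checkSupport()         // OK: falls back to the 1e-2 default
checkSupport(5e-3)     // OK: above the lower bound
// checkSupport(1e-5)  // would throw IllegalArgumentException
```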
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97681270 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31391/ Test FAILed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406990

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq + +def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f) + +def freqItems: Seq[Any] = baseMap.keys.toSeq + } + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Seq[String], + support: Double): DataFrame = { +if (support < 1e-6) {

--- End diff --

Might as well do 1e-5, since 1e-6 can be super large and not very practical.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97681263 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406968

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq + +def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f) + +def freqItems: Seq[Any] = baseMap.keys.toSeq + } + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Seq[String], + support: Double): DataFrame = { +if (support < 1e-6) {

--- End diff --

```scala
require(support >= 1e-6, s"support ($support) must be greater than 1e-6.")
```
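For reference, the `add` method quoted above implements the Karp, Schenker, and Papadimitriou counting scheme. A self-contained, comment-annotated sketch of that logic, restated from the quoted diff:

```scala
import scala.collection.mutable.{Map => MutableMap}

// Tracks at most `size` candidate items; roughly size ~ 1 / support.
class FreqItemCounter(size: Int) extends Serializable {
  val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]

  def add(key: Any, count: Long): this.type = {
    if (baseMap.contains(key)) {
      baseMap(key) += count              // already a candidate: bump its count
    } else if (baseMap.size < size) {
      baseMap += key -> count            // room left: start tracking the new key
    } else {
      // No room: decrement every candidate by `count`, evicting any whose
      // count would drop to zero or below. Truly frequent items survive.
      baseMap.retain((k, v) => v > count)
      baseMap.transform((k, v) => v - count)
    }
    this
  }
}
```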
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406960

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---

```
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
```

--- End diff --

Yes, I think for the case-insensitive case we should normalize the table name and the attribute name.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406952

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---

```
@@ -29,12 +29,23 @@ package object analysis {
   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantics.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
```

--- End diff --

I'd like to keep the first `apply` as it was, because I don't want to impact a lot of existing code. I agree we should rename the second `apply` => `normalize`.
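A minimal sketch of the shape suggested here, keeping the two-argument `apply` as-is and renaming the one-argument method to `normalize` (the method bodies follow the quoted diff):

```scala
trait Resolver {
  def apply(a: String, b: String): Boolean  // equality check, unchanged
  def normalize(a: String): String          // renamed from the second apply
}

val caseInsensitiveResolution: Resolver = new Resolver {
  override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
  override def normalize(a: String): String = a.toLowerCase // as Hive does
}
```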
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406904

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---

@@ -0,0 +1,55 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.execution.stat.FrequentItems + +/** + * :: Experimental :: + * Statistic functions for [[DataFrame]]s. + */ +@Experimental +final class DataFrameStatFunctions private[sql](df: DataFrame) { + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + def freqItems(cols: Seq[String], support: Double): DataFrame = {

--- End diff --

Also make sure you add a test to the JavaDataFrameSuite.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406864

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---

@@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.types._ +import org.scalatest.FunSuite + +class DataFrameStatSuite extends FunSuite { + + val sqlCtx = TestSQLContext + + test("Frequent Items") { +def toLetter(i: Int): String = (i + 96).toChar.toString +val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i))) +val rowRdd = sqlCtx.sparkContext.parallelize(rows.map(v => Row(v._1, v._2)))

--- End diff --

This can be a lot simpler:

```scala
val df = sqlCtx.sparkContext.parallelize(rows.map(v => (v._1, v._2))).toDF("numbers", "letters")
```

and remove the struct type and stuff.
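Spelled out, the simplified setup could look like the following sketch; it assumes the `toDF` implicit from `sqlCtx.implicits`, and since `rows` already holds tuples, the intermediate `map` can be dropped as well:

```scala
import sqlCtx.implicits._

def toLetter(i: Int): String = (i + 96).toChar.toString
val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i)))

// One step from tuples to a named-column DataFrame; no Row, StructType, or schema needed.
val df = sqlCtx.sparkContext.parallelize(rows).toDF("numbers", "letters")
```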
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406886

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---

@@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.test.TestSQLContext +import org.apache.spark.sql.types._ +import org.scalatest.FunSuite + +class DataFrameStatSuite extends FunSuite { + + val sqlCtx = TestSQLContext + + test("Frequent Items") { +def toLetter(i: Int): String = (i + 96).toChar.toString +val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i))) +val rowRdd = sqlCtx.sparkContext.parallelize(rows.map(v => Row(v._1, v._2))) +val schema = StructType(StructField("numbers", IntegerType, false) :: +StructField("letters", StringType, false) :: Nil) +val df = sqlCtx.createDataFrame(rowRdd, schema) + +val results = df.stat.freqItems(Array("numbers", "letters"), 0.1) +val items = results.collect().head +assert(items.getSeq(0).contains(1),

--- End diff --

Use the `should` matchers from ScalaTest so it prints a better error message (you don't need your custom error message anymore).
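A small self-contained sketch of the suggestion, assuming the suite mixes in `Matchers`; the `should contain` form reports the offending collection automatically, so the hand-written message goes away:

```scala
import org.scalatest.{FunSuite, Matchers}

class ContainMatcherExample extends FunSuite with Matchers {
  test("should-style assertion") {
    val items: Seq[Any] = Seq(1, 2, 3)   // stand-in for items.getSeq(0)
    // Instead of: assert(items.contains(1), "custom error message")
    items should contain (1)             // failure output is generated by ScalaTest
  }
}
```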
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406841

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---

```
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
```

--- End diff --

In Hive:

```
hive> create table ddDD as select Key, valUe from src;
hive> desc extended ;
OK
key     string
value   string

Detailed Table Information  Table(tableName:, dbName:default, owner:hcheng, createTime:1430368423, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, comment:null), FieldSchema(name:value, type:string, comment:null)], location:file:/home/hcheng/warehouse/, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1430368423, numRows=0, totalSize=5824, rawDataSize=0}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.111 seconds, Fetched: 4 row(s)
```

You will see that both the table name and the column names are normalized (to lower case), so I think preserving the case is probably not necessary (the normalized name is what we want, isn't it?)
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680177 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/ Test PASSed.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680167

[Test build #31385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/consoleFull) for PR 5776 at commit [`553005a`](https://github.com/apache/spark/commit/553005a4e9aebcbb42c712efd833118235d205dc).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680176 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406784

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---

@@ -0,0 +1,127 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.execution.stat + +import org.apache.spark.Logging +import org.apache.spark.sql.{Column, DataFrame, Row} +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{ArrayType, StructField, StructType} + +import scala.collection.mutable.{Map => MutableMap} + +private[sql] object FrequentItems extends Logging { + + /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */ + private class FreqItemCounter(size: Int) extends Serializable { +val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long] + +/** + * Add a new example to the counts if it exists, otherwise deduct the count + * from existing items. + */ +def add(key: Any, count: Long): this.type = { + if (baseMap.contains(key)) { +baseMap(key) += count + } else { +if (baseMap.size < size) { + baseMap += key -> count +} else { + // TODO: Make this more efficient... A flatMap? + baseMap.retain((k, v) => v > count) + baseMap.transform((k, v) => v - count) +} + } + this +} + +/** + * Merge two maps of counts. + * @param other The map containing the counts for that partition + */ +def merge(other: FreqItemCounter): this.type = { + other.toSeq.foreach { case (k, v) => +add(k, v) + } + this +} + +def toSeq: Seq[(Any, Long)] = baseMap.toSeq

--- End diff --

You don't need this, do you? You can just operate on the map directly. I'm asking because I'm not sure whether `baseMap.toSeq` materializes a whole seq, which might be unnecessary.
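One way to act on this, as a sketch dropped into the `FreqItemCounter` class from the quoted diff: iterate `baseMap` directly so no intermediate `Seq` is built.

```scala
// Merge without going through toSeq: foreach on the map visits (key, count)
// pairs directly, so nothing extra is materialized.
def merge(other: FreqItemCounter): this.type = {
  other.baseMap.foreach { case (k, v) => add(k, v) }
  this
}
```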
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406755

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---

@@ -0,0 +1,55 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.execution.stat.FrequentItems + +/** + * :: Experimental :: + * Statistic functions for [[DataFrame]]s. + */ +@Experimental +final class DataFrameStatFunctions private[sql](df: DataFrame) { + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + *

--- End diff --

Make sure you document the range of `support` allowed.
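One possible wording for the scaladoc; the concrete interval is an assumption based on the 1e-4 lower bound and 1e-2 default discussed elsewhere in this thread:

```scala
/**
 * @param support The minimum frequency for an item to be considered `frequent`.
 *                Must be in the range [1e-4, 1]; the default is 1e-2.
 */
```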
[GitHub] spark pull request: SPARK-5112. Expose SizeEstimator as a develope...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3913#issuecomment-97679846

Pending an update, LGTM.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679720 [Test build #31392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/consoleFull) for PR 5799 at commit [`482e741`](https://github.com/apache/spark/commit/482e74180445d30d0b5a769cd5f9bd0e94abfd17).
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406677

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---

@@ -0,0 +1,55 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import org.apache.spark.annotation.Experimental +import org.apache.spark.sql.execution.stat.FrequentItems + +/** + * :: Experimental :: + * Statistic functions for [[DataFrame]]s. + */ +@Experimental +final class DataFrameStatFunctions private[sql](df: DataFrame) { + + /** + * Finding frequent items for columns, possibly with false positives. Using the + * frequent element count algorithm described in + * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]]. + * + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + def freqItems(cols: Seq[String], support: Double): DataFrame = {

--- End diff --

Don't forget to add the `java.util.List` ones.
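A sketch of what a `java.util.List` overload could look like on `DataFrameStatFunctions`, simply delegating to the Seq-based method (the conversion via `JavaConverters` is an assumption):

```scala
import scala.collection.JavaConverters._

def freqItems(cols: java.util.List[String], support: Double): DataFrame = {
  freqItems(cols.asScala.toSeq, support)  // delegate to the Scala Seq overload
}
```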
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679663 Merged build started.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679655 Merged build triggered.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679301 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31383/ Test PASSed.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679293

[Test build #31383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31383/consoleFull) for PR 5798 at commit [`c00f1ad`](https://github.com/apache/spark/commit/c00f1adda402f139b41d63fbe20dd9f1b4d6677e).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait Resolver`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679300 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679097 Merged build triggered.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679106 Merged build started.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406469

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---

```
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
```

--- End diff --

Supporting normalization is good. However, when the case is explicitly specified in the query, shouldn't we preserve the case of the query instead of normalizing it like this?
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5744#discussion_r29406389

--- Diff: python/pyspark/sql/dataframe.py ---

```
@@ -1166,7 +1166,7 @@ def __init__(self, jc):
     # container operators
     __contains__ = _bin_op("contains")
-    __getitem__ = _bin_op("getItem")
+    __getitem__ = _bin_op("apply")
```

--- End diff --

Can we add a unit test? You can add it in tests.py.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97678166 [Test build #31390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31390/consoleFull) for PR 5744 at commit [`51719b7`](https://github.com/apache/spark/commit/51719b7f612859219ba31658da4e9582c6ef2856).
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677743 Merged build started.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677728

@rxin, already done :)
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677730 Merged build triggered.
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user sven0726 opened a pull request: https://github.com/apache/spark/pull/5800

Merge pull request #1 from apache/master

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sven0726/spark-1 master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5800.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5800

commit 5da8c543ff542951c3fefe6e123b891f66edf4b6
Author: sven0726
Date: 2015-04-27T08:21:55Z

Merge pull request #1 from apache/master (2015-04-27, first merge)
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user sven0726 closed the pull request at: https://github.com/apache/spark/pull/5800
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677458

I will let @marmbrus take a look at this tomorrow. In the meantime, can you add the `apply` method and the Python `__getitem__` method? Thanks.
[GitHub] spark pull request: [SPARK-1406] Mllib pmml model export
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3062#issuecomment-97677239

LGTM. Merged into master. Thanks! @selvinsource I created SPARK-7272 for the user guide and assigned it to you. Let me know if you don't have time to do it. Btw, please submit your PMML evaluator code as a Spark package at spark-packages.org. Due to license issues, we cannot include them as example code in the Spark codebase. But PMML model scoring is quite important.
[GitHub] spark pull request: SPARK-5112. Expose SizeEstimator as a develope...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3913#issuecomment-97676897

Looking good, though this needs a rebase now.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97676693

[Test build #31380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31380/consoleFull) for PR 5744 at commit [`727fbc1`](https://github.com/apache/spark/commit/727fbc1a5891c481bd382dd30fee4452c5d341c8).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedGetField(child: Expression, fieldExpr: Expression) extends UnaryExpression`
  * `trait GetField extends UnaryExpression`
  * `case class SimpleStructGetField(child: Expression, field: StructField, ordinal: Int)`
  * `case class ArrayStructGetField(`
  * `abstract class OrdinalGetField extends GetField`
  * `case class ArrayOrdinalGetField(child: Expression, ordinal: Expression)`
  * `case class MapOrdinalGetField(child: Expression, ordinal: Expression)`
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97676722 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97676725 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31380/ Test PASSed.
[GitHub] spark pull request: SPARK-6846 [WEBUI] Stage kill URL easy to acci...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5528#issuecomment-97676514 That's fine, in the sense that the endpoint returns no data. OK, so it works except for this proxying. Hm, surely the YARN proxy can pass on a POST. We'll have to look into this. Any wisdom from YARN folks about where to look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
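For anyone digging into this, one low-tech way to check whether the proxy forwards the method is to send a POST at the proxied UI and compare against hitting the UI port directly. A minimal probe sketch, assuming a standard ResourceManager on port 8088; the host, application id, and kill-endpoint path below are placeholders, not taken from the PR:

~~~scala
import java.net.{HttpURLConnection, URL}

// Hypothetical probe: send a bodyless POST through the YARN web proxy and
// report the status code. Substitute a real RM host, app id, and UI path.
object ProxyPostProbe {
  def main(args: Array[String]): Unit = {
    val url = new URL("http://resourcemanager:8088/proxy/application_1430000000000_0001" +
      "/stages/stage/kill/?id=0&terminate=true")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setInstanceFollowRedirects(false)
    // If the proxy forwards the POST, this prints the UI's own response code;
    // a 405 or a proxy error page suggests the method was rewritten or dropped.
    println(s"HTTP ${conn.getResponseCode} ${conn.getResponseMessage}")
    conn.disconnect()
  }
}
~~~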
[GitHub] spark pull request: [SPARK-1406] Mllib pmml model export
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3062 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675656 [Test build #31382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31382/consoleFull) for PR 5647 at commit [`0319821`](https://github.com/apache/spark/commit/0319821db7406f3cca359af5bc021d2f3fd92a17). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31382/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675700 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7120][SPARK-7121] Closure cleaner nesti...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/5685#issuecomment-97672308 Hey Andrew, This is looking good. The code is quite dense, but as far as I can tell it is correct. I left some more surface-level comments; if you can get to those, I think it will be good to go. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Window Function Support...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/5604#issuecomment-97671446 I have been working on it for a few days. I believe that my update will cover most of your comments. Please hold your comments until my update. Thanks :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97671182 [Test build #31384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31384/consoleFull) for PR 5609 at commit [`8d3fc16`](https://github.com/apache/spark/commit/8d3fc16dd22c87fbf768951b64dabe7d121731ec). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97671186 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405622 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( --- End diff -- I think the implementation could be cleaner if we wrap `MutableMap[A, Long]` with a utility class:
~~~scala
class FreqItemCounter(size: Int) {
  def add(any: Any, count: Long = 1L): this.type
  def merge(other: FreqItemCounter): this.type = {
    other.toSeq.foreach { case (k, c) => add(k, c) }
    this
  }
  def items: Array[Any]
  def toSeq: Seq[(Any, Long)]
}
~~~
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
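For concreteness, a runnable version of that sketch, assuming the same add-or-decrement eviction rule the PR's seqOp already uses; member names follow the sketch, everything else is illustrative rather than the final API:

~~~scala
import scala.collection.mutable.{Map => MutableMap}

// Minimal runnable take on the suggested wrapper around MutableMap[Any, Long].
// `size` bounds the number of tracked keys; when the map is full, every count
// is decremented and keys that drop to zero or below are evicted.
class FreqItemCounter(size: Int) extends Serializable {
  val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]

  def add(key: Any, count: Long = 1L): this.type = {
    if (baseMap.contains(key)) {
      baseMap(key) += count
    } else if (baseMap.size < size) {
      baseMap += key -> count
    } else {
      // No room for a new key: decrement all counts instead of inserting.
      baseMap.retain((_, v) => v > count)
      baseMap.transform((_, v) => v - count)
    }
    this
  }

  def merge(other: FreqItemCounter): this.type = {
    other.toSeq.foreach { case (k, c) => add(k, c) }
    this
  }

  def items: Array[Any] = baseMap.keys.toArray

  def toSeq: Seq[(Any, Long)] = baseMap.toSeq
}
~~~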
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97671187 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31384/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Window Function Support...
Github user guowei2 commented on the pull request: https://github.com/apache/spark/pull/5604#issuecomment-97671084 @scwf I think it is a good choice, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670982 [Test build #31389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31389/consoleFull) for PR 5609 at commit [`8d3fc16`](https://github.com/apache/spark/commit/8d3fc16dd22c87fbf768951b64dabe7d121731ec). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670934 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670920 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405486 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,25 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { --- End diff -- aha! I like it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405470 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. 
+ */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt +val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long]) +val originalSchema = df.schema +val colInfo = cols.map { name => + val index = originalSchema.fieldIndex(name) + val dataType = originalSchema.fields(index) + (index, dataType.dataType) +} +val colIndices = colInfo.map(_._1) + +val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)( + seqOp = (counts, row) => { +var i = 0 +colIndices.foreach { index => + val thisMap = counts(i) + val key = row.get(index) + if (thisMap.contains(key)) { +thisMap(key) += 1 + } else { +if (thisMap.size < sizeOfMap) { + thisMap += key -> 1 +} else { + // TODO: Make this more efficient... A flatMap? + thisMap.retain((k, v) => v > 1) + thisMap.transform((k, v) => v - 1) +} + } + i += 1 +} +counts + }, + combOp = (baseCounts, counts) => { +var i = 0 +while (i < numCols) { + mergeCounts(baseCounts(i), counts(i), sizeOfMap) + i += 1 +} +baseCounts + } +) +// +val justItems = freqItems.map(m => m.keys.toSeq) +val resultRow = Row(justItems:_*) +// appe
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97670648 Looks like the failed test case is flaky. Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97670499 [Test build #31388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31388/consoleFull) for PR 5392 at commit [`72304f0`](https://github.com/apache/spark/commit/72304f0150e74eb6432fc3141d3d5bc71bb93d61). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97670459 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97670448 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7120][SPARK-7121] Closure cleaner nesti...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5685#discussion_r29405393 --- Diff: core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite2.scala --- @@ -0,0 +1,562 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import java.io.NotSerializableException + +import scala.collection.mutable + +import org.scalatest.{BeforeAndAfterAll, FunSuite, PrivateMethodTester} + +import org.apache.spark.{SparkContext, SparkException} +import org.apache.spark.serializer.SerializerInstance + +/** + * Another test suite for the closure cleaner that is finer-grained. + * For tests involving end-to-end Spark jobs, see {{ClosureCleanerSuite}}. + */ +class ClosureCleanerSuite2 extends FunSuite with BeforeAndAfterAll with PrivateMethodTester { + + // Start a SparkContext so that the closure serializer is accessible + // We do not actually use this explicitly otherwise + private var sc: SparkContext = null + private var closureSerializer: SerializerInstance = null + + override def beforeAll(): Unit = { +sc = new SparkContext("local", "test") +closureSerializer = sc.env.closureSerializer.newInstance() + } + + override def afterAll(): Unit = { +sc.stop() +sc = null +closureSerializer = null + } + + // Some fields and methods to reference in inner closures later + private val someSerializableValue = 1 + private val someNonSerializableValue = new NonSerializable + private def someSerializableMethod() = 1 + private def someNonSerializableMethod() = new NonSerializable + + /** Assert that the given closure is serializable (or not). */ + private def assertSerializable(closure: AnyRef, serializable: Boolean): Unit = { +if (serializable) { + closureSerializer.serialize(closure) +} else { + intercept[NotSerializableException] { +closureSerializer.serialize(closure) + } +} + } + + /** + * Helper method for testing whether closure cleaning works as expected. + * This cleans the given closure twice, with and without transitive cleaning. + */ + private def testClean( + closure: AnyRef, + serializableBefore: Boolean, + serializableAfter: Boolean): Unit = { +testClean(closure, serializableBefore, serializableAfter, transitive = true) +testClean(closure, serializableBefore, serializableAfter, transitive = false) + } + + /** Helper method for testing whether closure cleaning works as expected. 
*/ + private def testClean( + closure: AnyRef, + serializableBefore: Boolean, + serializableAfter: Boolean, + transitive: Boolean): Unit = { +assertSerializable(closure, serializableBefore) +// If the resulting closure is not serializable even after +// cleaning, we expect ClosureCleaner to throw a SparkException +if (serializableAfter) { + ClosureCleaner.clean(closure, checkSerializable = true, transitive) +} else { + intercept[SparkException] { +ClosureCleaner.clean(closure, checkSerializable = true, transitive) + } +} +assertSerializable(closure, serializableAfter) + } + + /** + * Return the fields accessed by the given closure by class. + * This also optionally finds the fields transitively referenced through methods invocations. + */ + private def findAccessedFields( + closure: AnyRef, + outerClasses: Seq[Class[_]], + findTransitively: Boolean): Map[Class[_], Set[String]] = { +val fields = new mutable.HashMap[Class[_], mutable.Set[String]] +outerClasses.foreach { c => fields(c) = new mutable.HashSet[String] } +ClosureCleaner.getClassReader(closure.getClass) + .accept(new FieldAccessFinder(fields, findTransitively), 0) +fields.mapV
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405344 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. 
+ */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt +val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long]) +val originalSchema = df.schema +val colInfo = cols.map { name => + val index = originalSchema.fieldIndex(name) + val dataType = originalSchema.fields(index) + (index, dataType.dataType) +} +val colIndices = colInfo.map(_._1) + +val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)( + seqOp = (counts, row) => { +var i = 0 +colIndices.foreach { index => + val thisMap = counts(i) + val key = row.get(index) + if (thisMap.contains(key)) { +thisMap(key) += 1 --- End diff -- `1` -> `1L` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
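A tiny standalone illustration of the point: with a `MutableMap[Any, Long]`, the explicit `Long` literal keeps the arithmetic in `Long` rather than leaning on `Int`-to-`Long` widening (a minimal sketch, outside of Spark):

~~~scala
import scala.collection.mutable.{Map => MutableMap}

// Sketch of the suggested literal fix.
val counts: MutableMap[Any, Long] = MutableMap.empty
val key: Any = "item"
if (counts.contains(key)) counts(key) += 1L else counts += key -> 1L
~~~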
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670189 new screenshot ![g8](https://cloud.githubusercontent.com/assets/1000778/7407142/57adb2a0-eec3-11e4-8492-7304da7baaa3.png) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670140 [Test build #31387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31387/consoleFull) for PR 5533 at commit [`a2972e9`](https://github.com/apache/spark/commit/a2972e988053168b7299f40d16cd2896a6eb8110). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405307 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. 
+ */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt +val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long]) +val originalSchema = df.schema +val colInfo = cols.map { name => + val index = originalSchema.fieldIndex(name) + val dataType = originalSchema.fields(index) + (index, dataType.dataType) +} +val colIndices = colInfo.map(_._1) + +val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)( --- End diff -- `df.select(cols).rdd.aggregate` (then you don't need to skip elements) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
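A hedged sketch of what that suggestion looks like end to end, reusing the `FreqItemCounter` wrapper suggested earlier in this thread; names and signatures are illustrative, not the final code. Projecting the requested columns first means field `i` of each incoming `Row` is column `i`, so the loop needs no index bookkeeping:

~~~scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Illustrative only: select the requested columns up front, then aggregate.
def freqItemCounters(df: DataFrame, cols: Seq[String], sizeOfMap: Int): Array[FreqItemCounter] = {
  val numCols = cols.length
  val zero = Array.fill(numCols)(new FreqItemCounter(sizeOfMap))
  df.select(cols.map(col): _*).rdd.aggregate(zero)(
    seqOp = (counts, row) => {
      var i = 0
      while (i < numCols) {
        counts(i).add(row.get(i), 1L) // field i of the projected row is column i
        i += 1
      }
      counts
    },
    combOp = (base, other) => {
      var i = 0
      while (i < numCols) {
        base(i).merge(other(i))
        i += 1
      }
      base
    })
}
~~~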
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670038 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6939][Streaming][WebUI] Add timeline an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5533#issuecomment-97670024 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405254 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length --- End diff -- Check the range of `support`. Warn if it is too small (e.g., 1e-6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
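A sketch of the kind of guard being asked for, placed at the top of `singlePassFreqItems`; the `1e-6` threshold comes straight from the comment, while the `(0, 1)` bound and the wording are assumptions (`FrequentItems` already extends `Logging`, so `logWarning` is in scope):

~~~scala
// Assumed guard: reject impossible values, warn on extreme ones.
require(support > 0.0 && support < 1.0,
  s"support must be in (0, 1), but got $support")
if (support < 1e-6) {
  logWarning(s"support = $support is very small; tracking on the order of " +
    s"${(1.0 / support).toLong} items per column may use a lot of memory.")
}
~~~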
[GitHub] spark pull request: [SPARK-7267][SQL]Push down Project when it's c...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/5797#issuecomment-97669998 Retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405251 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], --- End diff -- If multiple columns are provided, shall we search the combination of them instead of each individually? For example, if I call
~~~scala
freqItems(Array("gender", "title"), 0.01)
~~~
I'm expecting the frequent combinations instead of each of them. The current implementation is more flexible because users can create a struct from multiple columns, and this allows finding frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call `freqItems(Array("gender", "title"))` @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
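A hedged illustration of the struct workaround mentioned above, using the `gender`/`title` columns from the example and some DataFrame `df` that has them; it assumes a `struct(...)` function like the one being added in SPARK-7133 elsewhere in this digest, and the `stat.freqItems` shape currently under review:

~~~scala
import org.apache.spark.sql.functions.struct

// Count frequent *combinations* by first packing the columns into one struct
// column, then running freqItems on that single column.
val withPair = df.withColumn("gender_title", struct(df("gender"), df("title")))
val freqPairs = withPair.stat.freqItems(Array("gender_title"), 0.01)
~~~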
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405257 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala --- @@ -0,0 +1,124 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql.ml + + +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types.{StructType, ArrayType, StructField} + +import scala.collection.mutable.{Map => MutableMap} + +import org.apache.spark.Logging +import org.apache.spark.sql.{Row, DataFrame, functions} + +private[sql] object FrequentItems extends Logging { + + /** + * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in + * any emptied slots with the most frequent of `otherMap`. + * @param baseMap The map containing the global counts + * @param otherMap The map containing the counts for that partition + * @param maxSize The maximum number of counts to keep in memory + */ + private def mergeCounts[A]( + baseMap: MutableMap[A, Long], + otherMap: MutableMap[A, Long], + maxSize: Int): Unit = { +val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) => + if (!baseMap.contains(k)) sum + v else sum +} +baseMap.retain((k, v) => v > otherSum) +// sort in decreasing order, so that we will add the most frequent items first +val sorted = otherMap.toSeq.sortBy(-_._2) +var i = 0 +val otherSize = sorted.length +while (i < otherSize && baseMap.size < maxSize) { + val keyVal = sorted(i) + baseMap += keyVal._1 -> keyVal._2 + i += 1 +} + } + + + /** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * For Internal use only. + * + * @param df The input DataFrame + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ + private[sql] def singlePassFreqItems( + df: DataFrame, + cols: Array[String], + support: Double): DataFrame = { +val numCols = cols.length +// number of max items to keep counts for +val sizeOfMap = math.floor(1 / support).toInt --- End diff -- `math.floor` is not necessary: `(1.0 / support).toInt` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405043 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,37 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { + // scalastyle:on + +/** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. --- End diff -- Use DOI link: http://dx.doi.org/10.1145/762471.762473 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29405009 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,25 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { --- End diff -- take a look at how we implemented na. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404981 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,25 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { --- End diff -- I think it looks like `df.stat$.MODULE$.freqItems()`. I don't know how we can otherwise make it `df.stat.freqItems` in scala. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
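The `na` pattern rxin points to above resolves exactly this: expose the group through a def that returns a small holder class rather than a nested object, so Scala still reads `df.stat.freqItems(...)` while Java callers get an ordinary `stat()` method instead of the generated `stat$`/`MODULE$` machinery. A toy, self-contained sketch with illustrative names:

~~~scala
// Toy analogue of the def-plus-holder-class shape (all names illustrative).
class StatFunctions(frame: Frame) {
  def freqItems(cols: Seq[String], support: Double): String =
    s"freqItems(${cols.mkString(", ")}) at support $support on $frame" // placeholder
}

class Frame {
  // Same Scala call site as a nested object, but an ordinary method for Java.
  def stat: StatFunctions = new StatFunctions(this)
  override def toString: String = "Frame"
}

object HolderDemo extends App {
  println(new Frame().stat.freqItems(Seq("gender", "title"), 0.01))
}
~~~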
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5770 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667923 I'm going to let @mengxr comment on the actual algorithm implementation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404857 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ml/FrequentItemsSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.ml + +import org.apache.spark.sql.Row +import org.apache.spark.sql.types._ +import org.scalatest.FunSuite + +import org.apache.spark.sql.test.TestSQLContext + +class FrequentItemsSuite extends FunSuite { --- End diff -- move this to .sql package, and call it DataFrameStatSuite? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404815 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1414,4 +1415,37 @@ class DataFrame private[sql]( val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD() SerDeUtil.javaToPython(jrdd) } + + / + // Statistic functions + / + + // scalastyle:off + object stat { + // scalastyle:on + +/** + * Finding frequent items for columns, possibly with false positives. Using the algorithm + * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. + * + * @param cols the names of the columns to search frequent items in + * @param support The minimum frequency for an item to be considered `frequent` + * @return A Local DataFrame with the Array of frequent items for each column. + */ +def freqItems(cols: Array[String], support: Double): DataFrame = { --- End diff -- in df we usually support List[String] and Seq[String]. This is one reason why we are using a separate namespace. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
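A hedged sketch of the overload set this implies, written as methods of the stat holder; the Scala `Seq` and Java `java.util.List` variants both funnel into the `Array`-based implementation (signatures assumed, not final):

~~~scala
import scala.collection.JavaConverters._

// Illustrative overloads for Scala and Java callers.
def freqItems(cols: Array[String], support: Double): DataFrame =
  FrequentItems.singlePassFreqItems(df, cols, support)

def freqItems(cols: Seq[String], support: Double): DataFrame =
  freqItems(cols.toArray, support)

def freqItems(cols: java.util.List[String], support: Double): DataFrame =
  freqItems(cols.asScala.toArray, support)
~~~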
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404786

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -1414,4 +1415,37 @@ class DataFrame private[sql](
     val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
     SerDeUtil.javaToPython(jrdd)
   }
+
+  /////////////////////////////////////////////////////////////////////////////
+  // Statistic functions
+  /////////////////////////////////////////////////////////////////////////////
+
+  // scalastyle:off
+  object stat {
+  // scalastyle:on
+
+    /**
+     * Finding frequent items for columns, possibly with false positives. Using the algorithm
+     * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
--- End diff --

We should give a proper citation rather than just a URL, since the URL might disappear.
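For example, a Scaladoc citation along these lines keeps the pointer useful even if the PDF link rots (the publication details here are my reading of the paper and should be double-checked):

```
/**
 * Finding frequent items for columns, possibly with false positives, using the
 * frequent element count algorithm of R. M. Karp, S. Shenker, and
 * C. H. Papadimitriou, "A Simple Algorithm for Finding Frequent Elements in
 * Streams and Bags", ACM Transactions on Database Systems 28(1), 2003.
 * DOI: http://dx.doi.org/10.1145/762471.762473
 */
```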
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667662

[Test build #31386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/consoleFull) for PR 5799 at commit [`8279d4d`](https://github.com/apache/spark/commit/8279d4d4cb09f78e2f8f83f9a3738101b940ed40).
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404754

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.ml
--- End diff --

Let's put this in execution.stat? It's annoying to add a top-level package, because we have rules that specifically exclude existing packages.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667532

Merged build started.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97667521

Merged build triggered.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29404717

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -1414,4 +1415,25 @@ class DataFrame private[sql](
     val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
     SerDeUtil.javaToPython(jrdd)
   }
+
+  /////////////////////////////////////////////////////////////////////////////
+  // Statistic functions
+  /////////////////////////////////////////////////////////////////////////////
+
+  // scalastyle:off
+  object stat {
--- End diff --

Does this work in Java?
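For the record, an inner `object` on a class is reachable from Java (it compiles to an accessor returning the module instance), but the module's mangled type name makes it awkward to use there. A `def` returning a small holder class is the more Java-friendly shape; a minimal sketch, with `DataFrameStatFunctions` as a hypothetical name:

```
// Sketch: expose the statistics API as a method instead of an inner object,
// so Java callers see an ordinary df.stat() returning a normal class.
class DataFrame private[sql]( /* ... */ ) {
  def stat: DataFrameStatFunctions = new DataFrameStatFunctions(this)
}
```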
[GitHub] spark pull request: [SPARK-7120][SPARK-7121] Closure cleaner nesti...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/5685#discussion_r29404655

--- Diff: core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite2.scala ---
@@ -0,0 +1,562 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.io.NotSerializableException
+
+import scala.collection.mutable
+
+import org.scalatest.{BeforeAndAfterAll, FunSuite, PrivateMethodTester}
+
+import org.apache.spark.{SparkContext, SparkException}
+import org.apache.spark.serializer.SerializerInstance
+
+/**
+ * Another test suite for the closure cleaner that is finer-grained.
+ * For tests involving end-to-end Spark jobs, see {{ClosureCleanerSuite}}.
+ */
+class ClosureCleanerSuite2 extends FunSuite with BeforeAndAfterAll with PrivateMethodTester {
+
+  // Start a SparkContext so that the closure serializer is accessible
+  // We do not actually use this explicitly otherwise
+  private var sc: SparkContext = null
+  private var closureSerializer: SerializerInstance = null
+
+  override def beforeAll(): Unit = {
+    sc = new SparkContext("local", "test")
+    closureSerializer = sc.env.closureSerializer.newInstance()
+  }
+
+  override def afterAll(): Unit = {
+    sc.stop()
+    sc = null
+    closureSerializer = null
+  }
+
+  // Some fields and methods to reference in inner closures later
+  private val someSerializableValue = 1
+  private val someNonSerializableValue = new NonSerializable
+  private def someSerializableMethod() = 1
+  private def someNonSerializableMethod() = new NonSerializable
+
+  /** Assert that the given closure is serializable (or not). */
+  private def assertSerializable(closure: AnyRef, serializable: Boolean): Unit = {
+    if (serializable) {
+      closureSerializer.serialize(closure)
+    } else {
+      intercept[NotSerializableException] {
+        closureSerializer.serialize(closure)
+      }
+    }
+  }
+
+  /**
+   * Helper method for testing whether closure cleaning works as expected.
+   * This cleans the given closure twice, with and without transitive cleaning.
+   */
+  private def testClean(
--- End diff --

Maybe these should be called `verifyCleaning` or something, because they are really verifying behavior according to what you pass as `serializableBefore`/`serializableAfter`. Also, it would be good to document those parameters; it's not super clear what they mean.
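Something like the following documented signature would address both points; the doc text is a sketch of my reading of the two flags, not the PR's wording:

```
/**
 * Clean the given closure twice, with and without transitive cleaning, and
 * check serializability before and after.
 *
 * @param closure the closure to clean and verify
 * @param serializableBefore whether the closure should serialize successfully
 *                           even before cleaning; if false, serializing the
 *                           raw closure is expected to throw
 * @param serializableAfter whether the closure should serialize successfully
 *                          once it has been cleaned
 */
private def verifyCleaning(
    closure: AnyRef,
    serializableBefore: Boolean,
    serializableAfter: Boolean): Unit = {
  // clean with and without transitive cleaning, asserting at each step
}
```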
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
GitHub user brkyvz opened a pull request:

https://github.com/apache/spark/pull/5799

[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames

Finding frequent items, possibly with false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. Public API under:

```
df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
```

The output is a local DataFrame whose column names are the input column names with `-freqItems` appended. This is a single-pass algorithm that may return false positives, but no false negatives.

cc @mengxr @rxin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark freq-items

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5799.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5799

commit 3d82168544a29e7e1ae1326ab933db7e78a72dcc
Author: Burak Yavuz
Date: 2015-04-29T23:07:48Z

    made base implementation
    implemented frequent items
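To make the proposed API concrete, a minimal usage sketch against this signature (the toy data and the 0.4 threshold are made up; `TestSQLContext` is the helper already used by this PR's tests):

```
import org.apache.spark.sql.test.TestSQLContext
import TestSQLContext.implicits._

// Column "a" is 1 in three quarters of the rows, so 1 is a frequent item there.
val df = Seq.tabulate(100)(i => if (i % 4 != 0) (1, i) else (i, i)).toDF("a", "b")

// Search each column for items occurring in at least 40% of rows.
val freq = df.stat.freqItems(Array("a", "b"), support = 0.4)
freq.show()  // one row: an array of frequent items per input column
```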
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665645

Thanks. Merging in master.
[GitHub] spark pull request: [Build] Enable MiMa checks for SQL
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5727#issuecomment-97665454

LGTM.
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665312

[Test build #31377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31377/consoleFull) for PR 5770 at commit [`c68eaa7`](https://github.com/apache/spark/commit/c68eaa76eccdeb35d99dcbbedb30a9d6bed8eafd).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665315

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7225][SQL] CombineLimits optimizer does...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5770#issuecomment-97665316

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31377/
[GitHub] spark pull request: [SPARK-5521] PCA wrapper for easy transform ve...
Github user catap commented on a diff in the pull request: https://github.com/apache/spark/pull/4304#discussion_r29404295

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg._
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * A feature transformer that projects vectors to a low-dimensional space using PCA.
+ *
+ * @param k number of principal components
+ */
+class PCA(val k: Int) {
+  require(k >= 1)
+
+  /**
+   * Computes a [[PCAModel]] that contains the principal components of the input vectors.
+   *
+   * @param sources source vectors
+   */
+  def fit(sources: RDD[Vector]): PCAModel = {
+    require(k <= sources.first().size)
+
+    val mat = new RowMatrix(sources)
+    val pc = mat.computePrincipalComponents(k) match {
+      case dm: DenseMatrix =>
+        dm
+      /*
+       * Convert a sparse matrix to dense.
+       *
+       * RowMatrix.computePrincipalComponents always returns a dense matrix.
+       * The following code is a safeguard.
+       */
+      case sm: SparseMatrix =>
+        sm.toDense
+      case _ =>
+        throw new IllegalArgumentException("Unsupported matrix format. Expected " +
+          s"SparseMatrix or DenseMatrix. Instead got: ${mat.getClass}")
+    }
+    new PCAModel(k, pc)
+  }
+
+  /**
+   * Computes a [[PCAModel]] that contains the principal components of the input vectors.
+   *
+   * @param sources source vectors
+   */
+  def fit(sources: JavaRDD[Vector]): PCAModel =
+    fit(sources.rdd)
+}
+
+/**
+ * Model fitted by [[PCA]] that can project vectors to a low-dimensional space using PCA.
+ *
+ * @param k number of principal components
+ * @param pc a matrix of principal components
+ */
+class PCAModel private[mllib](val k: Int, val pc: DenseMatrix) extends VectorTransformer {
+  /**
+   * Transform a vector by the computed principal components.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector =
+    vector match {
+      case dv: DenseVector =>
+        pc.transpose.multiply(dv)
+      case sv: SparseVector =>
+        /* SparseVector -> single row SparseMatrix */
+        val rowIndices = new Array[Int](sv.indices.length)
--- End diff --

Thanks! Here I did the matrix transpose by hand :)
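For readers following along, the hand-rolled transpose amounts to repacking the `SparseVector` as a one-row `SparseMatrix` in CSC form; a sketch of the idea under that assumption (not the PR's exact code):

```
import org.apache.spark.mllib.linalg.{SparseMatrix, SparseVector}

// A 1 x n sparse matrix in CSC form: every row index is 0, and each nonzero
// column contributes exactly one entry between colPtrs(j) and colPtrs(j + 1).
def toSingleRowMatrix(sv: SparseVector): SparseMatrix = {
  val colPtrs = new Array[Int](sv.size + 1)
  var k = 0  // position within sv.indices (assumed sorted)
  for (j <- 0 until sv.size) {
    if (k < sv.indices.length && sv.indices(k) == j) k += 1
    colPtrs(j + 1) = k
  }
  val rowIndices = Array.fill(sv.values.length)(0)
  new SparseMatrix(1, sv.size, colPtrs, rowIndices, sv.values)
}
```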
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29403917

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")

     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
--- End diff --

So the output schema should be lowercase?
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97663859

[Test build #31385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/consoleFull) for PR 5776 at commit [`553005a`](https://github.com/apache/spark/commit/553005a4e9aebcbb42c712efd833118235d205dc).
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29403863

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---
@@ -29,12 +29,23 @@ package object analysis {

   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantic.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
--- End diff --

How about renaming the first `apply` to `resolve` and the second to `normalize`?
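Concretely, the suggestion would read as follows (a sketch of the rename only, based on the diff above):

```
trait Resolver {
  /** True if `a` refers to the same entity as `b`. */
  def resolve(a: String, b: String): Boolean
  /** Normalize `a` according to this resolver's semantics. */
  def normalize(a: String): String
}

val caseInsensitiveResolution = new Resolver {
  override def resolve(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
  override def normalize(a: String): String = a.toLowerCase // as Hive does
}
```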
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97663305

Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97663531

Merged build started.