[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733260

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
         // of RPCs are involved. Besides `totalSize`, there are also `numFiles`, `numRows`,
         // `rawDataSize` keys that we can look at in the future.
         BigInt(
    -      Option(hiveQlTable.getParameters.get("totalSize"))
    +      Option(hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE))
    --- End diff --

    Oh wow, this is a hard-to-find class!

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
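The `BigInt(Option(...))` pattern in the diff reads one statistics entry from the table's parameter map and must cope with the key being absent. A minimal, language-agnostic sketch of that lookup (plain Python; the helper name and default are hypothetical, while `"totalSize"` is the key Hive stores under `StatsSetupConst.TOTAL_SIZE`):

```python
# Sketch of the lookup in the diff above: fetch a table-statistics entry by
# its well-known key, falling back to a default when the key is missing.
# (In Hive the key constant lives in StatsSetupConst; the helper is hypothetical.)
TOTAL_SIZE = "totalSize"

def table_size_in_bytes(parameters, default=0):
    value = parameters.get(TOTAL_SIZE)
    return int(value) if value is not None else default

print(table_size_in_bytes({"totalSize": "1024", "numFiles": "3"}))  # 1024
print(table_size_in_bytes({}))                                      # 0
```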
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733265

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
         // of RPCs are involved. Besides `totalSize`, there are also `numFiles`, `numRows`,
    --- End diff --

    Perhaps update the comments here to say other fields in `StatsSetupConst` might be useful.
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733255

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -21,12 +21,15 @@ import java.io.{BufferedReader, File, InputStreamReader, PrintStream}
     import java.sql.Timestamp
     import java.util.{ArrayList => JArrayList}
    +import org.apache.hadoop.hive.ql.stats.StatsSetupConst
    --- End diff --

    Alphabetize imports.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733253

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import breeze.linalg.{DenseVector => BDV}
    +
    +import org.apache.spark.annotation.DeveloperApi
    +import org.apache.spark.mllib.linalg.distributed.RowMatrix
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: DeveloperApi ::
    + * Standardizes features by removing the mean and scaling to unit variance using column summary
    + * statistics on the samples in the training set.
    + *
    + * @param withMean True by default. Centers the data with mean before scaling. It will build a dense
    + *                 output, so this does not work on sparse input and will raise an exception.
    + * @param withStd True by default. Scales the data to unit standard deviation.
    --- End diff --

    It is okay to keep it. I'm thinking about the use cases. We might need centering without standardizing the columns. But it is a little weird to use this transformer for that, because it is called `StandardScaler` while centering alone is neither "standardizing" nor "scaling".
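The two flags under discussion are independent: `withMean` centers each column and `withStd` divides by the column's standard deviation, so "centering without standardizing" is the `withStd = false` case. A rough single-column sketch of their effect (plain Python, not the MLlib implementation; parameter names mirror the PR):

```python
import statistics

def standard_scale(column, with_mean=True, with_std=True):
    # Center and/or scale one feature column. with_std=False gives the
    # "centering without standardizing" case discussed above.
    mean = statistics.fmean(column) if with_mean else 0.0
    std = statistics.stdev(column) if with_std else 1.0  # sample std dev
    if std == 0.0:
        std = 1.0  # leave constant columns unscaled instead of dividing by zero
    return [(x - mean) / std for x in column]

print(standard_scale([1.0, 2.0, 3.0]))                   # [-1.0, 0.0, 1.0]
print(standard_scale([1.0, 2.0, 3.0], with_mean=False))  # scaled only
```

Note that centering a sparse column this way produces mostly non-zero entries, which is the densification concern the `@param withMean` doc warns about.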
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733248

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
    @@ -0,0 +1,47 @@
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.DeveloperApi
    +import org.apache.spark.mllib.linalg.Vector
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: DeveloperApi ::
    + * Trait for transformation of a vector
    + */
    +@DeveloperApi
    +trait VectorTransformer {
    +
    +  /**
    +   * Applies transformation on a vector.
    +   *
    +   * @param vector vector to be transformed.
    +   * @return transformed vector.
    +   */
    +  def transform(vector: Vector): Vector
    +
    +  /**
    +   * Applies transformation on a RDD[Vector].
    +   *
    +   * @param data RDD[Vector] to be transformed.
    +   * @return transformed RDD[Vector].
    +   */
    +  def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => this.transform(x))
    --- End diff --

    Can you elaborate on this?
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733244

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -0,0 +1,94 @@
    +/**
    + * :: DeveloperApi ::
    + * Standardizes features by removing the mean and scaling to unit variance using column summary
    + * statistics on the samples in the training set.
    + *
    + * @param withMean True by default. Centers the data with mean before scaling. It will build a dense
    + *                 output, so this does not work on sparse input and will raise an exception.
    + * @param withStd True by default. Scales the data to unit standard deviation.
    --- End diff --

    sklearn.preprocessing.StandardScaler has this API. If we want to minimize the set of parameters now, we can remove it for this release.
    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-50983421 @pwendell I didn't see `Closes #1379` in the merged commit. Is something wrong with asfgit?
[GitHub] spark pull request: SPARK-2792. Fix reading too much or too little...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1722#issuecomment-50983403 @aarondav / @mridulm any other comments on this, or is it okay to merge?
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-50983381 ... I have no idea. Let me check.
[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/991#issuecomment-50983352 QA tests have started for PR 991. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17806/consoleFull
[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/991#issuecomment-50983325 Jenkins, retest this please.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733224

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
    @@ -0,0 +1,58 @@
    +package org.apache.spark.mllib.feature
    +
    +import org.apache.spark.annotation.DeveloperApi
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +
    +/**
    + * :: DeveloperApi ::
    + * Normalizes samples individually to unit L^n norm
    + *
    + * @param n L^2 norm by default. Normalization in L^n space.
    + */
    +@DeveloperApi
    +class Normalizer(n: Int) extends VectorTransformer with Serializable {
    +
    +  def this() = this(2)
    +
    +  require(n > 0)
    --- End diff --

    Ah, we should use Double for the norm and also accept `Double.PositiveInfinity`. 1, 2, and `inf` are the popular norms.
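The suggestion above, taking the norm order as a `Double` and accepting `Double.PositiveInfinity`, can be sketched as follows (plain Python rather than the PR's Scala; function names are ours):

```python
import math

def p_norm(v, p):
    # p = inf gives the max-abs (infinity) norm; otherwise the usual L^p norm.
    if math.isinf(p):
        return max((abs(x) for x in v), default=0.0)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def normalize(v, p=2.0):
    # Scale a sample to unit L^p norm; zero vectors are returned unchanged.
    n = p_norm(v, p)
    return list(v) if n == 0.0 else [x / n for x in v]

print(normalize([3.0, 4.0]))              # [0.6, 0.8]
print(p_norm([-2.0, 1.0], float("inf")))  # 2.0
```

Using a float order also sidesteps the `Int` validity debate below: any finite `p >= 1` (or infinity) is a proper norm.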
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733221

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
    @@ -0,0 +1,58 @@
    +@DeveloperApi
    +class Normalizer(n: Int) extends VectorTransformer with Serializable {
    +
    +  def this() = this(2)
    +
    +  require(n > 0)
    --- End diff --

    I made it more explicit rather than saving one CPU cycle.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733217

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala ---
    @@ -0,0 +1,58 @@
    +@DeveloperApi
    +class Normalizer(n: Int) extends VectorTransformer with Serializable {
    +
    +  def this() = this(2)
    +
    +  require(n > 0)
    --- End diff --

    This is an Int. As long as we require p > 0, it implies p >= 1.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733213

    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/NormalizerSuite.scala ---
    @@ -0,0 +1,134 @@
    +package org.apache.spark.mllib.feature
    +
    +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.util.LocalSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +
    +class NormalizerSuite extends FunSuite with LocalSparkContext {
    +
    +  private def norm(v: Array[Double], n: Int): Double = {
    +    v.foldLeft[Double](0.0)((acc, value) => acc + Math.pow(Math.abs(value), n))
    +  }
    +
    +  val data = Array(
    +    Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
    +    Vectors.dense(0.0, 0.0, 0.0),
    +    Vectors.dense(0.6, -1.1, -3.0),
    +    Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))),
    +    Vectors.dense(5.7, 0.72, 2.7),
    +    Vectors.sparse(3, Seq())
    +  )
    +
    +  lazy val dataRDD = sc.parallelize(data, 3)
    +
    +  test("Normalization using L1 distance") {
    +    val l1Normalizer = new Normalizer(1)
    +
    +    val data1 = data.map(l1Normalizer.transform(_))
    +    val data1RDD = l1Normalizer.transform(dataRDD)
    +
    +    assert((data.map(_.toBreeze), data1.map(_.toBreeze), data1RDD.collect().map(_.toBreeze))
    +      .zipped.forall(
    +        (v1, v2, v3) => (v1, v2, v3) match {
    +          case (v1: BDV[Double], v2: BDV[Double], v3: BDV[Double]) => true
    +          case (v1: BSV[Double], v2: BSV[Double], v3: BSV[Double]) => true
    +          case _ => false
    +        }
    +      ), "The vector type should be preserved after normalization.")
    +
    +    assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5))
    +
    +    assert(norm(data1(0).toArray, 1) ~== 1.0 absTol 1E-5)
    +    assert(norm(data1(2).toArray, 1) ~== 1.0 absTol 1E-5)
    +    assert(norm(data1(3).toArray, 1) ~== 1.0 absTol 1E-5)
    +    assert(norm(data1(4).toArray, 1) ~== 1.0 absTol 1E-5)
    +
    +    assert(data1(0) ~== Vectors.sparse(3, Seq((0, -0.465116279), (1, 0.53488372))) absTol 1E-5)
    +    assert(data1(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5)
    +    assert(data1(2) ~== Vectors.dense(0.12765957, -0.23404255, -0.63829787) absTol 1E-5)
    +    assert(data1(3) ~== Vectors.sparse(3, Seq((1, 0.22141119), (2, 0.7785888))) absTol 1E-5)
    +    assert(data1(4) ~== Vectors.dense(0.625, 0.07894737, 0.29605263) absTol 1E-5)
    +    assert(data1(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5)
    +  }
    +
    +  test("Normalization using L2 distance") {
    +    val l2Normalizer = new Normalizer()
    +
    +    val data2 = data.map(l2Normalizer.transform(_))
    +    val data2RDD = l2Normalizer.transform(dataRDD)
    +
    +    assert((data.map(_.toBreeze), data2.map(_.toBreeze), data2RDD.collect().map(_.toBreeze))
    +      .zipped.forall(
    +        (v1, v2, v3) => (v1, v2, v3) match {
    +          case (v1: BDV[Double], v2: BDV[Double], v3: BDV[Double]) => true
    +          case (v1: BSV[Double], v2: BSV[Double], v3: BSV[Double]) => true
    +          case _ => false
    +        }
    +      ), "The vector type should be preserved after normalization.")
    +
    +    assert((data2, data2RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5))
    +
    +    assert(norm(data2(0).toArray, 2) ~== 1.0 absTol 1E-5)
    +    assert(norm(data2(2).toArray, 2) ~== 1.0 absTol 1E-5)
    +    assert(norm(data2(3).toArray, 2) ~== 1.0 absTol 1E-5)
    +    assert(norm(data2(4).toArray, 2) ~== 1.0 absTol 1E-5)
    +
    +    assert(data2(0) ~== Vectors.sparse(3, Seq((0, -0.65617871), (1, 0.75460552))) absTol 1E-5)
    +    assert(data2(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5)
    +    assert(data2(2) ~== Vectors.dense(0.184549876, -0.3383414, -0.922749378) absTol 1E-5)
    +    assert(data2(3) ~== Vectors.sparse(3, Seq((1, 0
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1207#issuecomment-50983292

    QA results for PR 1207:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class Normalizer(n: Int) extends VectorTransformer with Serializable {
      class StandardScaler(withMean: Boolean, withStd: Boolean)
      trait VectorTransformer {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733202

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -0,0 +1,94 @@
    +/**
    + * :: DeveloperApi ::
    + * Standardizes features by removing the mean and scaling to unit variance using column summary
    + * statistics on the samples in the training set.
    + *
    + * @param withMean True by default. Centers the data with mean before scaling. It will build a dense
    --- End diff --

    I would set `withMean` default to `false` because almost all datasets are sparse.
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733204

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -0,0 +1,94 @@
    +@DeveloperApi
    +class StandardScaler(withMean: Boolean, withStd: Boolean)
    +  extends VectorTransformer with Serializable {
    +
    +  def this() = this(true, true)
    +
    +  var mean: Vector = _
    +  var variance: Vector = _
    +
    +  /**
    +   * Computes the mean and variance and stores as a model to be used for later scaling.
    +   *
    +   * @param data The data used to compute the mean and variance to build the transformation model.
    +   * @return This StandardScalar object.
    +   */
    +  def fit(data: RDD[Vector]): this.type = {
    +    val summary = new RowMatrix(data).computeColumnSummaryStatistics
    --- End diff --

    Using `OnlineSummarizer` here?
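For reference, a single-pass mean/variance summarizer of the kind suggested can be sketched with Welford's algorithm (plain Python; `OnlineSummarizer` is the MLlib class being proposed, the class below is only an illustration of what such a summarizer computes):

```python
class RunningSummary:
    """One-pass running mean and unbiased sample variance (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

Because per-partition summaries of this shape can be merged, the approach fits a distributed aggregation better than materializing a `RowMatrix` just to read its column statistics.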
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1207#discussion_r15733206

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala ---
    @@ -0,0 +1,47 @@
    +/**
    + * :: DeveloperApi ::
    + * Trait for transformation of a vector
    + */
    +@DeveloperApi
    +trait VectorTransformer {
    --- End diff --

    Add `Serializable`?
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733207 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/VectorTransformer.scala --- @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.rdd.RDD + +/** + * :: DeveloperApi :: + * Trait for transformation of a vector + */ +@DeveloperApi +trait VectorTransformer { + + /** + * Applies transformation on a vector. + * + * @param vector vector to be transformed. + * @return transformed vector. + */ + def transform(vector: Vector): Vector + + /** + * Applies transformation on a RDD[Vector]. + * + * @param data RDD[Vector] to be transformed. + * @return transformed RDD[Vector]. + */ + def transform(data: RDD[Vector]): RDD[Vector] = data.map(x => this.transform(x)) --- End diff -- Note: to transform an RDD, we should broadcast the data we need instead of serialize it into the task closure. (This may become unnecessary because we broadcast RDD objects in Spark v1.1.) 
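The broadcast point can be illustrated with the usual closure-hygiene pattern: copy what the task needs into a local `val` (or a broadcast variable on Spark) before mapping, so the closure captures only that value and not the whole transformer object. A hypothetical sketch, with a plain `Seq` standing in for the RDD:

```scala
// Sketch of the pattern the comment recommends. Capturing `this.mean` inside
// the map lambda would serialize the whole object into the task closure;
// copying it to a local val first keeps the closure small. On Spark, the
// local copy would instead be `val bcMean = sc.broadcast(mean)` and the
// lambda would read `bcMean.value`.
class ScalerSketch(val mean: Array[Double]) {
  def transformAll(data: Seq[Array[Double]]): Seq[Array[Double]] = {
    val localMean = mean  // local copy: closure captures only this array
    data.map { v => v.zip(localMean).map { case (x, m) => x - m } }
  }
}
```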
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733203 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.distributed.RowMatrix +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +/** + * :: DeveloperApi :: + * Standardizes features by removing the mean and scaling to unit variance using column summary + * statistics on the samples in the training set. + * + * @param withMean True by default. Centers the data with mean before scaling. It will build a dense + * output, so this does not work on sparse input and will raise an exception. + * @param withStd True by default. Scales the data to unit standard deviation. --- End diff -- Are there use cases for `withStd == false`? (I'm trying to make a minimal set of parameters.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733205 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.distributed.RowMatrix +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +/** + * :: DeveloperApi :: + * Standardizes features by removing the mean and scaling to unit variance using column summary + * statistics on the samples in the training set. + * + * @param withMean True by default. Centers the data with mean before scaling. It will build a dense + * output, so this does not work on sparse input and will raise an exception. + * @param withStd True by default. Scales the data to unit standard deviation. 
+ */ +@DeveloperApi +class StandardScaler(withMean: Boolean, withStd: Boolean) + extends VectorTransformer with Serializable { + + def this() = this(true, true) + + var mean: Vector = _ + var variance: Vector = _ + + /** + * Computes the mean and variance and stores as a model to be used for later scaling. + * + * @param data The data used to compute the mean and variance to build the transformation model. + * @return This StandardScalar object. + */ + def fit(data: RDD[Vector]): this.type = { +val summary = new RowMatrix(data).computeColumnSummaryStatistics +this.mean = summary.mean +this.variance = summary.variance +require(mean.toBreeze.length == variance.toBreeze.length) +this + } + + /** + * Applies standardization transformation on a vector. + * + * @param vector Vector to be standardized. + * @return Standardized vector. If the variance of a column is zero, it will return default `0.0` + * for the column with zero variance. + */ + override def transform(vector: Vector): Vector = { +require(mean != null || variance != null, s"Please `fit` the model with training set first.") +require(vector.toBreeze.length == mean.toBreeze.length) + +if (withMean) { + vector.toBreeze match { +case dv: BDV[Double] => // pass +case v: Any => + throw new IllegalArgumentException("Do not support vector type " + v.getClass) + } +} + +val output = vector.toBreeze.copy +output.activeIterator.foreach { + case (i, value) => { +val shift = if (withMean) mean(i) else 0.0 +if (variance(i) != 0.0 && withStd) { + output(i) = (value - shift) / Math.sqrt(variance(i)) --- End diff -- ditto: same issue with random access --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
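The random-access concern here is the same as in the Normalizer: writing through `output(i)` on a sparse Breeze vector costs a binary search per element. A hypothetical dense-case sketch that walks the backing array once instead (plain arrays, names not from MLlib):

```scala
// Standardization over the raw values array, one linear pass, no per-index
// vector lookups. Zero-variance columns fall back to mean-centering only,
// matching the doc comment's "default 0.0" behavior when value == mean.
object StandardizeSketch {
  def standardize(values: Array[Double], mean: Array[Double], variance: Array[Double],
                  withMean: Boolean, withStd: Boolean): Array[Double] = {
    val out = values.clone()
    var i = 0
    while (i < out.length) {
      val shift = if (withMean) mean(i) else 0.0
      if (withStd && variance(i) != 0.0) {
        out(i) = (out(i) - shift) / math.sqrt(variance(i))
      } else if (withMean) {
        out(i) = out(i) - shift
      }
      i += 1
    }
    out
  }
}
```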
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733196 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * :: DeveloperApi :: + * Normalizes samples individually to unit L^n norm + * + * @param n L^2 norm by default. Normalization in L^n space. + */ +@DeveloperApi +class Normalizer(n: Int) extends VectorTransformer with Serializable { --- End diff -- `n` -> `p`, which is commonly used for norms. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733200 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * :: DeveloperApi :: + * Normalizes samples individually to unit L^n norm + * + * @param n L^2 norm by default. Normalization in L^n space. + */ +@DeveloperApi +class Normalizer(n: Int) extends VectorTransformer with Serializable { + + def this() = this(2) + + require(n > 0) + + /** + * Applies unit length normalization on a vector. + * + * @param vector vector to be normalized. + * @return normalized vector. If all the elements in vector are zeros, it will return as it. 
+ */ + override def transform(vector: Vector): Vector = { +var sum = 0.0 +vector.toBreeze.activeIterator.foreach { + case (i, value) => sum += Math.pow(Math.abs(value), n) +} + +val output = vector.toBreeze.copy +if (sum != 0.0) { + sum = Math.pow(sum, 1.0 / n) + output.activeIterator.foreach { +case (i, value) => output(i) = value / sum --- End diff -- For sparse vectors, `apply(Int)` is implemented using binary search. So we should operate on the values array directly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
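Concretely, since scaling by a constant preserves the zero pattern, only the stored values array needs touching; a minimal sketch of the values-array approach (plain arrays standing in for `SparseVector` storage, names hypothetical):

```scala
// Normalize the stored values of a (sparse or dense) vector directly,
// avoiding per-index apply(i) calls that are binary searches on a sparse
// vector. An all-zero input is returned unchanged, as in the PR.
object NormalizeSketch {
  def normalize(values: Array[Double], p: Int): Array[Double] = {
    val norm = math.pow(values.map(v => math.pow(math.abs(v), p)).sum, 1.0 / p)
    if (norm == 0.0) values.clone() else values.map(_ / norm)
  }
}
```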
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733198 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * :: DeveloperApi :: + * Normalizes samples individually to unit L^n norm + * + * @param n L^2 norm by default. Normalization in L^n space. + */ +@DeveloperApi +class Normalizer(n: Int) extends VectorTransformer with Serializable { + + def this() = this(2) + + require(n > 0) + + /** + * Applies unit length normalization on a vector. + * + * @param vector vector to be normalized. + * @return normalized vector. If all the elements in vector are zeros, it will return as it. + */ + override def transform(vector: Vector): Vector = { +var sum = 0.0 +vector.toBreeze.activeIterator.foreach { --- End diff -- We can use breeze's norm directly, e.g., `norm(v, 2.0)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733197 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * :: DeveloperApi :: + * Normalizes samples individually to unit L^n norm + * + * @param n L^2 norm by default. Normalization in L^n space. + */ +@DeveloperApi +class Normalizer(n: Int) extends VectorTransformer with Serializable { + + def this() = this(2) + + require(n > 0) --- End diff -- `p >= 1`? Any use case for `p \in (0, 1)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1207#discussion_r15733199 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.feature + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * :: DeveloperApi :: + * Normalizes samples individually to unit L^n norm + * + * @param n L^2 norm by default. Normalization in L^n space. + */ +@DeveloperApi +class Normalizer(n: Int) extends VectorTransformer with Serializable { + + def this() = this(2) + + require(n > 0) + + /** + * Applies unit length normalization on a vector. + * + * @param vector vector to be normalized. + * @return normalized vector. If all the elements in vector are zeros, it will return as it. + */ + override def transform(vector: Vector): Vector = { +var sum = 0.0 +vector.toBreeze.activeIterator.foreach { + case (i, value) => sum += Math.pow(Math.abs(value), n) +} + +val output = vector.toBreeze.copy --- End diff -- Should be faster if we branch on the vector type here. 
If the vector is sparse, we only need to copy its value array. Also, the `activeIterator` is not very efficient.
[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/991#issuecomment-50983223 QA results for PR 991:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait Lifecycle extends Service {
trait Service extends java.io.Closeable {
class SparkContext(config: SparkConf) extends Logging with Lifecycle {
class JavaStreamingContext(val ssc: StreamingContext) extends Lifecycle {
class JobGenerator(jobScheduler: JobScheduler) extends Logging with Lifecycle {
class JobScheduler(val ssc: StreamingContext) extends Logging with Lifecycle {
class ReceiverTracker(ssc: StreamingContext) extends Logging with Lifecycle {
class ReceiverLauncher extends Lifecycle {
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17805/consoleFull
[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1740#discussion_r15733162 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala --- @@ -60,4 +62,31 @@ class Strategy ( val isMulticlassWithCategoricalFeatures = isMulticlassClassification && (categoricalFeaturesInfo.size > 0) + /** + * Java-friendly constructor. + * + * @param algo classification or regression + * @param impurity criterion used for information gain calculation + * @param maxDepth Maximum depth of the tree. + * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. + * @param numClassesForClassification number of classes for classification. Default value is 2 + *leads to binary classification + * @param maxBins maximum number of bins used for splitting features + * @param categoricalFeaturesInfo A map storing information about the categorical variables and + *the number of discrete values they take. For example, an entry + *(n -> k) implies the feature n is categorical with k categories + *0, 1, 2, ... , k-1. It's important to note that features are + *zero-indexed. + */ + def this( + algo: Algo, + impurity: Impurity, + maxDepth: Int, + numClassesForClassification: Int, + maxBins: Int, + categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer]) { +this(algo, impurity, maxDepth, numClassesForClassification, maxBins, Sort, + categoricalFeaturesInfo.map{ case (a, b) => (a.toInt, b.toInt) }.toMap) --- End diff -- This seems to work: ~~~ import scala.collection.JavaConverters._ categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap ~~~ `JavaConverters` is preferred because the conversion is explicit via `asScala` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
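The `asScala` route can be exercised standalone. This sketch uses an explicit map over the boxed entries rather than the `asInstanceOf` cast in the reviewer's snippet, so it is a slight variant of that suggestion:

```scala
import scala.collection.JavaConverters._

object ConvertSketch {
  // Explicit conversion from the java.util.Map a Java-friendly constructor
  // receives to the Map[Int, Int] the Scala constructor expects. `asScala`
  // makes the Java-to-Scala step visible, which is why JavaConverters is
  // preferred over the implicit JavaConversions.
  def toScalaMap(m: java.util.Map[java.lang.Integer, java.lang.Integer]): Map[Int, Int] =
    m.asScala.map { case (k, v) => (k.intValue, v.intValue) }.toMap
}
```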
[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/991#issuecomment-50982869 QA tests have started for PR 991. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17805/consoleFull
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-50982709 QA results for PR 1750:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Sqrt(child: Expression) extends UnaryExpression {
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17801/consoleFull
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-50982699 @mengxr Is there any problem with asfgit? This is not finished yet; why did asfgit say it's merged into apache:master?
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1207#issuecomment-50982624 QA tests have started for PR 1207. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1207#issuecomment-50982574 QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1207#issuecomment-50982572 QA tests have started for PR 1207. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1207#issuecomment-50982521 QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull
[GitHub] spark pull request: [SPARK-2627] have the build enforce PEP 8 auto...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1744#issuecomment-50982524 QA results for PR 1744:
- This patch PASSES unit tests.
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17800/consoleFull
[GitHub] spark pull request: SPARK-2272 [MLlib] Feature scaling which stand...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1207#issuecomment-50982519 QA tests have started for PR 1207. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull
[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1747
[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1748
[GitHub] spark pull request: [Minor] Fixes on top of #1679
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1736
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1379
[GitHub] spark pull request: [SPARK-2678][Core] Added "--" to prevent spark...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1715#issuecomment-50982344

@andrewor14 Interesting - this is actually almost exactly the solution I came up with at the very beginning :) The only difference is that we chose `--` rather than a more intuitive option name like `--spark-application-args`. `--` was chosen because it is the idiomatic way on UNIX-like systems to pass this kind of "user application options". The reason we (@pwendell and I) gave it up after discussion is that this solution is not fully backward compatible: it breaks existing user applications that already recognize `--` as a valid option. Turning `--` into something more specific like `--spark-application-args` does reduce the probability of a name collision. In particular, after this change we won't have a similar compatibility issue whenever we add new options to `spark-submit` in the future. @pwendell Maybe this is acceptable? I also agree with your arguments about the drawbacks of putting the application jar into `--jars`. Similar arguments apply to Python applications; that is also an important reason I introduced `--primary` in the first place.
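The `--` idiom discussed above can be sketched in a few lines of Scala. This is illustrative only, not `spark-submit`'s actual parsing code: everything after the first `--` is handed to the user application untouched, so the launcher never mis-parses user options.

```scala
object ArgSeparator {
  // Split a command line at the first "--" into (toolOptions, applicationArgs).
  // If no "--" is present, all arguments belong to the tool.
  def split(args: Seq[String]): (Seq[String], Seq[String]) =
    args.indexOf("--") match {
      case -1 => (args, Seq.empty)
      case i  => (args.take(i), args.drop(i + 1))
    }
}
```

For example, `ArgSeparator.split(Seq("--master", "local", "--", "--master"))` keeps the second `--master` for the application even though it collides with a launcher option, which is exactly the ambiguity the separator resolves.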
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1741#issuecomment-50982297

QA results for PR 1741:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17799/consoleFull
[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1748#issuecomment-50982167 Thanks Sean - I'll merge this.
[GitHub] spark pull request: [SPARK-2627] have the build enforce PEP 8 auto...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1744#issuecomment-50982162

Hey Nick - thanks for taking a crack at this. It's great to see us adding more automated code quality checks. A couple of things:

1. Could you add `[PySpark]` to the title of this PR? We use tags like that for sorting among the committership, and it will get noticed that way.
2. In terms of the dependency on pep8, we've tried really hard to avoid having exogenous dependencies in Spark; they make porting things like our QA environment very difficult. So one idea: could this script just lazily fetch the pep8 library directly? For instance, this is what we do with our sbt tool - we just wget the sbt jar. It seems like you could do something similar for pep8. Not sure if that totally works, but just an idea.
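The "fetch it lazily, like the sbt jar" pattern suggested above can be sketched as follows. This is a hypothetical helper, not code from the Spark build; the URL and destination path would be whatever the lint script chooses.

```scala
import java.io.File
import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

object LazyFetch {
  // Download `url` to `dest` only if the file is not already present,
  // mirroring the "wget the sbt jar on demand" approach: the dependency
  // never has to be checked into the repo or pre-installed on the machine.
  def fetchIfMissing(url: String, dest: File): File = {
    if (!dest.exists()) {
      val in = new URL(url).openStream()
      try Files.copy(in, dest.toPath, StandardCopyOption.REPLACE_EXISTING)
      finally in.close()
    }
    dest
  }
}
```

A lint wrapper would call `fetchIfMissing` once at startup and then invoke the downloaded tool, so the first run pays the download cost and later runs are offline.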
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-50981820 QA tests have started for PR 1750. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17801/consoleFull
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1750

SPARK-2813: [SQL] Implement SQRT() directly in Catalyst

This PR adds a native implementation for SQL SQRT() and thus avoids delegating this function to Hive.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-2813

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1750.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1750

commit 18d63f93316e56b9f0e137e272981b5a2eb84074
Author: William Benton
Date: 2014-08-02T15:30:26Z

    Added native SQRT implementation

commit bb8022612c468ae99531fbcc9ddff8a5f45bcf36
Author: William Benton
Date: 2014-08-02T16:22:40Z

    added SQRT test to SqlQuerySuite
[GitHub] spark pull request: [SPARK-2627] have the build enforce PEP 8 auto...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1744#issuecomment-50981650 QA tests have started for PR 1744. This patch DID NOT merge cleanly! View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17800/consoleFull
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1362#issuecomment-50981523 Great, thanks!
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1741#issuecomment-50981507 QA tests have started for PR 1741. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17799/consoleFull
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50981508 Sorry, I'm away from home and had limited time and access when attempting the merge last night - I didn't finish it, and as you mentioned, it messed up the included commits. I'll post an explicit comment here when the merge is ready.
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1741#issuecomment-50981485 Jenkins, test this please.
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-50981393

QA results for PR 1341:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17798/consoleFull
[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1740#issuecomment-50981238

QA results for PR 1740:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17797/consoleFull
[GitHub] spark pull request: Spark 2017
Github user carlosfuertes commented on the pull request: https://github.com/apache/spark/pull/1682#issuecomment-50980868 I added a configuration property "spark.ui.jsRenderingEnabled" that controls whether the rendering of the tables happens using JavaScript. It is enabled by default. This ensures that people who cannot, or do not want to, run JavaScript for rendering can still use the web UI as before.
[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1746#issuecomment-50980687

QA results for PR 1746:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  * into Spark SQL's query functions (i.e. sql()). Otherwise, users of this trait can

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17795/consoleFull
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-50980619 QA tests have started for PR 1341. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17798/consoleFull
[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1743#issuecomment-50980616

QA results for PR 1743:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17794/consoleFull
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-50980426

QA results for PR 1341:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17796/consoleFull
[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1740#issuecomment-50980415 QA tests have started for PR 1740. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17797/consoleFull
[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/1740#issuecomment-50980396 Thanks for the comments! I pushed the changes. The only remaining item is JavaConverters for Strategy; I'm not sure how to get it to work there.
[GitHub] spark pull request: [SPARK-2197] [mllib] Java DecisionTree bug fix...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/1740#discussion_r15732689

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala ---
@@ -60,4 +62,31 @@ class Strategy (
   val isMulticlassWithCategoricalFeatures = isMulticlassClassification &&
     (categoricalFeaturesInfo.size > 0)
+
+  /**
+   * Java-friendly constructor.
+   *
+   * @param algo classification or regression
+   * @param impurity criterion used for information gain calculation
+   * @param maxDepth Maximum depth of the tree.
+   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
+   * @param numClassesForClassification number of classes for classification. Default value is 2,
+   *                 leads to binary classification
+   * @param maxBins maximum number of bins used for splitting features
+   * @param categoricalFeaturesInfo A map storing information about the categorical variables and
+   *                 the number of discrete values they take. For example, an entry
+   *                 (n -> k) implies the feature n is categorical with k categories
+   *                 0, 1, 2, ... , k-1. It's important to note that features are
+   *                 zero-indexed.
+   */
+  def this(
+      algo: Algo,
+      impurity: Impurity,
+      maxDepth: Int,
+      numClassesForClassification: Int,
+      maxBins: Int,
+      categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer]) {
+    this(algo, impurity, maxDepth, numClassesForClassification, maxBins, Sort,
+      categoricalFeaturesInfo.map { case (a, b) => (a.toInt, b.toInt) }.toMap)
+  }

--- End diff --

I tried using that, but could not figure out how to make it work. The issue is that the integer type used by the map is not converted properly. Is there a good way to do that?
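For the Integer-conversion question raised above, one common pattern (a sketch, not the code the PR ended up with) is to use JavaConverters' `asScala` and then unbox explicitly: `asScala` only wraps the Java map as `mutable.Map[java.lang.Integer, java.lang.Integer]`, so the `Integer -> Int` conversion still has to be done by hand, which is the sticking point.

```scala
import scala.collection.JavaConverters._

object MapConv {
  // Convert a java.util.Map[Integer, Integer] into an immutable
  // Scala Map[Int, Int]: wrap with asScala, unbox each entry, copy to
  // an immutable Map so the Java caller's map can't mutate it later.
  def toScalaIntMap(m: java.util.Map[java.lang.Integer, java.lang.Integer]): Map[Int, Int] =
    m.asScala.map { case (k, v) => (k.intValue, v.intValue) }.toMap
}
```

A Java-friendly constructor can then delegate through `toScalaIntMap(categoricalFeaturesInfo)` and pass the result to the primary constructor.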
[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1743
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-50979713 QA tests have started for PR 1341. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17796/consoleFull
[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1746#issuecomment-50979645 QA tests have started for PR 1746. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17795/consoleFull
[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1746#issuecomment-50979639 test this please
[GitHub] spark pull request: SPARK-2481: The environment variables SPARK_HI...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1341#issuecomment-50979611 @pwendell I think the PR can be merged into 1.1
[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1746#issuecomment-50979618 Github is pretty confused about this one now since apache is lagging...
[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1743#issuecomment-50979512 QA tests have started for PR 1743. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17794/consoleFull
[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1743#issuecomment-50979490 Thanks for looking this over! I've merged to master and 1.1
[GitHub] spark pull request: SPARK-2804: Remove scalalogging-slf4j dependen...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/1208#issuecomment-50979440 Cool thanks
[GitHub] spark pull request: [SPARK-1470][SPARK-1842] Use the scala-logging...
Github user witgo closed the pull request at: https://github.com/apache/spark/pull/332
[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1743#issuecomment-50979387

QA results for PR 1743:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  trait OverrideFunctionRegistry extends FunctionRegistry {
  class SimpleFunctionRegistry extends FunctionRegistry {
  protected[sql] trait UDFRegistration {
  class JavaSQLContext(val sqlContext: SQLContext) extends UDFRegistration {
  case class EvaluatePython(udf: PythonUDF, child: LogicalPlan) extends logical.UnaryNode {
  case class BatchPythonEvaluation(udf: PythonUDF, output: Seq[Attribute], child: SparkPlan)

For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17791/consoleFull
[GitHub] spark pull request: [SPARK-1997] mllib - upgrade to breeze 0.8.1
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1749#issuecomment-50979270 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1997] mllib - upgrade to breeze 0.8.1
GitHub user avati opened a pull request: https://github.com/apache/spark/pull/1749 [SPARK-1997] mllib - upgrade to breeze 0.8.1 Signed-off-by: Anand Avati You can merge this pull request into a Git repository by running: $ git pull https://github.com/avati/spark SPARK-1997-breeze-0.8.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1749.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1749 commit 5a9e6ba7694fb67e50a9cd469ce65ed5f7b91b0d Author: Anand Avati Date: 2014-07-26T04:06:48Z [SPARK-1997] mllib - upgrade to breeze 0.8.1 Signed-off-by: Anand Avati
[GitHub] spark pull request: [SPARK-2797] [SQL] SchemaRDDs don't support un...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1745
[GitHub] spark pull request: [SPARK-2729][SQL] Added test case for SPARK-27...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1738
[GitHub] spark pull request: [Minor] Fixes on top of #1679
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/1736#discussion_r15732470 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerSource.scala --- @@ -46,9 +46,8 @@ private[spark] class BlockManagerSource(val blockManager: BlockManager, sc: Spar metricRegistry.register(MetricRegistry.name("memory", "memUsed_MB"), new Gauge[Long] { override def getValue: Long = { val storageStatusList = blockManager.master.getStorageStatus - val maxMem = storageStatusList.map(_.maxMem).sum - val remainingMem = storageStatusList.map(_.memRemaining).sum - (maxMem - remainingMem) / 1024 / 1024 + val memUsed = storageStatusList.map(_.memUsed).sum + memUsed / 1024 / 1024 --- End diff -- Btw, it is just a nit - so please don't let this block a commit!
[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1746#issuecomment-50979067 LGTM. Two places in the programming guide need to be updated. ``` .//docs/sql-programming-guide.md:the `sql` method a `JavaHiveContext` also provides an `hql` methods, which allows queries to be .//docs/sql-programming-guide.md:the `sql` method a `HiveContext` also provides an `hql` methods, which allows queries to be ``` But, since we will work on doc next week, we can update these in our PR for doc.
[GitHub] spark pull request: [SPARK-2797] [SQL] SchemaRDDs don't support un...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1745#issuecomment-50979052 Thanks! I've merged this to master and 1.1
[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1747#issuecomment-50979051 QA results for PR 1747:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17793/consoleFull
[GitHub] spark pull request: [SPARK-2797] [SQL] SchemaRDDs don't support un...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1745#issuecomment-50979025 QA results for PR 1745:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17790/consoleFull
[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1748#issuecomment-50979032 QA results for PR 1748:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17792/consoleFull
[GitHub] spark pull request: [Minor] Fixes on top of #1679
Github user mridulm commented on a diff in the pull request: https://github.com/apache/spark/pull/1736#discussion_r15732438 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerSource.scala --- @@ -46,9 +46,8 @@ private[spark] class BlockManagerSource(val blockManager: BlockManager, sc: Spar metricRegistry.register(MetricRegistry.name("memory", "memUsed_MB"), new Gauge[Long] { override def getValue: Long = { val storageStatusList = blockManager.master.getStorageStatus - val maxMem = storageStatusList.map(_.maxMem).sum - val remainingMem = storageStatusList.map(_.memRemaining).sum - (maxMem - remainingMem) / 1024 / 1024 + val memUsed = storageStatusList.map(_.memUsed).sum + memUsed / 1024 / 1024 --- End diff -- bad code is bad code :-)
[GitHub] spark pull request: [SPARK-2784][SQL] Deprecate hql() method in fa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1746#issuecomment-50978890 QA results for PR 1746:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  * into Spark SQL's query functions (i.e. sql()). Otherwise, users of this trait can
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17789/consoleFull
[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1481#issuecomment-50978548 The best way might be to do something like this.
```
/**
 * Alias for WriterMetrics for compatibility reasons
 */
@DeveloperApi
class ShuffleWriteMetrics extends WriterMetrics

/**
 * :: DeveloperApi ::
 * Metrics pertaining to data written through a BlockObjectWriter.
 */
@DeveloperApi
private[spark] class WriterMetrics extends Serializable {

  /**
   * Number of bytes written for this task
   */
  var shuffleBytesWritten: Long = _

  /**
   * Time the task spent blocking on writes to disk or buffer cache, in nanoseconds
   */
  var shuffleWriteTime: Long = _
}
```
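The keep-the-old-name-as-a-subclass pattern suggested above can be exercised in plain Scala. A minimal sketch with illustrative stand-ins (these are not Spark's real classes): code that constructs the old name keeps compiling, and the instance works wherever the new type is expected.

```scala
// Sketch of the compatibility-alias pattern; names are illustrative only.
class WriterMetrics extends Serializable {
  var shuffleBytesWritten: Long = 0L  // bytes written by this task
  var shuffleWriteTime: Long = 0L     // nanoseconds spent blocking on writes
}

// Old public name retained as a thin subclass for source compatibility.
@deprecated("use WriterMetrics instead", "1.1.0")
class ShuffleWriteMetrics extends WriterMetrics

object CompatDemo extends App {
  // Existing callers that use the old name still work unchanged.
  val m: WriterMetrics = new ShuffleWriteMetrics
  m.shuffleBytesWritten = 1024L
  println(m.shuffleBytesWritten)
}
```

A subclass (rather than a type alias) keeps the old name visible to Java callers as well, which matters for a `@DeveloperApi` type.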
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-50978501 I might give more explanation on the trace printed above:
```
Set()
ANY,NODE_LOCAL
task 1, ArrayBuffer()
task 0, ArrayBuffer(TaskLocation(localhost, None))
miss task
```
The first line is the speculative tasks; the second line is maxLocality, allowedLocality; the third to the second-last lines are the tasks in allPendingTasks and their locality preferences; the last line is whether the TaskSetManager finds a task. From the trace above, we can see that the non-pref tasks indeed experience unnecessary delay, causing the test case to be interrupted.
[GitHub] spark pull request: [SPARK-2729][SQL] Added test case for SPARK-27...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1738#issuecomment-50978394 Thanks! I've merged this into master and 1.1
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50978374 This seems to have captured a bunch of unrelated changes during the rebase.
[GitHub] spark pull request: SPARK-2566. Update ShuffleWriteMetrics increme...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1481#issuecomment-50978354 @sryza Can't the `ExternalSorter` and `ExternalAppendOnlyMap` just pass their own `ShuffleWriteMetrics` when they create a disk writer and then read back the bytes written? We could also change the name of `ShuffleWriteMetrics` to just be `WriteMetrics` - or we could leave it for now and just put a TODO.
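The idea of each spilling collection owning its own metrics object can be sketched without Spark internals. `DiskWriter` below is a hypothetical stand-in for `BlockObjectWriter`: the spiller hands the writer a private metrics object and reads the byte count back afterwards, instead of mutating a shared task-wide counter.

```scala
// Sketch only: simplified stand-ins, not Spark's real types.
class WriteMetrics { var bytesWritten: Long = 0L }

// Hypothetical writer; real code would also write the bytes to disk.
class DiskWriter(metrics: WriteMetrics) {
  def write(chunk: Array[Byte]): Unit = {
    metrics.bytesWritten += chunk.length
  }
}

object SpillDemo extends App {
  val spillMetrics = new WriteMetrics        // owned by this spill, not global
  val writer = new DiskWriter(spillMetrics)
  writer.write(new Array[Byte](4096))
  writer.write(new Array[Byte](1024))
  // The caller reads back exactly what this spill wrote: 5120 bytes.
  println(spillMetrics.bytesWritten)
}
```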
[GitHub] spark pull request: [SPARK-2478] [mllib] DecisionTree Python API
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1727#discussion_r15732257 --- Diff: examples/src/main/python/mllib/tree.py ---
```
@@ -0,0 +1,129 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Decision tree classification and regression using MLlib.
+"""
+
+import numpy, os, sys
+
+from operator import add
+
+from pyspark import SparkContext
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.tree import DecisionTree
+from pyspark.mllib.util import MLUtils
+
+
+def getAccuracy(dtModel, data):
+    """
+    Return accuracy of DecisionTreeModel on the given RDD[LabeledPoint].
+    """
+    seqOp = (lambda acc, x: acc + (x[0] == x[1]))
+    predictions = dtModel.predict(data.map(lambda x: x.features))
+    truth = data.map(lambda p: p.label)
+    trainCorrect = predictions.zip(truth).aggregate(0, seqOp, add)
+    return trainCorrect / (0.0 + data.count())
+
+
+def getMSE(dtModel, data):
+    """
+    Return mean squared error (MSE) of DecisionTreeModel on the given
+    RDD[LabeledPoint].
+    """
+    seqOp = (lambda acc, x: acc + numpy.square(x[0] - x[1]))
+    predictions = dtModel.predict(data.map(lambda x: x.features))
+    truth = data.map(lambda p: p.label)
+    trainMSE = predictions.zip(truth).aggregate(0, seqOp, add)
+    return trainMSE / (0.0 + data.count())
+
+
+def reindexClassLabels(data):
+    """
+    Re-index class labels in a dataset to the range {0,...,numClasses-1}.
+    If all labels in that range already appear at least once,
+    then the returned RDD is the same one (without a mapping).
+    Note: If a label simply does not appear in the data,
+          the index will not include it.
+          Be aware of this when reindexing subsampled data.
+    :param data: RDD of LabeledPoint where labels are integer values
+                 denoting labels for a classification problem.
+    :return: Pair (reindexedData, origToNewLabels) where
+             reindexedData is an RDD of LabeledPoint with labels in
+             the range {0,...,numClasses-1}, and
+             origToNewLabels is a dictionary mapping original labels
+             to new labels.
+    """
+    # classCounts: class --> # examples in class
+    classCounts = data.map(lambda x: x.label).countByValue()
+    numExamples = sum(classCounts.values())
+    sortedClasses = sorted(classCounts.keys())
+    numClasses = len(classCounts)
+    # origToNewLabels: class --> index in 0,...,numClasses-1
+    if (numClasses < 2):
+        print >> sys.stderr, \
+            "Dataset for classification should have at least 2 classes." + \
+            " The given dataset had only %d classes." % numClasses
+        exit(-1)
+    origToNewLabels = dict([(sortedClasses[i], i) for i in range(0, numClasses)])
+
+    print "numClasses = %d" % numClasses
+    print "Per-class example fractions, counts:"
+    print "Class\tFrac\tCount"
+    for c in sortedClasses:
+        frac = classCounts[c] / (numExamples + 0.0)
+        print "%g\t%g\t%d" % (c, frac, classCounts[c])
+
+    if (sortedClasses[0] == 0 and sortedClasses[-1] == numClasses - 1):
```
--- End diff -- Only the first and the last were checked. The values in the middle could be something like `0.5`.
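The endpoint-only check flagged above can be reproduced with a small example: a label set whose smallest and largest values line up with `0` and `numClasses - 1` passes the check even though a middle value is not a valid class index. A sketch in Scala (the diff itself is Python, but the logic is the same):

```scala
// Demonstrates the reviewer's point: inspecting only sortedClasses.head and
// sortedClasses.last accepts invalid labels in the middle, e.g. 0.5.
object EndpointCheckDemo extends App {
  val sortedClasses = Seq(0.0, 0.5, 2.0).sorted
  val numClasses = sortedClasses.size        // 3

  // The check from the diff: only the two endpoints are inspected.
  val passesEndpointCheck =
    sortedClasses.head == 0 && sortedClasses.last == numClasses - 1
  println(passesEndpointCheck)               // true -- 0.5 slips through

  // A full scan over every label catches the bad value.
  val allValid =
    sortedClasses.forall(l => l == l.toInt && l >= 0 && l < numClasses)
  println(allValid)                          // false
}
```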
[GitHub] spark pull request: [SPARK-2097][SQL] UDF Support
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1063
[GitHub] spark pull request: [SPARK-2785][SQL] Remove assertions that throw...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1742
[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1747#issuecomment-50978057 QA tests have started for PR 1747. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17793/consoleFull
[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1748#issuecomment-50978055 QA tests have started for PR 1748. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17792/consoleFull
[GitHub] spark pull request: SPARK-2414 [BUILD] Add LICENSE entry for jquer...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1748 SPARK-2414 [BUILD] Add LICENSE entry for jquery The JIRA concerned removing jquery, and this does not remove jquery. While it is distributed by Spark it should have an accompanying line in LICENSE, very technically, as per http://www.apache.org/dev/licensing-howto.html You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-2414 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1748.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1748 commit 2fdb03c99d6d802c85c4d2033d670eafd4bcb118 Author: Sean Owen Date: 2014-08-02T23:51:15Z Add LICENSE entry for jquery
[GitHub] spark pull request: [SPARK-2739][SQL] Rename registerAsTable to re...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1743#issuecomment-50978019 Good catch @yhuai. I've updated the java files as well.
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1741#issuecomment-50977990 Hmmm, linux vs mac file size problems?
```
[info] StatisticsSuite:
[info] - analyze MetastoreRelations *** FAILED ***
[info]   11768 did not equal 11624 (StatisticsSuite.scala:42)
```
[GitHub] spark pull request: SPARK-2602 [BUILD] Tests steal focus under Jav...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/1747 SPARK-2602 [BUILD] Tests steal focus under Java 6 As per https://issues.apache.org/jira/browse/SPARK-2602 , this may be resolved for Java 6 with the java.awt.headless system property, which never hurt anyone running a command line app. I tested it and seemed to get rid of focus stealing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-2602 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1747.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1747 commit b141018061365cb42f2991506ee4ecd4bd4f377b Author: Sean Owen Date: 2014-08-02T23:47:24Z Set java.awt.headless during tests
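For reference, the same system property can be set for forked test JVMs in an sbt build with a one-line setting. This is only an illustrative sketch: the PR itself touches the project's own build files, which may differ from this.

```scala
// Hedged sketch (sbt build assumed): pass the headless flag to test JVMs so
// test suites cannot grab GUI focus. The actual PR may set this elsewhere.
javaOptions in Test += "-Djava.awt.headless=true"
```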