[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2634

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-103156795

Do you mind closing this PR?
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-103169922

We can't directly but there's an automated process that will eventually. Don't worry about it and/or get to it later.
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-103169503

Sure, but I'm traveling now. Would you mind closing it for me?

Sent from my iPhone

On May 18, 2015, at 8:19 PM, Sean Owen notificati...@github.com wrote: Do you mind closing this PR?
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-103178283

Ok thx!

Sent from my iPhone

On May 18, 2015, at 8:47 PM, Sean Owen notificati...@github.com wrote: We can't directly but there's an automated process that will eventually. Don't worry about it and/or get to it later.
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23507379

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

@mengxr By separating the initialization of cluster centers from the clustering, one may trivially implement several variants of standard K-means clustering, including the Wavelet-Based Anytime Algorithm for K-Means Clustering of Time Series (http://www.cs.gmu.edu/~jessica/publications/ikmeans_sdm_workshop03.pdf), which has been demonstrated to give better-quality clusterings in less time than the standard algorithm.
To demonstrate this, I have implemented the aforementioned algorithm in roughly 50 lines of new code. See com.massivedatascience.clusterer.WaveletKMeans.

On Sat, Jan 24, 2015 at 1:45 AM, Derrick Burns derrickrbu...@gmail.com wrote: @mengxr I have rewritten the README file https://github.com/derrickburns/generalized-kmeans-clustering for my clustering GitHub project. I think that you may find it useful.

On Thu, Jan 22, 2015 at 7:43 PM, Derrick Burns derrickrbu...@gmail.com wrote: @mengxr I spent some time thinking about clustering sparse data. I realized that there may be a need to embed the sparse data into lower-dimensional spaces. Then I realized that one of the Bregman divergences (i.e., the generalized symmetrized Bregman divergence) also embeds data into a different space before clustering. This led me to the conclusion that embedding the data into a convenient space just prior to clustering is a very useful generic feature. Adding it was trivial. See the README on my GitHub project for details. I implemented two generic embeddings: 1) a sparse-to-dense embedding that maps points isomorphically, and 2) random indexing (https://www.sics.se/~mange/papers/RI_intro.pdf), which maps high-dimensional data into lower dimension in a way that preserves similarity per the Johnson-Lindenstrauss lemma (http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma). I also recoded the embedding that I did for a generalized symmetrized KL divergence to use the new embedding design. I think that with these improvements one can check the "clusterer supports sparse data" feature box in a practical sense. I am about to begin performance testing of the random indexing embedding on large volumes of high-dimensional sparse data. Please take a look at the new PointOps trait definition.
On Mon, Jan 19, 2015 at 5:36 PM, Derrick Burns derrickrbu...@gmail.com wrote: For the squared Euclidean distance case, weights would naturally start at 1.0 for each input point, so that a cluster with 100 points would have weight 100.0. However, for the KL divergence, where the points are derived from frequencies, the initial weight could be the sum of the frequencies. This puts the points on the simplex without having to perform the division.
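The random-indexing embedding described in this thread relies on the Johnson-Lindenstrauss property: a random linear map into a lower dimension approximately preserves pairwise distances. Below is a minimal Python sketch of that idea using a Gaussian random projection; it is illustrative only (the PR's actual embedding is Scala code in the generalized-kmeans-clustering repository, and random indexing proper uses sparse ternary vectors rather than a dense Gaussian matrix).

```python
import numpy as np

def random_projection(points, target_dim, seed=0):
    """Project rows of `points` (n x d) down to `target_dim` dimensions with
    a Gaussian random matrix. By the Johnson-Lindenstrauss lemma, pairwise
    squared distances are approximately preserved when target_dim is large
    enough relative to log(n). Illustrative sketch, not the PR's code."""
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    # Scale by 1/sqrt(target_dim) so expected squared norms are preserved.
    r = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return points @ r
```

After embedding, any Euclidean clusterer can be run on the low-dimensional points, which is exactly the "embed just prior to clustering" pattern the comment advocates.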
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23495418

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

@mengxr I have rewritten the README file https://github.com/derrickburns/generalized-kmeans-clustering for my clustering GitHub project. I think that you may find it useful.
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23430407

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

@mengxr I spent some time thinking about clustering sparse data. I realized that there may be a need to embed the sparse data into lower-dimensional spaces. Then I realized that one of the Bregman divergences (i.e., the generalized symmetrized Bregman divergence) also embeds data into a different space before clustering. This led me to the conclusion that embedding the data into a convenient space just prior to clustering is a very useful generic feature. Adding it was trivial. See the README on my GitHub project for details. I implemented two generic embeddings: 1) a sparse-to-dense embedding that maps points isomorphically, and 2) random indexing (https://www.sics.se/~mange/papers/RI_intro.pdf), which maps high-dimensional data into lower dimension in a way that preserves similarity per the Johnson-Lindenstrauss lemma (http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma). I also recoded the embedding that I did for a generalized symmetrized KL divergence to use the new embedding design. I think that with these improvements one can check the "clusterer supports sparse data" feature box in a practical sense. I am about to begin performance testing of the random indexing embedding on large volumes of high-dimensional sparse data. Please take a look at the new PointOps trait definition.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70588270

@derrickburns We use dense vectors to store cluster centers, because the centers are very likely to become dense during aggregation. If there are zeros, they can be efficiently compressed before sending back to the driver. For performance, we never add two sparse vectors together. I'm not sure whether this answers your question.
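The dense-accumulate / sparse-ship pattern described here can be sketched in a few lines: sum sparse points into a dense accumulator (never adding two sparse vectors), then drop the zeros before transport. This is a minimal Python illustration of the pattern, not MLlib's actual implementation; the `(index, value)` pair representation is an assumption for the sketch.

```python
import numpy as np

def aggregate_center(sparse_points, dim):
    """Sum sparse points, each given as a list of (index, value) pairs,
    into a dense accumulator, take the mean, then 'compress' by dropping
    zeros for transport back to the driver. Sketch only."""
    dense = np.zeros(dim)
    n = 0
    for point in sparse_points:
        for i, v in point:
            dense[i] += v           # dense += sparse: cheap, O(nnz)
        n += 1
    center = dense / max(n, 1)      # mean of the assigned points
    compressed = [(i, x) for i, x in enumerate(center) if x != 0.0]
    return center, compressed
```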
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70583162

@mengxr One more thing regarding sparse vectors. Sparse vectors can become dense under cluster creation, which, in turn, can cause the running time of the K-means clustering to skyrocket. To address this problem, one can project clusters onto a sparse vector before performing the distance calculation. My current version of the clusterer does this when the appropriate distance object is selected.

On Sun, Jan 18, 2015 at 7:59 PM, Derrick Burns derrickrbu...@gmail.com wrote: @mengxr I have implemented several variants of Kullback-Leibler divergence in my separate GitHub repository https://github.com/derrickburns/generalized-kmeans-clustering. These variants are more efficient than the standard KL divergence, which is defined on R+^n, because they take advantage of extra knowledge of the domain. I have used these variants with much success (i.e., much faster running time) in my large-scale clustering runs.
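The "project the cluster onto the point" trick above has a standard form for squared Euclidean distance: expand ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2 and cache ||c||^2, so only the point's nonzero coordinates are ever touched. A minimal Python sketch of that identity follows (illustrative; the PR's FastEuclideanOps caches a comparable norm in Scala):

```python
def fast_sq_dist(x_indices, x_values, center, center_sq_norm):
    """Squared Euclidean distance between a sparse point (parallel index and
    value lists) and a dense center, using the cached squared norm of the
    center. Cost is O(nnz(x)) rather than O(dim), because the expansion
    ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2 only needs the point's
    nonzero entries. Sketch only, not the PR's code."""
    x_sq = 0.0
    dot = 0.0
    for i, v in zip(x_indices, x_values):
        x_sq += v * v
        dot += v * center[i]
    return x_sq - 2.0 * dot + center_sq_norm
```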
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70589276

In my application (n-gram contexts), the sparse vectors can be of extremely high dimension. To make the problem manageable, I select the k most important dimensions per point. For a cluster of m points, I can have a sparse vector of m*k non-zero values. Since some clusters can become quite large (O(n) in size), I can get sparse vectors of O(nk) non-zero values. Still, the vector is sparse, since the dimension of the vector is potentially 2^31-1, so I cannot treat the vector as dense. To deal with the growth problem, I have implemented a centroid that only retains the k most important features.

Sent from my iPhone
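A centroid that "retains only the k most important features" can be sketched as a top-k truncation of the sparse accumulator. The version below uses largest absolute value as the importance criterion; that criterion and the dict representation are assumptions for illustration, since the comment does not specify how importance is measured.

```python
import heapq

def truncate_top_k(sparse_vec, k):
    """Keep only the k entries of largest absolute value from a sparse
    vector given as an {index: value} dict, bounding centroid growth at
    k non-zeros regardless of cluster size. Illustrative sketch."""
    if len(sparse_vec) <= k:
        return dict(sparse_vec)
    top = heapq.nlargest(k, sparse_vec.items(), key=lambda kv: abs(kv[1]))
    return dict(top)
```

Applying this after each centroid update keeps distance computations O(k) per center instead of O(nk).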
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23198417

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

For the squared Euclidean distance case, weights would naturally start at 1.0 for each input point, so that a cluster with 100 points would have weight 100.0. However, for the KL divergence, where the points are derived from frequencies, the initial weight could be the sum of the frequencies. This puts the points on the simplex without having to perform the division.

Sent from my iPhone
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23197318

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
+package org.apache.spark.mllib
+
+import breeze.linalg.{ Vector => BV }
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+package object base {
--- End diff --

1) modifications to BLAS to support arbitrary functions and axpy on SparseVectors

We already support axpy on SparseVectors in `mllib.linalg.BLAS`. What do you mean by `arbitrary functions`? If the target is to improve k-means, we only need to implement the BLAS ops required by k-means.

2) definition of PointOps trait and various implementations for different divergences
3) rewrite of clusterer to use PointOps trait but only using the squared Euclidean distance implementation
4) generalization of clustering interface to select different divergences

Sounds good.

5) store seeds

This is already covered in https://github.com/apache/spark/pull/3610, which is almost ready to merge.

6) maintain exactly K clusters by splitting clusters heuristically

We had an example in `StreamingKMeans`. But if we want to support different distances, we need to consider how to keep the new centers inside the domain. Thanks for improving your implementation! Before we merge features one by one, it would be nice to maintain your implementation as a 3rd-party Spark package (spark-packages.org). That helps people find it and send feedback.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23197316

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala ---
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

Could you elaborate more on the benefit of using homogeneous coordinates? It should at least be documented what weight stands for here. If a cluster has 100 points, do you assign its center a weight of 100? If so, how do you compute the distance between this center and an input point?
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r23198020 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala --- @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering.metrics + +import breeze.linalg.{ DenseVector = BDV, SparseVector = BSV, Vector = BV } + +import org.apache.spark.mllib.base._ +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector } +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero } + +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) { + val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight) --- End diff -- Simply put, computing the distance (and the cached values per point and centered used in the distance) is faster in homogeneous coordinates. Imagine computing dot(x,y), where x and y are vectors. 
If x and y are in homogeneous coordinates, it is faster to take the dot product in homogeneous coordinates and then divide by the product of the weights of the two vectors than it is to divide each homogeneous coordinate by its weight first and then take the dot product. Sent from my iPhone On Jan 19, 2015, at 5:02 PM, Xiangrui Meng wrote: In mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala: Could you elaborate more on the benefit of using homogeneous coordinates? It should at least be documented what weight stands for here. If a cluster has 100 points, do you assign its center a weight of 100? If so, how do you compute the distance between this center and an input point? Reply to this email directly or view it on GitHub.
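The cost argument above can be sketched in a few lines of plain Scala. The helper names and array representation are illustrative, not the PR's FPoint classes: dividing out the weights first costs a division per coordinate per vector, while dotting the raw homogeneous coordinates needs only one division at the end.

```scala
object HomogeneousDot {
  // Divide each coordinate by its vector's weight first, then dot:
  // 2*d divisions for d-dimensional vectors.
  def dotInhomogeneous(x: Array[Double], wx: Double,
                       y: Array[Double], wy: Double): Double =
    x.map(_ / wx).zip(y.map(_ / wy)).map { case (a, b) => a * b }.sum

  // Same result: dot the raw homogeneous coordinates and divide once
  // by the product of the weights.
  def dotHomogeneous(x: Array[Double], wx: Double,
                     y: Array[Double], wy: Double): Double =
    x.zip(y).map { case (a, b) => a * b }.sum / (wx * wy)
}
```

Both compute the dot product of the inhomogeneous points; only the division count differs.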
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70443890 @mengxr I have implemented several variants of the Kullback-Leibler divergence in my separate GitHub repository, https://github.com/derrickburns/generalized-kmeans-clustering. These variants are more efficient than the standard KL-divergence, which is defined on R^n_+, because they take advantage of extra knowledge of the domain. I have used these variants with much success (i.e., much faster running times) in my large-scale clustering runs.
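For reference, the "standard" divergence being contrasted here is, in its generalized I-divergence form (which reduces to KL when both vectors sum to one), straightforward to write down; the domain-specialized variants in the linked repository are not shown and may differ:

```scala
object Divergence {
  // Generalized I-divergence D(p || q) = sum_i (p_i * ln(p_i / q_i) - p_i + q_i)
  // for strictly positive vectors; equals KL-divergence when p and q
  // are probability distributions. Requires all entries > 0 (R^n_+).
  def generalizedI(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (pi, qi) => pi * math.log(pi / qi) - pi + qi }.sum
}
```

Note the positivity requirement, which is what makes sparse (zero-containing) vectors problematic without extra handling.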
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70394597 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25711/consoleFull) for PR 2634 at commit [`35da8e9`](https://github.com/apache/spark/commit/35da8e9e188e66946d5799d061ecc3ca150f). * This patch **fails** unit tests. * This patch **does not** merge cleanly!
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70394598 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25711/ Test FAILed.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70392876 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25711/consoleFull) for PR 2634 at commit [`35da8e9`](https://github.com/apache/spark/commit/35da8e9e188e66946d5799d061ecc3ca150f). * This patch **does not** merge cleanly!
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-70345238 I see the problem with sparse vectors and the KL divergence. I implemented a smoothing operation to approximate the KL divergence. Sent from my iPhone
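One common way to realize such a smoothing operation is additive (Laplace) smoothing, sketched below; this is an illustrative choice, not necessarily the exact operation implemented in the PR author's repository:

```scala
object SmoothedKL {
  // Shift every coordinate by a small epsilon so that the zero entries
  // of a sparse vector fall inside the KL domain R^n_+, then renormalize.
  def smooth(v: Array[Double], eps: Double = 1e-9): Array[Double] = {
    val shifted = v.map(_ + eps)
    val z = shifted.sum
    shifted.map(_ / z)
  }

  // KL-divergence of two strictly positive probability vectors.
  def kl(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (pi, qi) => pi * math.log(pi / qi) }.sum
}
```

After smoothing, KL between two formerly sparse vectors is finite, at the cost of a small approximation error controlled by epsilon.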
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22768878 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala --- @@ -0,0 +1,145 @@ +package org.apache.spark.mllib + +import breeze.linalg.{ Vector => BV } + +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.rdd.RDD + +package object base { --- End diff -- I'm willing to take one last crack at contributing the Bregman modifications to Spark. Here is a layering of the pull requests: 1) modifications to BLAS to support arbitrary functions and axpy on SparseVectors; 2) definition of a PointOps trait and various implementations for different divergences; 3) a rewrite of the clusterer to use the PointOps trait, but only with the squared Euclidean distance implementation; 4) generalization of the clustering interface to select different divergences; 5) storing seeds; 6) maintaining exactly K clusters by splitting clusters heuristically. I can submit PRs for 1-4. Thoughts?
Sent from my iPhone
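A rough sketch of what the PointOps abstraction from step 2 could look like, with an instance for step 3's squared Euclidean case. The member names here are illustrative; the PR's actual trait is larger:

```scala
// A family of distance operations parameterized over point and center
// representations, so the clusterer never hard-codes a divergence.
trait PointOps[P, C] extends Serializable {
  def distance(p: P, c: C): Double
  def toCenter(p: P): C
  def centerMoved(oldC: C, newC: C, epsilon: Double): Boolean
}

// Squared Euclidean instance over plain arrays.
object SquaredEuclideanOps extends PointOps[Array[Double], Array[Double]] {
  def distance(p: Array[Double], c: Array[Double]): Double =
    p.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum
  def toCenter(p: Array[Double]): Array[Double] = p.clone()
  def centerMoved(oldC: Array[Double], newC: Array[Double], epsilon: Double): Boolean =
    distance(oldC, newC) > epsilon * epsilon
}
```

Step 4 would then let callers select a different PointOps instance to get a different divergence without touching the clusterer's iteration logic.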
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22774169 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala --- @@ -0,0 +1,145 @@ +package org.apache.spark.mllib + +import breeze.linalg.{ Vector => BV } + +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.rdd.RDD + +package object base { --- End diff -- @mengxr I tried to build the Spark master branch, but was unsuccessful. So, instead of integrating my Bregman clusterer into the latest master branch of Spark, I integrated it into my public GitHub project, https://github.com/derrickburns/generalized-kmeans-clustering. Per your suggestion, this version supports sparse vectors. Included is support for several Bregman divergences, including squared Euclidean distance, Kullback-Leibler divergence, logistic loss, and generalized I-divergence. I created a general BregmanDivergence trait and a separate BregmanPointOps trait.
When mixed in together, one gets a specific BregmanPointOps object that computes Bregman divergences very efficiently, caching one value per point and two values per center (per your observation). This version is easier to understand than the previous one. Also included are three different implementations of K-Means clustering: 1) one that caches distances (CachingKMeans); 2) one that tracks which clusters moved, which points are assigned to which clusters, and the distance to the closest cluster (TrackingKMeans); and 3) one that recomputes all clusters and distances (SingleKMeans). I have found that the TrackingKMeans clusterer has the best performance on large data sets. I have also found that maintaining cluster centers in homogeneous coordinates is more efficient than using inhomogeneous coordinates for moderate- to high-dimensional data. There is no penalty for low-dimensional data. If/when you include such changes in Spark, I would appreciate being included in the list of contributors. Regards, Derrick Burns
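The trait layering described above can be sketched as follows. This is an illustrative reconstruction with hypothetical names, not the repository's actual API: a BregmanDivergence supplies the convex function f and its gradient, and BregmanPointOps derives a distance from them via the standard Bregman formula d(x, y) = f(x) - f(y) - <x - y, grad f(y)>:

```scala
trait BregmanDivergence {
  def f(v: Array[Double]): Double
  def gradF(v: Array[Double]): Array[Double]
}

trait BregmanPointOps extends BregmanDivergence {
  // d(x, y) = f(x) - f(y) - <x - y, grad f(y)>
  def divergence(x: Array[Double], y: Array[Double]): Double = {
    val g = gradF(y)
    f(x) - f(y) -
      x.zip(y).zip(g).map { case ((xi, yi), gi) => (xi - yi) * gi }.sum
  }
}

// Mixing in f(v) = ||v||^2 recovers the squared Euclidean distance.
object SquaredEuclidean extends BregmanPointOps {
  def f(v: Array[Double]): Double = v.map(x => x * x).sum
  def gradF(v: Array[Double]): Array[Double] = v.map(2 * _)
}
```

Each concrete divergence is then just a choice of f, and the caching of per-point and per-center values can live in the shared trait.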
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22764827 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala --- @@ -0,0 +1,77 @@ +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) { + val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight) --- End diff -- It is most efficient to maintain the cluster centers in homogeneous coordinates.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-69404412 Thanks for the information on the speedup that you obtained by eliminating Breeze. I was unaware that its performance is so poor. To what do you attribute the poor performance of Breeze? One can certainly split out 1-4 AFTER doing 5. My point was that doing 5 (supporting Bregman divergences) requires touching practically every method in the clustering package. 3. Supporting a variable number of clusters requires a significant rewrite as well, since the assumption that there are K clusters is made repeatedly in the code. It seems like adding a way to split clusters introduces a whole other dimension of flexibility -- which I am not against -- that would complicate the interface. For example, what heuristic would you use to identify a cluster to split, and how would you split it? It seems to me that once you go down that path, you are headed toward a divisive clustering algorithm. 4. As for the overhead of running k-means++ distributively, will someone use Spark for small k and feature dimension? 5. Bregman divergences: As it turns out, I now have a need to perform clustering on sparse data, so I am making a pass over my private implementation to 1) eliminate the use of Breeze and 2) use the Spark Vector classes. When I am done, I can include these modifications in an updated pull request. I do not understand your conclusion that for KL-divergence and generalized I-divergence the points cannot be sparse. The domain need not be R^d_+; rather, the domain can be R^∞_+. For example, in the problem that I am currently working on, I am creating sparse vectors where the vectors represent the frequency of occurrence of the contexts for given n-grams. The space of contexts is extremely large; however, I am only interested in the M most frequent contexts per n-gram. I use a sketch to identify the M most frequent contexts per n-gram.
This gives me an M-length SparseVector of non-negative frequency values. Thus, for each n-gram, I will have a SparseVector that represents the distribution of the more frequently occurring contexts. I do not see a problem using this SparseVector with the KL-divergence metric. Am I missing something?
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693139 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala --- @@ -0,0 +1,77 @@ +package org.apache.spark.mllib.clustering.metrics + +import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV } + +import org.apache.spark.mllib.base._ +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector } +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero } + +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) { + val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight) +} + +/** + * Euclidean distance measure, expedited by pre-computing vector norms + */ +class FastEuclideanOps extends PointOps[FastEUPoint, FastEUPoint] with Serializable { + + type C = FastEUPoint + type P = FastEUPoint + + val epsilon = 1e-4 + + /* compute a lower bound on the euclidean distance */ + + def distance(p: P, c: C, upperBound: Double): Double = { + val d = if (p.weight == Zero || c.weight == Zero) { + p.norm + c.norm + } else { + val x = p.raw.dot(c.raw) / (p.weight * c.weight) --- End diff -- same question about using `weight` in `distance`
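The cached-norm trick in the diff above can be sketched without the PR's class hierarchy. With norm(v) = <raw, raw> / w^2 precomputed once per point and per center, each squared-distance evaluation needs only one fresh dot product, via the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b>. Field names are illustrative:

```scala
// A point in homogeneous coordinates: raw = weight * inhomogeneous point.
final case class HPoint(raw: Array[Double], weight: Double) {
  // Squared norm of the inhomogeneous point, cached at construction.
  val norm: Double =
    if (weight == 0.0) 0.0
    else raw.map(x => x * x).sum / (weight * weight)
}

object FastDistance {
  def squared(p: HPoint, c: HPoint): Double = {
    val cross =
      p.raw.zip(c.raw).map { case (a, b) => a * b }.sum / (p.weight * c.weight)
    // Clamp tiny negative values caused by floating-point cancellation.
    math.max(p.norm + c.norm - 2.0 * cross, 0.0)
  }
}
```

Note that two HPoints with proportional raw vectors and matching weights represent the same inhomogeneous point, so the distance is unchanged.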
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693112 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala --- @@ -0,0 +1,77 @@ +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) { + val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight) --- End diff -- Should `weight` be only used in aggregation rather than distance computation?
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-69272196 @derrickburns I like the improvements implemented in this PR. But as @srowen mentioned, we have to resolve conflicts with the master branch before we can merge any PR. I compared the performance of this PR with master on mnist-digits (6x784, sparse, 10 clusters) locally and found that master runs 2-3x faster. I guess this is mainly caused by two changes. 1. We replaced breeze operations by our own implementation. The latter is about 2-3x faster. 1. Running k-means++ distributively has noticeable overhead with small k and feature dimension. I think it is still feasible to include features through separate PRs: 1. remember previously computed best distances in k-means++ initialization 1. allow fixing the random seed (addressed in #3610) 1. variable number of clusters. We should discuss whether we want to have less than k clusters or split the biggest one if there are more than k points. 1. parallelize k-means++. I think whether we should replace local k-means++ or make it configurable requires some discussion and performance comparison. 1. support Bregman divergences Putting all of them together would certainly delay the review process and require resolving conflicts. I may have some time to prepare PRs for some of the features here, if you don't mind. For Bregman divergences, I'm thinking we can alter the formulation to support sparse vectors: ~~~ d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)> ~~~ where `f(x)`, `g(y)`, and `f(y) - <y, g(y)>` could be pre-computed and cached, and `<x, g(y)>` can take advantage of sparse `x`. But I'm not sure whether this formulation is really useful for any Bregman divergence other than the squared distance and the Mahalanobis distance. For KL-divergence and generalized I-divergence, the domain is R^d_+ and hence the points cannot be sparse.
Besides those comments, I'm going to make some minor comments inline.
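The caching scheme proposed above can be sketched as follows, shown for f = squared norm (names are illustrative). Per point, cache f(x); per center, cache the gradient g(y) and the scalar f(y) - <y, g(y)>. Then d(x, y) = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>, and only the final inner product touches x, so it can iterate over a sparse x's nonzeros:

```scala
object CachedBregman {
  // f(v) = ||v||^2 and its gradient g(v) = 2v, for illustration.
  def f(v: Array[Double]): Double = v.map(x => x * x).sum
  def g(v: Array[Double]): Array[Double] = v.map(2 * _)

  // Values cached once per center.
  final case class CachedCenter(y: Array[Double]) {
    val grad: Array[Double] = g(y)
    val offset: Double = f(y) - y.zip(grad).map { case (a, b) => a * b }.sum
  }

  // x is given as (index, value) pairs, as a sparse vector would be;
  // fx = f(x) is cached once per point.
  def distance(xNonzeros: Seq[(Int, Double)], fx: Double, c: CachedCenter): Double =
    fx - c.offset - xNonzeros.map { case (i, v) => v * c.grad(i) }.sum
}
```

With this f, the formula reproduces the squared Euclidean distance while evaluating only the nonzero coordinates of x per center.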
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693062 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala --- @@ -0,0 +1,145 @@ +package org.apache.spark.mllib --- End diff -- Should it be `mllib.clustering` as the file is under `clustering/`?
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r22693266

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
[Apache license header elided; identical to the one quoted above]
+
+package org.apache.spark.mllib
+
+import breeze.linalg.{Vector => BV}
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+package object base {

--- End diff --

Though private, having some doc would make the code easier to understand. For example:
1. What does `FP` mean?
2. `inh`?
3. ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68990204 (@derrickburns this PR can't be merged as-is. It contains merge conflicts with master now, as GitHub notes here. You would need to rebase it.)
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68985129 The pull request that you integrated on December 3 is redundant with this one. Therefore you need not worry about merging in those changes. Simply select this version. Are you willing to integrate this pull request as is?
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68804845 I've said this before, so please forgive me for being repetitive. The new implementation is a rewrite, not a patch, so it is not possible to parcel out the changes into distinct PRs.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68781838 @derrickburns I tried to merge this PR with the current master. But there are many conflicts, mainly because we removed breeze from the implementation. To merge the changes, I think we should make separate PRs for the JIRAs mentioned here, from fixes to new features. Do you mind me splitting this PR? Thanks!
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68409940 @derrickburns I'm going to check the implementation during the break. Since it is hard to locate the serialization issue, do you mind me making changes to your implementation and sending you a pull request afterward?
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68414436 No problem. Thx!
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68425655 Thx for taking this on.
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68218597 That would be great!
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-68190043 @mengxr Now that 1.2.0 is out, can we schedule a rough timeframe for reviewing this patch?
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-60351026 @derrickburns The features are useful, so please don't delete the PR. Since this is a major refactor of `KMeans`, I need to allocate a block of time to review the code and run performance tests. But I'm running out of bandwidth now. Do you mind me doing the review after the v1.2 feature freeze? Btw, it would be really helpful if you can post some performance testing results against the current implementation in MLlib, for large-scale sparse and dense data.
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-60336780 @mengxr Is there interest in this pull request? Should I delete it?
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58433653 @derrickburns Is it possible to split this PR into small ones, so we can track down the closure serialization issue? Another question I have is on convergence with other distance measures, as discussed in #1964. If we use the weighted mean for new centers, is convergence still guaranteed for distances other than L2?
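mengxr's convergence question can be made concrete in one dimension: the weighted mean is the exact minimizer of the squared-Euclidean cost, which is what makes Lloyd's iteration monotone under L2, but under other distances a different center can beat the mean, so the descent argument no longer applies. Below is a minimal, self-contained sketch with toy data and illustrative cost functions; it is not code from this PR.

```scala
// Toy demonstration: the mean minimizes squared-L2 cost, but under L1 cost
// the median of {0, 0, 0, 10} is strictly better than the mean.
object CenterUpdateDemo {
  val points: Seq[Double] = Seq(0.0, 0.0, 0.0, 10.0)

  def l2Cost(c: Double): Double = points.map(p => (p - c) * (p - c)).sum
  def l1Cost(c: Double): Double = points.map(p => math.abs(p - c)).sum

  val mean: Double = points.sum / points.size                // 2.5
  val median: Double = points.sorted.apply(points.size / 2)  // 0.0

  def main(args: Array[String]): Unit = {
    assert(l2Cost(mean) < l2Cost(median))   // mean wins under squared-L2
    assert(l1Cost(median) < l1Cost(mean))   // median wins under L1
  }
}
```

For squared-L2 this is exactly why the weighted-mean update decreases the cost every iteration; for another divergence the corresponding minimizer (for example, the coordinate-wise median under L1) would be needed to keep the same guarantee.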
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58449199 The closure data capture problem occurs at MultiKMeans.scala:105.
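For context on this kind of bug: a common source of Spark closure-serialization failures is a function literal that reads a field of its enclosing class, which silently captures the whole (possibly non-serializable) instance. The sketch below reproduces the pattern outside Spark using plain Java serialization; it is illustrative only and is not the code at MultiKMeans.scala:105.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A class that is deliberately NOT Serializable, standing in for an
// algorithm object whose method builds closures shipped to executors.
class Algorithm {
  val epsilon = 1e-4

  // Reads the field through `this`, so the closure captures the whole
  // Algorithm instance and cannot be serialized.
  def leakyClosure: Double => Boolean = d => d < epsilon

  // Copying the field to a local val first means the closure captures
  // only a Double, so it serializes cleanly.
  def safeClosure: Double => Boolean = {
    val eps = epsilon
    d => d < eps
  }
}

object ClosureDemo {
  // Returns true iff `obj` survives Java serialization.
  def serializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: java.io.NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val a = new Algorithm
    assert(!serializable(a.leakyClosure)) // drags in the Algorithm instance
    assert(serializable(a.safeClosure))   // captures only a local Double
  }
}
```

The usual fix, as sketched in `safeClosure`, is to copy every needed field into a local val before the RDD closure is built.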
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58104576 @derrickburns The style test doesn't catch everything, unfortunately. The Spark Code Style Guide is the first place to check. I will mark a few examples inline. I think the best way to track down the problem is to split this PR into small ones and see whether we can find it. Is there a good way to split this? Thanks!
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488327

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansParallel.scala ---
@@ -0,0 +1,153 @@
[Apache license header elided; identical to the one quoted above]
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.mllib.base.{PointOps, FP, Zero}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+import scala.collection.mutable.ArrayBuffer

--- End diff --

move scala import before spark imports
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488318

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,429 +17,57 @@ package org.apache.spark.mllib.clustering
-import scala.collection.mutable.ArrayBuffer
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.random.XORShiftRandom
-
-/**
- * K-means clustering with support for multiple parallel runs and a k-means++ like initialization
- * mode (the k-means|| algorithm by Bahmani et al). When multiple concurrent runs are requested,
- * they are executed together with joint passes over the data for efficiency.
- *
- * This is an iterative algorithm that will make multiple passes over the data, so any RDDs given
- * to it should be cached by the user.
- */
-class KMeans private (
-    private var k: Int,
-    private var maxIterations: Int,
-    private var runs: Int,
-    private var initializationMode: String,
-    private var initializationSteps: Int,
-    private var epsilon: Double) extends Serializable with Logging {
-
-  /**
-   * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1,
-   * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4}.
-   */
-  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4)
-
-  /** Set the number of clusters to create (k). Default: 2. */
-  def setK(k: Int): this.type = {
-    this.k = k
-    this
-  }
-
-  /** Set maximum number of iterations to run. Default: 20. */
-  def setMaxIterations(maxIterations: Int): this.type = {
-    this.maxIterations = maxIterations
-    this
-  }
-
-  /**
-   * Set the initialization algorithm. This can be either "random" to choose random points as
-   * initial cluster centers, or "k-means||" to use a parallel variant of k-means++
-   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
-   */
-  def setInitializationMode(initializationMode: String): this.type = {
-    if (initializationMode != KMeans.RANDOM && initializationMode != KMeans.K_MEANS_PARALLEL) {
-      throw new IllegalArgumentException("Invalid initialization mode: " + initializationMode)
-    }
-    this.initializationMode = initializationMode
-    this
-  }
-
-  /**
-   * :: Experimental ::
-   * Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm
-   * this many times with random starting conditions (configured by the initialization mode), then
-   * return the best clustering found over any run. Default: 1.
-   */
-  @Experimental
-  def setRuns(runs: Int): this.type = {
-    if (runs <= 0) {
-      throw new IllegalArgumentException("Number of runs must be positive")
-    }
-    this.runs = runs
-    this
-  }
-
-  /**
-   * Set the number of steps for the k-means|| initialization mode. This is an advanced
-   * setting -- the default of 5 is almost always enough. Default: 5.
-   */
-  def setInitializationSteps(initializationSteps: Int): this.type = {
-    if (initializationSteps <= 0) {
-      throw new IllegalArgumentException("Number of initialization steps must be positive")
-    }
-    this.initializationSteps = initializationSteps
-    this
-  }
-
-  /**
-   * Set the distance threshold within which we consider centers to have converged.
-   * If all centers move less than this Euclidean distance, we stop iterating one run.
-   */
-  def setEpsilon(epsilon: Double): this.type = {
-    this.epsilon = epsilon
-    this
-  }
-
-  /** Whether a warning should be logged if the input RDD is uncached. */
-  private var warnOnUncachedInput = true
-
-  /** Disable warnings about uncached input. */
-  private[spark] def disableUncachedWarning(): this.type = {
-    warnOnUncachedInput = false
-    this
-  }
-
-  /**
-   * Train a K-means model on the given set of points; `data` should be cached for high
-   * performance, because this is an iterative algorithm.
-   */
-  def run(data: RDD[Vector]): KMeansModel = {
-
-    if (warnOnUncachedInput
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488296

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GeneralizedKMeansModel.scala ---
@@ -0,0 +1,59 @@
[Apache license header elided; identical to the one quoted above]
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.base.{PointOps, FP}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for K-means. Each point belongs to the cluster with the closest center.
+ */
+private[mllib] class GeneralizedKMeansModel[P: FP, C: FP](
+    val pointOps: PointOps[P, C],
+    val centers: Array[C])
+  extends Serializable {

--- End diff --

merge this line to 31
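For readers following the diffs, the `PointOps[P, C]` abstraction referenced above pairs a distance function with conversions between a point representation `P` and a center representation `C`. The following is a hedged sketch of that shape only: the trait name mirrors the PR's usage, but the member names and signatures here are assumptions for illustration, not the PR's actual API.

```scala
// Sketch of a PointOps-style type class: a pluggable distance plus
// conversions between point and center representations.
trait PointOps[P, C] extends Serializable {
  def distance(p: P, c: C): Double
  def pointToCenter(p: P): C
  def centerToPoint(c: C): P
}

// A minimal Euclidean instance over Array[Double], for illustration only.
object EuclideanOps extends PointOps[Array[Double], Array[Double]] {
  def distance(p: Array[Double], c: Array[Double]): Double =
    math.sqrt(p.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum)
  def pointToCenter(p: Array[Double]): Array[Double] = p.clone()
  def centerToPoint(c: Array[Double]): Array[Double] = c.clone()
}

object PointOpsDemo {
  def main(args: Array[String]): Unit = {
    // 3-4-5 triangle: Euclidean distance from (0,0) to (3,4) is 5.
    assert(EuclideanOps.distance(Array(0.0, 0.0), Array(3.0, 4.0)) == 5.0)
  }
}
```

Parameterizing the model over such a trait is what lets one clustering implementation serve multiple distance functions, which is the generalization this PR is about.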
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488309

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- (quotes the same KMeans.scala diff as discussion r18488318 above)
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488331

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansPlusPlus.scala ---
@@ -0,0 +1,200 @@
[Apache license header elided; identical to the one quoted above]
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.mllib.base.{PointOps, FP, Infinity, One, Zero}
+import org.apache.spark.util.random.XORShiftRandom
+import org.apache.spark.{Logging, SparkContext}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/**
+ *
+ * The KMeans++ initialization algorithm
+ *
+ * @param pointOps distance function
+ * @tparam P point type
+ * @tparam C center type
+ */
+private[mllib] class KMeansPlusPlus[P : FP : ClassTag, C : FP : ClassTag](
+    pointOps: PointOps[P, C]) extends Serializable with Logging {
+
+  /**
+   * We will maintain for each point the distance to its closest cluster center.
+   * Since only one center is added on each iteration, recomputing the closest cluster center
+   * only requires computing the distance to the new cluster center if
+   * that distance is less than the current closest-center distance.
+   */
+  case class FatPoint(location: P, index: Int, weight: Double, distance: Double)
+
+  /**
+   * K-means++ on the weighted point set `points`. This first does the K-means++
+   * initialization procedure and then runs rounds of Lloyd's algorithm.
+   */
+
+  def cluster(
+    sc: SparkContext,

--- End diff --

4-space indentation
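The `FatPoint` doc comment above describes a standard k-means++ optimization: because centers are added one at a time, each point's cached distance to its closest center can be refreshed with a single `min()` against the distance to the new center, instead of rescanning all centers. A small self-contained sketch of that update rule, with illustrative names rather than the PR's code:

```scala
// One-dimensional toy version of the cached closest-distance update used
// during k-means++ seeding.
object ClosestDistanceDemo {
  def dist(p: Double, c: Double): Double = math.abs(p - c)

  // Fold one new center into the cached closest distances: each point only
  // compares its cached value against the distance to the new center.
  def addCenter(points: Seq[Double], cached: Seq[Double], newCenter: Double): Seq[Double] =
    points.zip(cached).map { case (p, d) => math.min(d, dist(p, newCenter)) }

  def main(args: Array[String]): Unit = {
    val points = Seq(1.0, 5.0, 9.0)
    val afterFirst = points.map(p => dist(p, 0.0))       // center 0.0 -> 1, 5, 9
    val afterSecond = addCenter(points, afterFirst, 8.0) // min with |p - 8|
    assert(afterSecond == Seq(1.0, 3.0, 1.0))
  }
}
```

Each seeding step is thus O(n) in the number of points rather than O(n * k) over all centers chosen so far.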
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488320 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -17,429 +17,57 @@ package org.apache.spark.mllib.clustering -import scala.collection.mutable.ArrayBuffer -import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm} - -import org.apache.spark.annotation.Experimental -import org.apache.spark.Logging -import org.apache.spark.SparkContext._ -import org.apache.spark.mllib.linalg.{Vector, Vectors} -import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.base.{FP, PointOps} +import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps import org.apache.spark.rdd.RDD -import org.apache.spark.storage.StorageLevel -import org.apache.spark.util.random.XORShiftRandom - -/** - * K-means clustering with support for multiple parallel runs and a k-means++ like initialization - * mode (the k-means|| algorithm by Bahmani et al). When multiple concurrent runs are requested, - * they are executed together with joint passes over the data for efficiency. - * - * This is an iterative algorithm that will make multiple passes over the data, so any RDDs given - * to it should be cached by the user. - */ -class KMeans private ( -private var k: Int, -private var maxIterations: Int, -private var runs: Int, -private var initializationMode: String, -private var initializationSteps: Int, -private var epsilon: Double) extends Serializable with Logging { - - /** - * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, - * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4}. - */ - def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4) - - /** Set the number of clusters to create (k). Default: 2. */ - def setK(k: Int): this.type = { -this.k = k -this - } - - /** Set maximum number of iterations to run. Default: 20. */ - def setMaxIterations(maxIterations: Int): this.type = { -this.maxIterations = maxIterations -this - } - - /** - * Set the initialization algorithm. This can be either "random" to choose random points as - * initial cluster centers, or "k-means||" to use a parallel variant of k-means++ - * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||". - */ - def setInitializationMode(initializationMode: String): this.type = { -if (initializationMode != KMeans.RANDOM && initializationMode != KMeans.K_MEANS_PARALLEL) { - throw new IllegalArgumentException("Invalid initialization mode: " + initializationMode) -} -this.initializationMode = initializationMode -this - } - - /** - * :: Experimental :: - * Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm - * this many times with random starting conditions (configured by the initialization mode), then - * return the best clustering found over any run. Default: 1. - */ - @Experimental - def setRuns(runs: Int): this.type = { -if (runs <= 0) { - throw new IllegalArgumentException("Number of runs must be positive") -} -this.runs = runs -this - } - - /** - * Set the number of steps for the k-means|| initialization mode. This is an advanced - * setting -- the default of 5 is almost always enough. Default: 5. - */ - def setInitializationSteps(initializationSteps: Int): this.type = { -if (initializationSteps <= 0) { - throw new IllegalArgumentException("Number of initialization steps must be positive") -} -this.initializationSteps = initializationSteps -this - } - - /** - * Set the distance threshold within which we've consider centers to have converged. - * If all centers move less than this Euclidean distance, we stop iterating one run. - */ - def setEpsilon(epsilon: Double): this.type = { -this.epsilon = epsilon -this - } - - /** Whether a warning should be logged if the input RDD is uncached. */ - private var warnOnUncachedInput = true - - /** Disable warnings about uncached input.
*/ - private[spark] def disableUncachedWarning(): this.type = { -warnOnUncachedInput = false -this - } - - /** - * Train a K-means model on the given set of points; `data` should be cached for high - * performance, because this is an iterative algorithm. - */ - def run(data: RDD[Vector]): KMeansModel = { - -if (warnOnUncachedInput
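The removed class above drives Lloyd's algorithm with an epsilon convergence test: a run stops once every center moves less than `epsilon`. As an illustration only (plain Python tuples standing in for Spark RDDs, and `lloyd` a hypothetical name, not part of this PR), the loop looks roughly like this:

```python
import math

def lloyd(points, centers, max_iterations=20, epsilon=1e-4):
    """One run of Lloyd's algorithm: assign points to the nearest center,
    recompute centers, and stop when every center moves less than epsilon."""
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        moved = 0.0
        new_centers = []
        for old, members in zip(centers, clusters):
            if not members:          # keep an empty cluster's center in place
                new_centers.append(old)
                continue
            mean = tuple(sum(xs) / len(members) for xs in zip(*members))
            moved = max(moved, math.dist(old, mean))
            new_centers.append(mean)
        centers = new_centers
        if moved < epsilon:          # the epsilon convergence test from above
            break
    return centers
```

The `runs` parameter in the quoted class repeats this loop with different starting centers and keeps the best result.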
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488293 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GeneralizedKMeansModel.scala --- @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.SparkContext._ +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.base.{PointOps, FP} +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.rdd.RDD + +/** + * A clustering model for K-means. Each point belongs to the cluster with the closest center. + */ +private[mllib] class GeneralizedKMeansModel[P:FP, C:FP]( + val pointOps: PointOps[P, C], --- End diff -- 4-space indentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488325 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala --- @@ -25,37 +25,28 @@ import org.apache.spark.mllib.linalg.Vector /** * A clustering model for K-means. Each point belongs to the cluster with the closest center. */ -class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable { - /** Total number of clusters. */ - def k: Int = clusterCenters.length + +class KMeansModel(specific: GeneralizedKMeansModel[_,_]) { --- End diff -- space after `,`
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488297 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GeneralizedKMeansModel.scala --- @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.SparkContext._ +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.base.{PointOps, FP} +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.rdd.RDD + +/** + * A clustering model for K-means. Each point belongs to the cluster with the closest center. + */ +private[mllib] class GeneralizedKMeansModel[P:FP, C:FP]( + val pointOps: PointOps[P, C], + val centers: Array[C]) + extends Serializable { + + val k: Int = clusterCenters.length + + def clusterCenters : Array[Vector] = centers.map{ c => pointOps.centerToVector(c) } --- End diff -- use `.map(c => ...)`
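For context, the `GeneralizedKMeansModel` under review factors the distance function out of the model via a `PointOps` type parameter. A rough Python analogue of that design (all names hypothetical, not the PR's API):

```python
def squared_euclidean(p, c):
    """Default distance: squared Euclidean between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

class GeneralizedModel:
    """A k-means model parameterized by a pluggable distance function,
    mirroring the PointOps idea: swap the metric without changing the model."""
    def __init__(self, distance, centers):
        self.distance = distance
        self.centers = centers

    @property
    def k(self):
        """Total number of clusters."""
        return len(self.centers)

    def predict(self, point):
        """Index of the closest center under the configured distance."""
        return min(range(self.k),
                   key=lambda j: self.distance(point, self.centers[j]))

    def compute_cost(self, points):
        """Sum of distances from each point to its closest center."""
        return sum(self.distance(p, self.centers[self.predict(p)])
                   for p in points)
```

A non-Euclidean metric can then be dropped in without touching the model, which is the point of the generalization in this PR.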
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488364 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala --- @@ -255,4 +253,4 @@ class KMeansClusterSuite extends FunSuite with LocalClusterSparkContext { val cost = model.computeCost(points) } } -} +} --- End diff -- newline at the end
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488334 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansPlusPlus.scala --- @@ -0,0 +1,200 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.mllib.base.{PointOps, FP, Infinity, One, Zero} +import org.apache.spark.util.random.XORShiftRandom +import org.apache.spark.{Logging, SparkContext} + +import scala.collection.mutable.ArrayBuffer +import scala.reflect.ClassTag + +/** + * + * The KMeans++ initialization algorithm + * + * @param pointOps distance function + * @tparam P point type + * @tparam C center type + */ +private[mllib] class KMeansPlusPlus[P : FP : ClassTag, C : FP : ClassTag]( + pointOps: PointOps[P, C]) extends Serializable with Logging { + + /** + * We will maintain for each point the distance to its closest cluster center. + * Since only one center is added on each iteration, recomputing the closest cluster center + * only requires computing the distance to the new cluster center if + * that distance is less than the closest cluster center. + */ + case class FatPoint(location: P, index: Int, weight: Double, distance: Double) + + /** + * K-means++ on the weighted point set `points`. This first does the K-means++ + * initialization procedure and then runs rounds of Lloyd's algorithm. + */ + + def cluster( + sc: SparkContext, + seed: Int, + points: Array[C], + weights: Array[Double], + k: Int, + maxIterations: Int, + numPartitions: Int): Array[C] = { +val centers: Array[C] = getCenters(sc, seed, points, weights, k, numPartitions, 1) +val pts = sc.parallelize(points.map(pointOps.centerToPoint)) +new MultiKMeans(pointOps, maxIterations).cluster(pts, Array(centers))._2.centers + } + + /** + * Select centers in rounds. On each round, select 'perRound' centers, with probability of + * selection equal to the product of the given weights and distance to the closest cluster center + * of the previous round. + * + * @param sc the Spark context + * @param seed a random number seed + * @param points the candidate centers + * @param weights the weights on the candidate centers + * @param k the total number of centers to select + * @param numPartitions the number of data partitions to use + * @param perRound the number of centers to add per round + * @return an array of at most k cluster centers + */ + def getCenters(sc: SparkContext, seed: Int, points: Array[C], weights: Array[Double], k: Int, --- End diff -- chop parameters and 4-space indentation
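The `FatPoint` caching idea above — each point remembers the distance to its closest chosen center, and only the newest center can shrink that cache — is the standard k-means++ trick. A sketch in Python (hypothetical names, squared Euclidean distance assumed, no Spark involved):

```python
import random

def kmeans_plus_plus(points, weights, k, seed=0):
    """k-means++ seeding on a weighted point set: each point caches the
    squared distance to its closest chosen center, so adding a center
    requires only one new distance computation per point."""
    rng = random.Random(seed)

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [points[rng.randrange(len(points))]]
    closest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # selection probability is proportional to weight * cached distance
        scores = [w * d for w, d in zip(weights, closest)]
        r = rng.random() * sum(scores)
        acc, chosen = 0.0, len(points) - 1
        for i, s in enumerate(scores):
            acc += s
            if acc > r:
                chosen = i
                break
        centers.append(points[chosen])
        # the cache shrinks only where the new center is nearer
        closest = [min(c, dist(p, centers[-1]))
                   for p, c in zip(points, closest)]
    return centers
```

The PR additionally runs this selection in rounds of `perRound` centers at a time; the caching argument is the same.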
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488302 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -17,429 +17,57 @@ package org.apache.spark.mllib.clustering -import scala.collection.mutable.ArrayBuffer -import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm} - -import org.apache.spark.annotation.Experimental -import org.apache.spark.Logging -import org.apache.spark.SparkContext._ -import org.apache.spark.mllib.linalg.{Vector, Vectors} -import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.base.{FP, PointOps} +import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps import org.apache.spark.rdd.RDD -import org.apache.spark.storage.StorageLevel -import org.apache.spark.util.random.XORShiftRandom - -/** - * K-means clustering with support for multiple parallel runs and a k-means++ like initialization - * mode (the k-means|| algorithm by Bahmani et al). When multiple concurrent runs are requested, - * they are executed together with joint passes over the data for efficiency. - * - * This is an iterative algorithm that will make multiple passes over the data, so any RDDs given - * to it should be cached by the user. - */ -class KMeans private ( -private var k: Int, -private var maxIterations: Int, -private var runs: Int, -private var initializationMode: String, -private var initializationSteps: Int, -private var epsilon: Double) extends Serializable with Logging { - - /** - * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, - * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4}. - */ - def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4) - - /** Set the number of clusters to create (k). Default: 2. */ - def setK(k: Int): this.type = { --- End diff -- This is a breaking change.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488348 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeansClusterer.scala --- @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.Logging +import org.apache.spark.mllib.base.FP +import org.apache.spark.rdd.RDD + + +private[mllib] trait MultiKMeansClusterer[P : FP, C : FP] extends Serializable with Logging { + def cluster(data: RDD[P], centers: Array[Array[C]]): (Double, GeneralizedKMeansModel[P, C]) --- End diff -- missing doc
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488358 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering.metrics + +import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV} +import org.apache.spark.mllib.base._ +import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector} + +import org.apache.spark.mllib.base.{Centroid, FPoint, PointOps, Infinity, Zero} + + --- End diff -- remove extra empty lines
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488316 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -17,429 +17,57 @@ package org.apache.spark.mllib.clustering -import scala.collection.mutable.ArrayBuffer -import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm} - -import org.apache.spark.annotation.Experimental -import org.apache.spark.Logging -import org.apache.spark.SparkContext._ -import org.apache.spark.mllib.linalg.{Vector, Vectors} -import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.base.{FP, PointOps} +import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps import org.apache.spark.rdd.RDD -import org.apache.spark.storage.StorageLevel -import org.apache.spark.util.random.XORShiftRandom - -/** - * K-means clustering with support for multiple parallel runs and a k-means++ like initialization - * mode (the k-means|| algorithm by Bahmani et al). When multiple concurrent runs are requested, - * they are executed together with joint passes over the data for efficiency. - * - * This is an iterative algorithm that will make multiple passes over the data, so any RDDs given - * to it should be cached by the user. - */ -class KMeans private ( -private var k: Int, -private var maxIterations: Int, -private var runs: Int, -private var initializationMode: String, -private var initializationSteps: Int, -private var epsilon: Double) extends Serializable with Logging { - - /** - * Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1, - * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4}. - */ - def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4) - - /** Set the number of clusters to create (k). Default: 2. */ - def setK(k: Int): this.type = { -this.k = k -this - } - - /** Set maximum number of iterations to run. Default: 20. */ - def setMaxIterations(maxIterations: Int): this.type = { -this.maxIterations = maxIterations -this - } - - /** - * Set the initialization algorithm. This can be either "random" to choose random points as - * initial cluster centers, or "k-means||" to use a parallel variant of k-means++ - * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: "k-means||". - */ - def setInitializationMode(initializationMode: String): this.type = { -if (initializationMode != KMeans.RANDOM && initializationMode != KMeans.K_MEANS_PARALLEL) { - throw new IllegalArgumentException("Invalid initialization mode: " + initializationMode) -} -this.initializationMode = initializationMode -this - } - - /** - * :: Experimental :: - * Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm - * this many times with random starting conditions (configured by the initialization mode), then - * return the best clustering found over any run. Default: 1. - */ - @Experimental - def setRuns(runs: Int): this.type = { -if (runs <= 0) { - throw new IllegalArgumentException("Number of runs must be positive") -} -this.runs = runs -this - } - - /** - * Set the number of steps for the k-means|| initialization mode. This is an advanced - * setting -- the default of 5 is almost always enough. Default: 5. - */ - def setInitializationSteps(initializationSteps: Int): this.type = { -if (initializationSteps <= 0) { - throw new IllegalArgumentException("Number of initialization steps must be positive") -} -this.initializationSteps = initializationSteps -this - } - - /** - * Set the distance threshold within which we've consider centers to have converged. - * If all centers move less than this Euclidean distance, we stop iterating one run. - */ - def setEpsilon(epsilon: Double): this.type = { -this.epsilon = epsilon -this - } - - /** Whether a warning should be logged if the input RDD is uncached. */ - private var warnOnUncachedInput = true - - /** Disable warnings about uncached input.
*/ - private[spark] def disableUncachedWarning(): this.type = { -warnOnUncachedInput = false -this - } - - /** - * Train a K-means model on the given set of points; `data` should be cached for high - * performance, because this is an iterative algorithm. - */ - def run(data: RDD[Vector]): KMeansModel = { - -if (warnOnUncachedInput
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488342 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeans.scala --- @@ -0,0 +1,134 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import org.apache.spark.Accumulator +import org.apache.spark.SparkContext._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.mllib.base.{Centroid, FP, PointOps, Zero} +import org.apache.spark.rdd.RDD + +import scala.collection.mutable.ArrayBuffer +import scala.reflect.ClassTag + + +/** + * A K-Means clustering implementation that performs multiple K-means clusterings simultaneously, + * returning the one with the lowest cost. + * + * We only compute if a center has moved if we need to. + * + * We use null to represent empty clusters instead of an Option type to save space. + * + * The resulting clustering may contain fewer than K clusters. + */ + +private[mllib] class MultiKMeans[P : FP : ClassTag, C : FP : ClassTag]( + pointOps: PointOps[P, C], maxIterations: Int) extends MultiKMeansClusterer[P, C] { + + def cluster(data: RDD[P], centers: Array[Array[C]]): (Double, GeneralizedKMeansModel[P, C]) = { +val runs = centers.length +val active = Array.fill(runs)(true) +val costs = Array.fill(runs)(Zero) +var activeRuns = new ArrayBuffer[Int] ++ (0 until runs) +var iteration = 0 + +/* + * Execute iterations of Lloyd's algorithm until all runs have converged. + */ + +while (iteration < maxIterations && activeRuns.nonEmpty) { + // remove the empty clusters + log.info("iteration {}", iteration) + + val activeCenters = activeRuns.map(r => centers(r)).toArray + + if (log.isInfoEnabled) { +for (r <- 0 until activeCenters.length) + log.info("run {} has {} centers", activeRuns(r), activeCenters(r).length) + } + + // Find the sum and count of points mapping to each center + val (centroids, runDistortion) = getCentroids(data, activeCenters) + + if (log.isInfoEnabled) { +for (run <- activeRuns) log.info("run {} distortion {}", run, runDistortion(run)) + } + + for (run <- activeRuns) active(run) = false + + for (((runIndex: Int, clusterIndex: Int), cn: Centroid) <- centroids) { +val run = activeRuns(runIndex) +if (cn.isEmpty) { + active(run) = true + centers(run)(clusterIndex) = null.asInstanceOf[C] +} else { + val centroid = pointOps.centroidToPoint(cn) + active(run) = active(run) || pointOps.centerMoved(centroid, centers(run)(clusterIndex)) + centers(run)(clusterIndex) = pointOps.pointToCenter(centroid) +} + } + + // filter out null centers + for (r <- activeRuns) centers(r) = centers(r).filter(_ != null) + + // update distortions and print log message if run completed during this iteration + for ((run, runIndex) <- activeRuns.zipWithIndex) { +costs(run) = runDistortion(runIndex) +if (!active(run)) log.info("run {} finished in {} iterations", run, iteration + 1) + } + activeRuns = activeRuns.filter(active(_)) + iteration += 1 +} + +val best = costs.zipWithIndex.min._2 +(costs(best), new GeneralizedKMeansModel(pointOps, centers(best))) + } + + def getCentroids(data: RDD[P], activeCenters: Array[Array[C]]) = { +val runDistortion = activeCenters.map(_ => data.sparkContext.accumulator(Zero)) +val bcActiveCenters = data.sparkContext.broadcast(activeCenters) +val ops = pointOps +val result = data.mapPartitions { points => + val bcCenters = bcActiveCenters.value + val centers = bcCenters.map { +_.map { _ =>
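The multi-run bookkeeping quoted above (several center sets iterated together, the lowest-cost one returned) can be sketched in Python as follows; this is an illustration with hypothetical names and in-memory lists, not the PR's RDD implementation:

```python
def multi_kmeans(points, initial_center_sets, max_iterations=20):
    """Run Lloyd's algorithm once per initial center set and return
    (cost, centers) for the run with the lowest cost."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_run(centers):
        for _ in range(max_iterations):
            clusters = [[] for _ in centers]
            for p in points:
                best = min(range(len(centers)),
                           key=lambda j: dist(p, centers[j]))
                clusters[best].append(p)
            # an empty cluster keeps its old center (the PR instead drops it)
            centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else old
                       for old, c in zip(centers, clusters)]
        cost = sum(min(dist(p, c) for c in centers) for p in points)
        return cost, centers

    # keep the run with the lowest distortion, as in the quoted code
    return min((one_run(cs) for cs in initial_center_sets),
               key=lambda t: t[0])
```

The Spark version additionally shares each pass over the data across all still-active runs, which is where the real savings come from.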
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488359 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala --- @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib + +import org.apache.spark.mllib.linalg.Vector +import breeze.linalg.{Vector => BV} +import org.apache.spark.rdd.RDD + +package object base { + + val Zero = 0.0 + val One = 1.0 + val Infinity = Double.MaxValue + val Unknown = -1.0 + + private[mllib] trait FP extends Serializable { +val weight: Double +val raw: BV[Double] + } + + private[mllib] class FPoint(val raw: BV[Double], val weight: Double) extends FP { +override def toString: String = weight + "," + (raw.toArray mkString ",") +lazy val inh = (raw :* (1.0 / weight)).toArray + } + + /** + * A mutable point in homogeneous coordinates that is lazily initialized. + */ + private[mllib] class Centroid extends Serializable {
+override def toString: String = weight + (if(raw != null) ("," + (raw.toArray mkString ",")) else "") + +def isEmpty = weight == Zero + +var raw: BV[Double] = null + +var weight: Double = Zero + +def add(p: Centroid): this.type = add(p.raw, p.weight) + +def add(p: FP): this.type = add(p.raw, p.weight) + +def sub(p: Centroid): this.type = sub(p.raw, p.weight) + +def sub(p: FP): this.type = sub(p.raw, p.weight) + +def sub(r: BV[Double], w: Double): this.type = { + if (r != null) { +if (raw == null) { + raw = r.toVector :*= -1.0 + weight = w * -1 +} else { + raw -= r + weight = weight - w +} + } + this +} + +def add(r: BV[Double], w: Double) : this.type = { + if (r != null) { +if (raw == null) { + raw = r.toVector + weight = w +} else { + raw += r + weight = weight + w +} + } + this +} + } + + private[mllib] trait PointOps[P : FP, C : FP] { + +def distance(p: P, c: C, upperBound: Double): Double --- End diff -- missing doc
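The `Centroid` class under review is a lazily initialized, mutable weighted sum of points in homogeneous coordinates. A minimal Python analogue of the same idea (names hypothetical):

```python
class Centroid:
    """Mutable weighted sum of points, lazily initialized so that an
    empty cluster costs nothing until its first point is added."""
    def __init__(self):
        self.raw = None       # running coordinate sums, or None if empty
        self.weight = 0.0     # running total weight

    def is_empty(self):
        return self.weight == 0.0

    def add(self, coords, weight=1.0):
        """Accumulate one weighted point; returns self for chaining."""
        if self.raw is None:
            self.raw = list(coords)
            self.weight = weight
        else:
            for i, x in enumerate(coords):
                self.raw[i] += x
            self.weight += weight
        return self

    def mean(self):
        """Inhomogeneous coordinates: the running sum divided by the weight."""
        return [x / self.weight for x in self.raw]
```

The Scala version also supports `sub` so a point can be moved between centroids without rebuilding either one.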
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2634#discussion_r18488352 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/EuclideanOps.scala ---
@@ -0,0 +1,67 @@

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.mllib.clustering.metrics

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.base.{Centroid, FPoint, PointOps, Infinity, Zero}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

/**
 * Euclidean distance measure
 *
 * This is the slow implementation of the squared Euclidean distance function,
 * shown here simply for clarity.
 */
class EuclideanOps extends PointOps[FPoint, FPoint] with Serializable {

  type C = FPoint
  type P = FPoint

  val epsilon = 1e-4

  def distance(p: P, c: C, upperBound: Double = Infinity): Double = {
    val d = p.inh.zip(c.inh).foldLeft(Zero) {
      case (d: Double, (a: Double, b: Double)) => d + (a - b) * (a - b)
    }
    if( d < Zero) Zero else d
--- End diff --

`if (d < Zero)` (space after `if` and no space after `(`)
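The "slow" implementation above folds over paired coordinates. The fast counterpart the PR description alludes to (`FastEuclideanOps`) caches squared norms so that ||a − b||² can be computed as ||a||² + ||b||² − 2⟨a, b⟩. A self-contained sketch of that trick (the `NormedPoint` and `FastEuclidean` names are illustrative, not the PR's actual API):

```scala
// Hedged sketch of the norm-caching trick: precompute each vector's
// squared norm once, then distances need only a dot product.
case class NormedPoint(v: Array[Double], norm2: Double)

object NormedPoint {
  def apply(v: Array[Double]): NormedPoint =
    NormedPoint(v, v.map(x => x * x).sum)
}

object FastEuclidean {
  // ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>, clamped at zero to
  // guard against small negative results from floating-point cancellation
  // (the same clamp the slow implementation performs with `if (d < Zero)`).
  def distance(a: NormedPoint, b: NormedPoint): Double = {
    val dot = a.v.zip(b.v).map { case (x, y) => x * y }.sum
    math.max(0.0, a.norm2 + b.norm2 - 2.0 * dot)
  }
}
```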
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58106056 @derrickburns I marked a few style problems (not all of them). There are breaking changes in your PR, which we should avoid as much as possible. Even if we want to remove some methods, we should deprecate them first and remove them in a later release. As for the serialization problem, I'm not sure whether I'll have time to look into it this week. It would be nice to split this into small PRs, so we can track down the problem faster.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58106085 I know exactly which closure is exceeding the size limitation. The problem is that I cannot see how to make the closure capture less data! On Mon, Oct 6, 2014 at 2:42 PM, Xiangrui Meng notificati...@github.com wrote: @derrickburns https://github.com/derrickburns The style test doesn't capture all, unfortunately. The Spark Code Style Guide is the first place to check. I will mark a few examples inline. I think the best way to track down the problem is to split this PR into small ones and see whether we can find it. Is there a good way to split this? Thanks!
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58106509 @mengxr Is there an IntelliJ or Eclipse configuration that I can use to reformat the code according to the guidelines? The breaking change in the PR is for a private constructor that is, therefore, not used outside of Spark. Do these need to be deprecated first as well?
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58120070 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21355/consoleFull) for PR 2634 at commit [`c603079`](https://github.com/apache/spark/commit/c6030791406238430548634677a4f9d4c59aa93a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58122649 @derrickburns I don't know any formatter that can do the job nicely. This has to be done by hand at the moment, unfortunately. `KMeans` has a public constructor that takes no arguments besides the default constructor, and users can then use setters to set parameters. So they are public APIs.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58122976 Ah, I see. Will fix.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58123795 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21355/consoleFull) for PR 2634 at commit [`c603079`](https://github.com/apache/spark/commit/c6030791406238430548634677a4f9d4c59aa93a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58123798 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21355/
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58126789 @mengxr I restored the KMeans class and public methods on that class. I did not mark them for deprecation. I also re-formatted the code to follow the Spark style guide. I hope that I caught all the cases. I used the Scalariform plugin for IntelliJ, with some manual adjustments to handle long formal parameter lists. The closure data capture problem occurs at MultiKMeans.scala:105. Thx.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58127303 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21358/consoleFull) for PR 2634 at commit [`e2b6c02`](https://github.com/apache/spark/commit/e2b6c02f0f6ed715c7c72a1bd2db217a93deba3c). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58127496 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21357/
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58130332 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21358/
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-58130318 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21358/consoleFull) for PR 2634 at commit [`e2b6c02`](https://github.com/apache/spark/commit/e2b6c02f0f6ed715c7c72a1bd2db217a93deba3c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57841035 @derrickburns The `*ClusterSuite` was created to prevent pulling unnecessary objects into the task closure. You can try to remove `Serializable` from the algorithms. While the models are serializable, the algorithm instances should stay on the driver node. If you want to use a member method in a task closure, either make it static or define it as a local method. If you want to use a member variable, assign it to a `val` first. This is something we can try. Avoiding serializing unnecessary objects is a good practice, but I'm not sure whether it is worth the effort. Btw, could you update your PR to follow the https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide ? Thanks!
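The member-variable advice above — copy the field into a local `val` so the closure captures only that value rather than the whole enclosing object — can be sketched as follows. The class and method names are illustrative, not from the actual PR:

```scala
// Sketch of the closure-capture fix mengxr describes. In Spark, a lambda
// that references `this.field` drags the entire enclosing instance into
// the serialized task closure; a local val avoids that.
class Clusterer(val epsilon: Double) extends Serializable {

  // PROBLEM: `x < epsilon` desugars to `x < this.epsilon`, so the whole
  // Clusterer instance is captured and serialized with the closure.
  def countCloseBad(xs: Seq[Double]): Int =
    xs.count(x => x < epsilon)

  // FIX: copy the field into a local val first; the closure now captures
  // only a Double, keeping the task closure small.
  def countCloseGood(xs: Seq[Double]): Int = {
    val eps = epsilon
    xs.count(x => x < eps)
  }
}
```

Outside Spark both methods behave identically; the difference only matters once the closure must be serialized and shipped to executors.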
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57894117 I ran the style tests. They pass. Is there something else in the style guide that is not captured in the tests? I have expended much effort to avoid serializing unnecessary objects. I'm still perplexed why so much data is being captured in the closure that the test fails. Anyway, what are the next steps? Omit the test and approve the PR? Ask someone to help fix the code to avoid the unit test failure? Thx!
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57679766 @derrickburns Side note: It looks like a lot of [extraneous commits](https://github.com/apache/spark/pull/2419#commits-pushed-471d5f2) got pulled into this PR by mistake. Did you rebase your PR correctly? As it stands, this PR does not merge cleanly.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57691032 Uh oh.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns closed the pull request at: https://github.com/apache/spark/pull/2419
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57696626 I will close this pull request and create another.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
GitHub user derrickburns opened a pull request: https://github.com/apache/spark/pull/2634 [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means Clusterer This commit introduces a general distance function trait, `PointOps`, for the Spark K-Means clusterer. There are no public API changes*. ### Issue - Data Capture Test Fails - NEED HELP The `org.apache.spark.mllib.clustering.KMeansClusterSuite` test "task size should be small in both training and prediction" fails, suggesting that the RDD data is being captured in a closure. This is quite puzzling. My efforts to solve this problem have failed. I *need help* to solve this problem. ### Distance Function Trait The `PointOps` trait defines the distance function, but it is more than a simple distance function. It also defines the types of Points and Centers for the clusterer. Standard MLLIB `Vector`s are converted into Points and Centers. In the case of the `FastEuclideanOps` implementation of `PointOps`, the Point and Center types include vector norm members. In other distance functions, such as the [Kullback-Leibler distance function](http://en.wikipedia.org/wiki/Bregman_divergence), the Point and Center types include different values that speed up the distance calculation, in a similar way that caching vector norms speeds up the Euclidean distance function. This addresses SPARK-3219. ### Refactoring To understand the original code, I found it useful to refactor the original implementation into components. You may find it helpful to understand this pull request by looking at the new components and comparing them to their original implementation. Unfortunately, GitHub diff does not help very much with this. This commit splits up the clusterer into a number of components which behave (largely) like their predecessors. `KMeansParallel` implements the K-Means || initialization algorithm. `KMeansRandom` implements the K-Means Random initialization algorithm.
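A distance-function trait of the kind the PR describes — one that fixes both the metric and the Point/Center representations — can be sketched in plain Scala. The trait and object names below are illustrative assumptions, not the PR's exact signatures:

```scala
// Illustrative sketch of a PointOps-style abstraction: the trait picks
// both the distance function and how raw vectors become points/centers.
trait DistOps[P, C] {
  def distance(p: P, c: C): Double
  def toPoint(v: Array[Double]): P
  def toCenter(v: Array[Double]): C
}

// A trivial Manhattan (L1) instance over raw arrays, to show that the
// metric is swappable without touching the clustering driver.
object ArrayManhattan extends DistOps[Array[Double], Array[Double]] {
  def distance(p: Array[Double], c: Array[Double]): Double =
    p.zip(c).map { case (a, b) => math.abs(a - b) }.sum
  def toPoint(v: Array[Double]): Array[Double] = v
  def toCenter(v: Array[Double]): Array[Double] = v
}
```

A clusterer written against `DistOps[P, C]` never sees raw vectors, which is what lets implementations like `FastEuclideanOps` smuggle cached norms into the `P` and `C` types.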
`MultiKMeans` implements the K-Means algorithm on multiple sets of cluster centers using a given distance function. Traits for the initializer, `KMeansInitializer`, and the general K-Means clusterer, `MultiKMeansClusterer`, are provided to highlight the salient interfaces, with the intent that alternative implementations of these interfaces may be provided in the future. ### Performance This pull request is not focused on performance. Nevertheless, the performance of the KMeans++ implementation was *dramatically* improved by NOT recomputing distances to cluster centers that were present in previous steps. This turns a quadratic implementation into a linear one. Second, the KMeans++ implementation uses the general K-Means clusterer in the final step. This parallelizes a step that was sequential. Together, these changes address SPARK-3424. ### Next Steps This pull request does not introduce new user-visible changes. The next step is to make different distance functions available through a user-visible API. I will provide other distance functions after this pull request has been accepted. Then, we can decide on an appropriate user-level API to access those functions. ### Compatibility While there are no user-level API changes, the behavior of the clusterer is *different* on some tests. Specifically, the handling of empty clusters has changed. Empty clusters are not filled with random points in this implementation. The former behavior is undesirable for a number of reasons, not the least of which is that there is no reasonable use for duplicate cluster centers. To accommodate the change in behavior, the test cases were changed accordingly. This addresses SPARK-3261. The private K-Means constructor which was used by some test Java code and one example was replaced with a Scala constructor that is not Java friendly. Since the constructor was not user visible, I simply changed the Java test code and the example to use the higher-level interface.
### Testing This code has been tested (albeit while packaged outside of Spark) and performance measured on data sets of millions of features each with hundreds of dimensions and on tens of thousands of clusters. You can merge this pull request into a Git repository by running: $ git pull https://github.com/derrickburns/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2634.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2634 commit fbfdcd8946d8681c062ba08d71eac961db50417d Author: Derrick Burns derrickrbu...@gmail.com Date: 2014-10-02T20:01:18Z This commit introduces a general distance function trait, PointOps,
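The quadratic-to-linear improvement described in the Performance section amounts to maintaining a running minimum distance per point and updating it only against the newest center, rather than rescanning all chosen centers each round. A minimal, non-Spark sketch of that bookkeeping (names and structure are illustrative, not the PR's `KMeansParallel` code):

```scala
import scala.util.Random

// Illustrative k-means++-style seeding that keeps a running min-distance
// per point, so each selection round costs O(n) instead of O(n * k).
object SeedSketch {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def select(points: IndexedSeq[Array[Double]], k: Int, rng: Random): IndexedSeq[Array[Double]] = {
    val centers = scala.collection.mutable.ArrayBuffer(points(rng.nextInt(points.length)))
    // minDist(i) = squared distance from points(i) to its nearest chosen center
    val minDist = Array.tabulate(points.length)(i => sqDist(points(i), centers(0)))
    while (centers.length < k) {
      // Pick the next center proportionally to minDist (D^2 weighting).
      var r = rng.nextDouble() * minDist.sum
      var idx = 0
      while (idx < points.length - 1 && r > minDist(idx)) { r -= minDist(idx); idx += 1 }
      centers += points(idx)
      // Update ONLY against the new center; distances to old centers
      // cannot shrink, so previous minima stay valid.
      for (i <- points.indices)
        minDist(i) = math.min(minDist(i), sqDist(points(i), centers.last))
    }
    centers.toIndexedSeq
  }
}
```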
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57698333 @mengxr I created a new clean pull request. I *still need help* to understand/fix a closure that is capturing too much data.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57698619 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21220/consoleFull) for PR 2634 at commit [`fbfdcd8`](https://github.com/apache/spark/commit/fbfdcd8946d8681c062ba08d71eac961db50417d). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57708661 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21220/consoleFull) for PR 2634 at commit [`fbfdcd8`](https://github.com/apache/spark/commit/fbfdcd8946d8681c062ba08d71eac961db50417d). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57708679 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21220/
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2634#issuecomment-57714029 @mengxr This is as expected. I need help in solving the closure data capture problem.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57403972 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21076/consoleFull) for PR 2419 at commit [`48a31dd`](https://github.com/apache/spark/commit/48a31dd36e1defa7c4279548579bc1874820a5ae). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404043 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21076/consoleFull) for PR 2419 at commit [`48a31dd`](https://github.com/apache/spark/commit/48a31dd36e1defa7c4279548579bc1874820a5ae). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404045 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21076/
[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404291 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21077/consoleFull) for PR 2419 at commit [`ebdc853`](https://github.com/apache/spark/commit/ebdc853de33373cdd92f5401288d89ab6e879741). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404357 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21077/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404356 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21077/consoleFull) for PR 2419 at commit [`ebdc853`](https://github.com/apache/spark/commit/ebdc853de33373cdd92f5401288d89ab6e879741). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404968 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21078/consoleFull) for PR 2419 at commit [`d6f7c66`](https://github.com/apache/spark/commit/d6f7c66dae95a0b6967637d1132bb7327ab41d8c). * This patch merges cleanly.
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57404998 @mengxr
1. I fixed the merge issue and also remerged to capture more recent changes.
2. I did as you suggested and introduced a local variable to hold the value used in the closure. *Unfortunately, the test still fails. Since all values used in the closure are local variables, I do not know what to do at this point.*
3. The best fix for the problem with the test that selects two random centers is the one that I implemented: it passes the test deterministically while still allowing any random seed.
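The closure-serialization issue being debated here can be shown in isolation. The sketch below uses hypothetical names (`Clusterer`, `localK`, `serializes`), not the actual PR code; it illustrates why copying a class member into a local val before using it in a closure matters.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side algorithm object; note it is NOT Serializable.
class Clusterer(val k: Int) {
  // Referencing the field `k` inside the lambda captures `this`, so serializing
  // the closure drags in the whole non-serializable Clusterer instance.
  def badClosure: Int => Int = x => x % k

  // Spores-style fix: copy the field into a local val first; the closure
  // then captures only the Int, not the enclosing instance.
  def goodClosure: Int => Int = {
    val localK = k
    x => x % localK
  }
}

// Mimics what a serializer would attempt before shipping a task to executors.
def serializes(f: Int => Int): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
    true
  } catch {
    case _: NotSerializableException => false
  }
```

Under these assumptions, serializing `badClosure` fails while `goodClosure` succeeds, which mirrors the "Task not serializable" failures Spark reports when a closure accidentally captures its enclosing object.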
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57408391 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21078/consoleFull) for PR 2419 at commit [`d6f7c66`](https://github.com/apache/spark/commit/d6f7c66dae95a0b6967637d1132bb7327ab41d8c). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57408395 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21078/
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2419#discussion_r18184806
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,16 +17,13 @@
 package org.apache.spark.mllib.clustering

-import scala.collection.mutable.ArrayBuffer
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
+<<<<<<< HEAD
--- End diff --
not merged cleanly
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57233504 @derrickburns Could you merge the PR cleanly? For the closure serialization, just go with the spores style: assign the class member to a val and reference the val in the closure. Creating a new class with a single object is not necessary. For randomization, we should try to make the algorithm deterministic given a random seed. Then we can fix the random seed for tests.
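The determinism point can be sketched as follows; `chooseInitialCenters` and its parameters are illustrative, not the PR's actual API. The idea is to thread a caller-supplied seed through every random choice, so a test can pin the seed and assert on an exact, reproducible result.

```scala
import scala.util.Random

// Illustrative sketch: pick k distinct initial-center indices out of numPoints,
// driven entirely by the supplied seed so the choice is reproducible.
def chooseInitialCenters(numPoints: Int, k: Int, seed: Long): Seq[Int] = {
  require(k <= numPoints, "cannot pick more centers than points")
  val rng = new Random(seed) // all randomness flows from this one seed
  rng.shuffle((0 until numPoints).toList).take(k)
}
```

A test can then fix the seed and check the exact centers chosen, instead of relying on a selection happening to be "random enough" across runs.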
Github user derrickburns commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57238377 Will do. Sent from my iPhone
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2419#issuecomment-57012356 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20871/