[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-07-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2634





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-05-18 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-103156795
  
Do you mind closing this PR?





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-05-18 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-103169922
  
We can't directly but there's an automated process that will eventually. 
Don't worry about it and/or get to it later.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-05-18 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-103169503
  
Sure, but I'm traveling now. Would you mind closing it for me?

Sent from my iPhone

 On May 18, 2015, at 8:19 PM, Sean Owen notificati...@github.com wrote:
 
 Do you mind closing this PR?
 
 —
 Reply to this email directly or view it on GitHub.
 






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-05-18 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-103178283
  
Ok thx !

Sent from my iPhone

 On May 18, 2015, at 8:47 PM, Sean Owen notificati...@github.com wrote:
 
 We can't directly but there's an automated process that will eventually. 
Don't worry about it and/or get to it later.
 
 —
 Reply to this email directly or view it on GitHub.
 






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-25 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23507379
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

@mengxr
By separating the initialization of cluster centers from the clustering,
one may trivially implement several variants of the standard K-means
clustering, including the Wavelet-Based Anytime Algorithm for K-Means
Clustering of Time Series,
http://www.cs.gmu.edu/~jessica/publications/ikmeans_sdm_workshop03.pdf
which has been demonstrated to give better quality clusterings in less time
than the standard algorithm.

To demonstrate this, I have implemented the aforementioned algorithm in
roughly 50 lines of new code. See
com.massivedatascience.clusterer.WaveletKMeans.
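
A minimal plain-Scala sketch of that separation, with invented names (nothing below is the PR's or WaveletKMeans's actual API): the refinement loop takes the initializer as a parameter, so a wavelet-based seeding can be swapped in without touching the loop.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Illustrative initializer contract: data and k in, k starting centers out.
  type Initializer = (Seq[Point], Int) => Seq[Point]

  def sqDist(a: Point, b: Point): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // Simple random seeding; a wavelet-based initializer would share this type.
  val randomInit: Initializer =
    (data, k) => scala.util.Random.shuffle(data).take(k)

  // Lloyd-style refinement, independent of how the centers were produced.
  def lloyd(data: Seq[Point], initial: Seq[Point], iters: Int): Seq[Point] = {
    var centers = initial
    for (_ <- 0 until iters) {
      val byCenter =
        data.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      centers = centers.indices.map { i =>
        byCenter.get(i).map { pts =>
          val sum = pts.reduce((x, y) => x.zip(y).map { case (a, b) => a + b })
          sum.map(_ / pts.size)
        }.getOrElse(centers(i)) // an empty cluster keeps its old center
      }
    }
    centers
  }

  def cluster(data: Seq[Point], k: Int, init: Initializer): Seq[Point] =
    lloyd(data, init(data, k), iters = 10)
}
```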

On Sat, Jan 24, 2015 at 1:45 AM, Derrick Burns derrickrbu...@gmail.com
wrote:

 @mengxr
 I have rewritten the README file
 https://github.com/derrickburns/generalized-kmeans-clustering for my
 clustering GitHub project. I think that you may find it useful.

 On Thu, Jan 22, 2015 at 7:43 PM, Derrick Burns derrickrbu...@gmail.com
 wrote:

 @mengxr

 I spent some time thinking about clustering sparse data.  I realized that
 there may be a need to embed the sparse data into lower dimensional
 spaces.  Then, I realized that one of the Bregman divergences (i.e. the
 generalized symmetrized Bregman divergence), also embeds data into a
 different space before clustering.  This led me to the conclusion that
 embedding the data into a convenient space just prior to clustering is a
 very useful generic feature.  Adding it was trivial.  See the README on 
my
 GitHub project for details.

 I implemented two generic embeddings: 1) a sparse to dense embedding that
 maps points isomorphically and 2) random indexing
 https://www.sics.se/~mange/papers/RI_intro.pdf that maps high
 dimensional data into lower dimension in a way that preserves similarity
 per the Johnson-Lindenstrauss lemma
 http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma. I
 also recoded the embedding that I did for a generalized symmetrized KL
 divergence to use the new embedding design.

 I think that with these improvements one can check the "clusterer
 supports sparse data" feature box in a practical sense.

 I am about to begin performance testing of the random indexing
 embedding on large volumes of high dimensional sparse data.

 Please take a look at the new PointOps trait definition.

 On Mon, Jan 19, 2015 at 5:36 PM, Derrick Burns derrickrbu...@gmail.com
 wrote:

 For the squared Euclidean distance case, weights would naturally start
 at 1.0 for each input point, so that a cluster with 100 points would 
have
 weight 100.0.

 However, for the KL divergence where the points are derived from
 frequencies, the initial weight could be the sum of the frequencies. 
This
 puts the points on the simplex without having to perform the division.

 Sent from my iPhone

 On Jan 19, 2015, at 5:02 PM, Xiangrui Meng notificati...@github.com
 wrote:

 In
 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 https://github.com/apache/spark/pull/2634#discussion-diff-23197316:

  + * 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-24 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23495418
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

@mengxr
I have rewritten the README file
https://github.com/derrickburns/generalized-kmeans-clustering for my
 clustering GitHub project. I think that you may find it useful.

On Thu, Jan 22, 2015 at 7:43 PM, Derrick Burns derrickrbu...@gmail.com
wrote:

 @mengxr

 I spent some time thinking about clustering sparse data.  I realized that
 there may be a need to embed the sparse data into lower dimensional
 spaces.  Then, I realized that one of the Bregman divergences (i.e. the
 generalized symmetrized Bregman divergence), also embeds data into a
 different space before clustering.  This led me to the conclusion that
 embedding the data into a convenient space just prior to clustering is a
 very useful generic feature.  Adding it was trivial.  See the README on my
 GitHub project for details.

 I implemented two generic embeddings: 1) a sparse to dense embedding that
 maps points isomorphically and 2) random indexing
 https://www.sics.se/~mange/papers/RI_intro.pdf that maps high
 dimensional data into lower dimension in a way that preserves similarity
 per the Johnson-Lindenstrauss lemma
 http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma. I
 also recoded the embedding that I did for a generalized symmetrized KL
 divergence to use the new embedding design.

 I think that with these improvements one can check the "clusterer supports
 sparse data" feature box in a practical sense.

 I am about to begin performance testing of the random indexing
 embedding on large volumes of high dimensional sparse data.

 Please take a look at the new PointOps trait definition.

 On Mon, Jan 19, 2015 at 5:36 PM, Derrick Burns derrickrbu...@gmail.com
 wrote:

 For the squared Euclidean distance case, weights would naturally start at
 1.0 for each input point, so that a cluster with 100 points would have
 weight 100.0.

 However, for the KL divergence where the points are derived from
 frequencies, the initial weight could be the sum of the frequencies. This
 puts the points on the simplex without having to perform the division.

 Sent from my iPhone

 On Jan 19, 2015, at 5:02 PM, Xiangrui Meng notificati...@github.com
 wrote:

 In
 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 https://github.com/apache/spark/pull/2634#discussion-diff-23197316:

  + * distributed under the License is distributed on an "AS IS" BASIS,
  + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  + * See the License for the specific language governing permissions and
  + * limitations under the License.
  + */
  +
  +package org.apache.spark.mllib.clustering.metrics
  +
  +import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
  +
  +import org.apache.spark.mllib.base._
  +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
  +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
  +
  +class FastEUPoint(raw: BV[Double], weight: Double) extends 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-22 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23430407
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

@mengxr

I spent some time thinking about clustering sparse data.  I realized that
there may be a need to embed the sparse data into lower dimensional
spaces.  Then, I realized that one of the Bregman divergences (i.e. the
generalized symmetrized Bregman divergence), also embeds data into a
different space before clustering.  This led me to the conclusion that
embedding the data into a convenient space just prior to clustering is a
very useful generic feature.  Adding it was trivial.  See the README on my
GitHub project for details.

I implemented two generic embeddings: 1) a sparse to dense embedding that
maps points isomorphically and 2) random indexing
https://www.sics.se/~mange/papers/RI_intro.pdf that maps high dimensional
data into lower dimension in a way that preserves similarity per the
Johnson-Lindenstrauss
lemma http://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma. I
also recoded the embedding that I did for a generalized symmetrized KL
divergence to use the new embedding design.
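
To make the embedding idea concrete, here is a hedged sketch with invented signatures (the repository's actual Embedding abstraction may differ); a sparse point is passed as parallel index/value arrays.

```scala
import scala.util.Random

trait Embedding {
  def embed(indices: Array[Int], values: Array[Double], dim: Int): Array[Double]
}

// 1) Sparse-to-dense: map each point isomorphically into a dense vector.
object DenseEmbedding extends Embedding {
  def embed(indices: Array[Int], values: Array[Double], dim: Int): Array[Double] = {
    val out = new Array[Double](dim)
    for (i <- indices.indices) out(indices(i)) = values(i)
    out
  }
}

// 2) Random-indexing-style projection into `target` dimensions: each input
// dimension contributes along a fixed signed pseudo-random direction, which
// approximately preserves distances per the Johnson-Lindenstrauss lemma.
class RandomIndexingEmbedding(target: Int, seed: Long = 42L) extends Embedding {
  def embed(indices: Array[Int], values: Array[Double], dim: Int): Array[Double] = {
    val out = new Array[Double](target)
    for (i <- indices.indices) {
      val rng = new Random(seed ^ indices(i)) // direction is fixed per dimension
      var j = 0
      while (j < target) {
        out(j) += values(i) * (if (rng.nextBoolean()) 1.0 else -1.0)
        j += 1
      }
    }
    out.map(_ / math.sqrt(target.toDouble))
  }
}
```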

I think that with these improvements one can check the "clusterer supports
sparse data" feature box in a practical sense.

I am about to begin performance testing of the random indexing embedding
on large volumes of high dimensional sparse data.

Please take a look at the new PointOps trait definition.

On Mon, Jan 19, 2015 at 5:36 PM, Derrick Burns derrickrbu...@gmail.com
wrote:

 For the squared Euclidean distance case, weights would naturally start at
 1.0 for each input point, so that a cluster with 100 points would have
 weight 100.0.

 However, for the KL divergence where the points are derived from
 frequencies, the initial weight could be the sum of the frequencies. This
 puts the points on the simplex without having to perform the division.

 Sent from my iPhone

 On Jan 19, 2015, at 5:02 PM, Xiangrui Meng notificati...@github.com
 wrote:

 In
 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 https://github.com/apache/spark/pull/2634#discussion-diff-23197316:

  + * distributed under the License is distributed on an "AS IS" BASIS,
  + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  + * See the License for the specific language governing permissions and
  + * limitations under the License.
  + */
  +
  +package org.apache.spark.mllib.clustering.metrics
  +
  +import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
  +
  +import org.apache.spark.mllib.base._
  +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
  +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
  +
  +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
  +  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)

 Could you elaborate more on the benefit of using homogeneous
 coordinates? It should at least be documented what weight stands for
 here. If a cluster has 100 points, do you 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70588270
  
@derrickburns We use dense vectors to store cluster centers, because the 
centers are very likely to become dense during aggregation. If there are zeros, 
they can be efficiently compressed before sending back to the driver. For 
performance, we never add two sparse vectors together. I'm not sure whether 
this answers your question.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70583162
  
@mengxr

One more thing regarding sparse vectors.  Sparse vectors can become dense
under cluster creation, which, in turn, can cause the running time of the
K-means clustering to skyrocket.

To address this problem, one can project clusters onto a sparse vector
before performing distance calculation.  My current version of the
clusterer does this when the appropriate distance object is selected.
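
A hedged sketch of that projection trick for the squared Euclidean case (an illustrative helper, not the clusterer's actual code): the distance scan touches only the point's non-zero coordinates, with the dense center's squared norm cached once.

```scala
// ||p - c||^2 = ||c||^2 + sum over p's support of ((v - c_i)^2 - c_i^2)
//             = ||c||^2 + sum over p's support of (v^2 - 2 * v * c_i)
def sparseSqDist(indices: Array[Int], values: Array[Double],
                 center: Array[Double], centerSqNorm: Double): Double = {
  var acc = centerSqNorm
  var i = 0
  while (i < indices.length) {
    val c = center(indices(i))
    val v = values(i)
    acc += v * v - 2.0 * v * c // swap c_i's contribution for (v - c_i)^2
    i += 1
  }
  acc
}
```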

On Sun, Jan 18, 2015 at 7:59 PM, Derrick Burns derrickrbu...@gmail.com
wrote:

 @mengxr

 I have implemented several variants of Kullback-Leibler divergence in my 
separate
 GitHub repository
 https://github.com/derrickburns/generalized-kmeans-clustering.  These
 variants are more efficient than the standard KL-divergence, which is
 defined on R+ ^ n because they take advantage of extra knowledge of the
 domain. I have used these variants with much success (i.e. much faster
 running time) in my large scale clustering runs.

 On Sat, Jan 17, 2015 at 7:02 PM, UCB AMPLab notificati...@github.com
 wrote:

 Test FAILed.
 Refer to this link for build results (access rights to CI server needed):
 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25711/
 Test FAILed.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-70394598.








[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70589276
  
In my application (n-gram contexts), the sparse vectors can be of extremely 
high dimension. To make the problem manageable, I select the k most important 
dimensions per point. For a cluster of m points, I can have a sparse vector of 
m*k non-zero values. Since some clusters can become quite large (O(n) in 
size), I can get sparse vectors of O(nk) non-zero values.  Still, the vector is 
sparse, since the dimension of the vector is potentially 2^31-1.

So, I cannot treat the vector as dense. 

To deal with the growth problem, I have implemented a centroid that only 
retains the k most important features.
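
A hedged sketch of such a truncated centroid (names and the pruning rule are illustrative assumptions): after each update it keeps only the k entries with the largest absolute weight, so the center's support cannot grow without bound.

```scala
class TruncatedCentroid(k: Int) {
  private val weights = scala.collection.mutable.Map.empty[Int, Double]

  // Accumulate a sparse point given as parallel index/value arrays.
  def add(indices: Array[Int], values: Array[Double]): Unit = {
    for (i <- indices.indices)
      weights(indices(i)) = weights.getOrElse(indices(i), 0.0) + values(i)
    prune()
  }

  // Retain only the k most important (largest-magnitude) features.
  private def prune(): Unit =
    if (weights.size > k) {
      val keep = weights.toSeq.sortBy { case (_, w) => -math.abs(w) }.take(k)
      weights.clear()
      weights ++= keep
    }

  def nonZeros: Seq[(Int, Double)] = weights.toSeq.sortBy(_._1)
}
```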

Sent from my iPhone

 On Jan 19, 2015, at 5:07 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 @derrickburns We use dense vectors to store cluster centers, because the 
centers are very likely to become dense during aggregation. If there are zeros, 
they can be efficiently compressed before sending back to the driver. For 
performance, we never add two sparse vectors together. I'm not sure whether 
this answers your question.
 
 —
 Reply to this email directly or view it on GitHub.
 





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23198417
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

For the squared Euclidean distance case, weights would naturally start at 
1.0 for each input point, so that a cluster with 100 points would have weight 
100.0.

However, for the KL divergence where the points are derived from 
frequencies, the initial weight could be the sum of the frequencies. This puts 
the points on the simplex without having to perform the division.
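
A small sketch of that weighting convention in homogeneous coordinates (the case class is illustrative, not the FPoint class from the diff):

```scala
case class HPoint(raw: Array[Double], weight: Double) {
  // The inhomogeneous view divides by the weight only when actually needed.
  def inhomogeneous: Array[Double] = raw.map(_ / weight)
}

// Squared Euclidean: every input point starts with weight 1.0, so a cluster
// of 100 points accumulates weight 100.0.
def euclideanPoint(v: Array[Double]): HPoint = HPoint(v, 1.0)

// KL on frequency data: the weight is the sum of the frequencies, which puts
// the point on the simplex without performing the division up front.
def klPoint(freqs: Array[Double]): HPoint = HPoint(freqs, freqs.sum)
```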

Sent from my iPhone

 On Jan 19, 2015, at 5:02 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 In 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala:
 
  + * distributed under the License is distributed on an "AS IS" BASIS,
  + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  + * See the License for the specific language governing permissions and
  + * limitations under the License.
  + */
  +
  +package org.apache.spark.mllib.clustering.metrics
  +
  +import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
  +
  +import org.apache.spark.mllib.base._
  +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
  +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
  +
  +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
  +  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
 Could you elaborate more on the benefit of using homogeneous coordinates?
 It should at least be documented what weight stands for here. If a cluster
 has 100 points, do you assign its center a weight of 100? If so, how do you
 compute the distance between this center and an input point?
 
 —
 Reply to this email directly or view it on GitHub.
 





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23197318
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import breeze.linalg.{ Vector => BV }
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+package object base {
--- End diff --

 1) modifications to BLAS to support arbitrary functions and axpy on 
SparseVectors

We already support axpy on SparseVectors in `mllib.linalg.BLAS`. What do 
you mean by `arbitrary functions`? If the target is to improve k-means, we only 
need to implement BLAS ops required by k-means.

 2) definition of PointOps trait and various implementations for different 
divergences
 3) rewrite of clusterer to use PointOps trait but only using the squared 
Euclidean distance implementation
 4) generalization of clustering interface to select different divergences.

Sounds good.

 5) store seeds

This is already covered in https://github.com/apache/spark/pull/3610, which 
is almost ready to merge.

 6) maintain exactly K clusters by splitting clusters heuristically.

We had an example in `StreamingKMeans`. But if we want to support different 
distances, we need to consider how to keep the new centers inside the domain.

Thanks for improving your implementation! Before we merge features one by 
one, it would be nice to maintain your implementation as a 3rd-party Spark 
package (spark-packages.org). That helps people find it and send feedback.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23197316
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

Could you elaborate more on the benefit of using homogeneous coordinates? 
It should at least be documented what weight stands for here. If a cluster 
has 100 points, do you assign its center a weight of 100? If so, how do you 
compute the distance between this center and an input point?





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-19 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r23198020
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

Simply put, computing the distance (and the cached values per point and 
center used in the distance) is faster in homogeneous coordinates. 

Imagine computing dot(x,y), where x and y are vectors. If x and y are in 
homogeneous coordinates, then it is faster to perform the dot product in 
homogeneous coordinates and then divide by the weights of the two vectors than it 
is to divide each homogeneous coordinate by the weights first and then do the dot 
product.
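
A hedged arithmetic sketch of that claim (illustrative code, not the PR's): both routes compute <x/wx, y/wy>, but the homogeneous route performs one division at the end instead of a division per coordinate up front.

```scala
def dotInhomogeneous(x: Array[Double], wx: Double,
                     y: Array[Double], wy: Double): Double = {
  val xn = x.map(_ / wx) // n divisions
  val yn = y.map(_ / wy) // n more divisions
  xn.zip(yn).map { case (a, b) => a * b }.sum
}

def dotHomogeneous(x: Array[Double], wx: Double,
                   y: Array[Double], wy: Double): Double = {
  var s = 0.0
  var i = 0
  while (i < x.length) { s += x(i) * y(i); i += 1 }
  s / (wx * wy) // a single division by the two weights
}
```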

Sent from my iPhone

 On Jan 19, 2015, at 5:02 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 In 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala:
 
  + * distributed under the License is distributed on an "AS IS" BASIS,
  + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  + * See the License for the specific language governing permissions and
  + * limitations under the License.
  + */
  +
  +package org.apache.spark.mllib.clustering.metrics
  +
  +import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
  +
  +import org.apache.spark.mllib.base._
  +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
  +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
  +
  +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
  +  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
 Could you elaborate more on the benefit of using homogeneous coordinates?
 It should at least be documented what weight stands for here. If a cluster
 has 100 points, do you assign its center a weight of 100? If so, how do you
 compute the distance between this center and an input point?
 
 —
 Reply to this email directly or view it on GitHub.
 





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-18 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70443890
  
@mengxr

I have implemented several variants of Kullback-Leibler divergence in
my separate
GitHub repository
https://github.com/derrickburns/generalized-kmeans-clustering.  These
variants are more efficient than the standard KL-divergence, which is
defined on R+ ^ n because they take advantage of extra knowledge of the
domain. I have used these variants with much success (i.e. much faster
running time) in my large scale clustering runs.

On Sat, Jan 17, 2015 at 7:02 PM, UCB AMPLab notificati...@github.com
wrote:

 Test FAILed.
 Refer to this link for build results (access rights to CI server needed):
 https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25711/
 Test FAILed.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-70394598.






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70394597
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25711/consoleFull) for PR 2634 at commit [`35da8e9`](https://github.com/apache/spark/commit/35da8e9e188e66946d5799d061ecc3ca150f).
 * This patch **fails** unit tests.
 * This patch **does not** merge cleanly!






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70394598
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25711/
Test FAILed.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70392876
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25711/consoleFull) for PR 2634 at commit [`35da8e9`](https://github.com/apache/spark/commit/35da8e9e188e66946d5799d061ecc3ca150f).
 * This patch **does not** merge cleanly!





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-16 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-70345238
  
I see the problem with sparse vectors and the KL divergence. 

I implemented a smoothing operation to approximate KL divergence.
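
The smoothing itself is not shown in this thread, so the following is only a plausible sketch, assuming additive (epsilon) smoothing that renormalizes back onto the simplex so log(p/q) stays finite:

```scala
def smooth(v: Array[Double], eps: Double = 1e-9): Array[Double] = {
  val padded = v.map(x => if (x == 0.0) eps else x) // give zero coords tiny mass
  val total = padded.sum
  padded.map(_ / total) // renormalize onto the simplex
}

def kl(p: Array[Double], q: Array[Double]): Double =
  p.zip(q).map { case (pi, qi) =>
    if (pi == 0.0) 0.0 else pi * math.log(pi / qi)
  }.sum
```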

Sent from my iPhone

 On Jan 8, 2015, at 3:55 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 @derrickburns I like the improvements implemented in this PR. But as 
@srowen mentioned, we have to resolve conflicts with the master branch before 
we can merge any PR. I compared the performance of this PR with master on 
mnist-digits (6x784, sparse, 10 clusters) locally and found the master 
runs 2-3x faster. I guess this is mainly caused by two changes.
 
 1. We replaced breeze operations by our own implementation. The latter is 
about 2-3x faster.
 2. Running k-means++ distributively has noticeable overhead with small k and 
feature dimension.
 
 I think it is still feasible to include features through separate PRs:
 
 1. remember previously computed best distances in k-means++ initialization
 2. allow fixing the random seed (addressed in #3610)
 3. variable number of clusters. We should discuss whether we want to have 
less than k clusters or split the biggest one if there are more than k points.
 4. parallelize k-means++. I think whether we should replace local k-means++ 
or make it configurable requires some discussion and performance comparison.
 5. support Bregman divergences
 
 Putting all of them together would certainly delay the review process and 
require resolving conflicts. I may have some time to prepare PRs for some of 
the features here, if you don't mind.
 
 For Bregman divergences, I'm thinking we can alter the formulation to 
support sparse vectors:
 
 d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
 
 where f(x), g(y), and f(y) - <y, g(y)> could be pre-computed and cached, 
 and <x, g(y)> can take advantage of sparse x. But I'm not sure whether this 
 formulation is really useful on any Bregman divergence other than the squared 
 distance and the Mahalanobis distance. For KL-divergence and generalized 
 I-divergence, the domain is R^d_+ and hence the points cannot be sparse.
 
 Besides those comments, I'm going to make some minor comments inline.
 
 —
 Reply to this email directly or view it on GitHub.
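
A hedged sketch of the cached formulation quoted above, instantiated for f(x) = ||x||^2, whose Bregman divergence is the squared Euclidean distance (all names are illustrative, not from the PR):

```scala
def f(v: Array[Double]): Double = v.map(x => x * x).sum

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// g(y) and f(y) - <y, g(y)> are cached once per center.
case class CachedCenter(g: Array[Double], fMinusYDotG: Double)

def cacheCenter(y: Array[Double]): CachedCenter = {
  val g = y.map(2.0 * _) // gradient of f at y
  CachedCenter(g, f(y) - dot(y, g))
}

// Per (point, center) pair only <x, g(y)> remains, which can exploit a
// sparse x; f(x) is cached once per point.
def d(fx: Double, x: Array[Double], c: CachedCenter): Double =
  fx - c.fMinusYDotG - dot(x, c.g)
```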
 





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-11 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22768878
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import breeze.linalg.{ Vector => BV }
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+package object base {
--- End diff --

I'm willing to take one last crack at contributing the Bregman 
modifications to Spark. Here is a layering of the pull requests:

1) modifications to BLAS to support arbitrary functions and axpy on 
SparseVectors
2) definition of PointOps trait and various implementations for different 
divergences 
3) rewrite of clusterer to use PointOps trait but only using the squared 
Euclidean distance implementation 
4) generalization of clustering interface to select different divergences.
5) store seeds
6) maintain exactly K clusters by splitting clusters heuristically.

I can submit PRs for 1-4. 

Thoughts?

Sent from my iPhone

 On Jan 8, 2015, at 4:00 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 In mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala:
 
  + *
  + * Unless required by applicable law or agreed to in writing, software
 + * distributed under the License is distributed on an "AS IS" BASIS,
 + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 + * See the License for the specific language governing permissions and
 + * limitations under the License.
 + */
 +
 +package org.apache.spark.mllib
 +
 +import breeze.linalg.{ Vector => BV }
 +
 +import org.apache.spark.mllib.linalg.Vector
 +import org.apache.spark.rdd.RDD
 +
 +package object base {
 Though private, having some doc would make the code easy to understand. 
For example,
 
 1. What does FP mean?
 2. inh?
 3. ...
 —
 Reply to this email directly or view it on GitHub.
 





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-11 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22774169
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import breeze.linalg.{ Vector => BV }
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+package object base {
--- End diff --

@mengxr

I tried to build the Spark master branch, but was unsuccessful.

So, instead of integrating my Bregman clusterer into the latest master
branch of Spark, I integrated it into my public GitHub project
https://github.com/derrickburns/generalized-kmeans-clustering.

Per your suggestion, this version supports sparse vectors.

Included is support for several Bregman divergences including: squared
Euclidean distance, Kullback Leibler divergence, logistic loss, and
generalized I-divergence.

I created a general BregmanDivergence trait and a separate BregmanPointOps
trait.  When mixed in together, one gets a specific BregmanPointOps object
that computes Bregman divergences very efficiently, caching one value per
point and two values per center (per your observation).  This version is
easier to understand than the previous one.
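
A hedged sketch of that trait layering (the trait names follow the comment, but the bodies and signatures are assumptions for illustration):

```scala
trait BregmanDivergence {
  def f(v: Array[Double]): Double
  def gradF(v: Array[Double]): Array[Double]
}

trait BregmanPointOps extends BregmanDivergence {
  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  def pointCache(x: Array[Double]): Double = f(x) // one cached value per point

  def centerCache(y: Array[Double]): (Array[Double], Double) = { // two per center
    val g = gradF(y)
    (g, f(y) - dot(y, g))
  }

  // d(x, y) = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
  def divergence(fx: Double, x: Array[Double],
                 g: Array[Double], fyCached: Double): Double =
    fx - fyCached - dot(x, g)
}

// Mixing in a concrete divergence: the generalized I-divergence, with
// f(v) = sum(v_i * log(v_i) - v_i), defined for strictly positive entries.
object GeneralizedIOps extends BregmanPointOps {
  def f(v: Array[Double]): Double = v.map(x => x * math.log(x) - x).sum
  def gradF(v: Array[Double]): Array[Double] = v.map(math.log)
}
```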

Included also are three different implementations of K-Means clustering: 1)
one that caches distances (CachingKMeans); 2) one that tracks which clusters
moved, which points are assigned to which clusters, and the distance to
the closest cluster (TrackingKMeans); and 3) one that recomputes all
clusters and distances (SingleKMeans).

I have found that the TrackingKMeans clusterer has the best performance on
large data sets.

I have also found that maintaining cluster centers in homogeneous
coordinates is more efficient than using inhomogeneous coordinates for
moderate to high dimensional data. There is no penalty for low dimensional
data.

If/when you include such changes in Spark, I would appreciate being
included in the list of contributors.

Regards,

Derrick Burns



On Sun, Jan 11, 2015 at 10:52 AM, Derrick Burns derrickrbu...@gmail.com
wrote:

 I'm willing to take one last crack at contributing the Bregman
 modifications to Spark. Here is a layering of the pull requests:

 1) modifications to BLAS to support arbitrary functions and axpy on
 SparseVectors
 2) definition of PointOps trait and various implementations for different
 divergences
 3) rewrite of clusterer to use PointOps trait but only using the squared
 Euclidean distance implementation
 4) generalization of clustering interface to select different divergences.
 5) store seeds
 6) maintain exactly K clusters by splitting clusters heuristically.

 I can submit PRs for 1-4.

 Thoughts?

 Sent from my iPhone

 On Jan 8, 2015, at 4:00 PM, Xiangrui Meng notificati...@github.com
 wrote:

 In mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala
 https://github.com/apache/spark/pull/2634#discussion-diff-22693266:

  + *
  + * Unless required by applicable law or agreed to in writing, software
 + * distributed under the License is distributed on an "AS IS" BASIS,
 + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 + * See the License for the specific language governing permissions and
 + * limitations under the License.
 + */
 +
 +package org.apache.spark.mllib
 +
 +import breeze.linalg.{ Vector => BV }
 +
 +import org.apache.spark.mllib.linalg.Vector
 +import org.apache.spark.rdd.RDD
 +
 +package object base {

 Though private, having some doc would make the code easy to understand.
 For example,

1. What does FP mean?
2. inh?
3. ...

 —
   

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-10 Thread derrickburns
Github user derrickburns commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22764827
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

It is most efficient to maintain the cluster centers in homogeneous
coordinates.

On Thu, Jan 8, 2015 at 3:56 PM, Xiangrui Meng notificati...@github.com
wrote:

 In
 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 https://github.com/apache/spark/pull/2634#discussion-diff-22693112:

  + * distributed under the License is distributed on an "AS IS" BASIS,
  + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  + * See the License for the specific language governing permissions and
  + * limitations under the License.
  + */
  +
  +package org.apache.spark.mllib.clustering.metrics
  +
  +import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
  +
  +import org.apache.spark.mllib.base._
  +import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
  +import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
  +
  +class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
  +  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)

 Should weight be only used in aggregation rather than distance
 computation?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634/files#r22693112.






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-09 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-69404412
  
Thanks for the information on the speedup that you obtained by eliminating
Breeze.  I was unaware that the performance is so poor.  To what do you
attribute the poor performance of Breeze?

One can certainly split out 1-4 AFTER doing 5. My point was that doing 5
(supporting Bregman divergences) requires touching practically every method
in the clustering package.

3. Supporting a variable number of clusters requires a significant rewrite
as well, since the assumption that there are K clusters is made repeatedly
in the code.  It seems like adding a way to split clusters introduces a
whole other dimension of flexibility -- which I am not against -- that
would complicate the interface.  For example, what heuristic would you use
to identify a cluster to split, and how would you split it?  It seems to me
that once you go down that path, you are headed toward a divisive
clustering algorithm.

4. As for the overhead of running k-means++ distributively, will someone
use Spark for small k and feature dimension?

5. Bregman divergences

As it turns out, I now have a need to perform clustering on sparse data, so
I am making a pass over my private implementation to 1) eliminate the use
of Breeze and 2) use the Spark Vector classes.  When I am done, I can
include these modifications in an updated pull request.

I do not understand your conclusion that for KL-divergence and generalized
I-divergence the points cannot be sparse.  The domain need not be R^d+.
Rather, the domain can be R^infinity+.

For example, in the problem that I am currently working on, I am creating
sparse vectors where the vectors represent the frequency of occurrence of
the contexts for given n-grams.  The space of the contexts is extremely
large, however, I am only interested in the M most frequent contexts per
n-gram.  I use a sketch to identify the M most frequent contexts per
n-gram.  This gives me an M length SparseVector of non-negative frequency
values.  Thus, for each n-gram, I will have a SparseVector that represents
the distribution of the more frequently occurring contexts.  I do not see a
problem using this SparseVector with the KL-divergence metric. Am I missing
something?
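
To make the point concrete, here is a minimal sketch (illustrative only, not
code from this pull request) of KL-divergence computed over just the active
entries of a sparse vector, under the usual convention 0 * log 0 = 0; the
dense array for the center is an assumption of the sketch:

~~~
import org.apache.spark.mllib.linalg.SparseVector

// KL(p || q) over p's active entries only. Zero entries of p contribute
// nothing, so a sparse p is unproblematic as long as q is positive
// wherever p is. Assumes both p and q are normalized distributions.
def sparseKL(p: SparseVector, q: Array[Double]): Double = {
  var d = 0.0
  var i = 0
  while (i < p.indices.length) {
    val pi = p.values(i)
    if (pi > 0.0) d += pi * math.log(pi / q(p.indices(i)))
    i += 1
  }
  d
}
~~~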

On Thu, Jan 8, 2015 at 3:55 PM, Xiangrui Meng notificati...@github.com
wrote:

 @derrickburns https://github.com/derrickburns I like the improvements
 implemented in this PR. But as @srowen https://github.com/srowen
 mentioned, we have to resolve conflicts with the master branch before we
 can merge any PR. I compared the performance of this PR with master on
 mnist-digits (6x784, sparse, 10 clusters) locally and found the master
 runs 2-3x faster. I guess this is mainly caused by two changes.

1. We replaced breeze operations by our own implementation. The latter
is about 2-3x faster.
2. Running k-means++ distributively has noticeable overhead with small
k and feature dimension.

 I think it is still feasible to include features through separate PRs:

1. remember previously computed best distances in k-means++
initialization
2. allow fixing the random seed (addressed in #3610
https://github.com/apache/spark/pull/3610)
3. variable number of clusters. We should discuss whether we want to
have less than k clusters or split the biggest one if there are more 
than k
points.
4. parallelize k-means++. I think whether we should replace local
k-means++ or make it configurable requires some discussion and 
performance
comparison.
5. support Bregman divergences

 Putting all of them together would certainly delay the review process and
 require resolving conflicts. I may have some time to prepare PRs for some
 of the features here, if you don't mind.

 For Bregman divergences, I'm thinking we can alter the formulation to
 support sparse vectors:

 d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>

 where f(x), g(y), and f(y) - <y, g(y)> could be pre-computed and cached,
 and <x, g(y)> can take advantage of sparse x. But I'm not sure whether
 this formulation is really useful for any Bregman divergence other than the
 squared distance and the Mahalanobis distance. For KL-divergence and
 generalized I-divergence, the domain is R^d_+ and hence the points cannot
 be sparse.

 Besides those comments, I'm going to make some minor comments inline.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-69272196.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693139
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
+}
+
+/**
+ * Euclidean distance measure, expedited by pre-computing vector norms
+ */
+class FastEuclideanOps extends PointOps[FastEUPoint, FastEUPoint] with Serializable {
+
+  type C = FastEUPoint
+  type P = FastEUPoint
+
+  val epsilon = 1e-4
+
+  /* compute a lower bound on the euclidean distance */
+
+  def distance(p: P, c: C, upperBound: Double): Double = {
+    val d = if (p.weight == Zero || c.weight == Zero) {
+      p.norm + c.norm
+    } else {
+      val x = p.raw.dot(c.raw) / (p.weight * c.weight)
--- End diff --

same question about using `weight` in `distance`
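
For reference, the trick under review (caching squared norms so each distance
evaluation costs a single dot product, since d^2(p, c) = |p|^2 + |c|^2 - 2<p, c>)
can be sketched as follows; this is a simplified, unweighted illustration, not
the PR's exact code:

~~~
import breeze.linalg.{ DenseVector => BDV }

// Cache |v|^2 once per point; each distance then needs a single dot product.
// Weights are deliberately omitted here, which is exactly what the review
// question above is probing at.
case class NormedPoint(raw: BDV[Double]) {
  val normSq: Double = raw.dot(raw) // cached squared norm
}

def fastSquaredDistance(p: NormedPoint, c: NormedPoint): Double =
  math.max(0.0, p.normSq + c.normSq - 2.0 * p.raw.dot(c.raw)) // clamp rounding error

val a = NormedPoint(BDV(1.0, 2.0))
val b = NormedPoint(BDV(3.0, 0.0))
val d2 = fastSquaredDistance(a, b) // 8.0
~~~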


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693112
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{ DenseVector => BDV, SparseVector => BSV, Vector => BV }
+
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{ SparseVector, DenseVector, Vector }
+import org.apache.spark.mllib.base.{ Centroid, FPoint, PointOps, Infinity, Zero }
+
+class FastEUPoint(raw: BV[Double], weight: Double) extends FPoint(raw, weight) {
+  val norm = if (weight == Zero) Zero else raw.dot(raw) / (weight * weight)
--- End diff --

Should `weight` be only used in aggregation rather than distance 
computation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-69272196
  
@derrickburns I like the improvements implemented in this PR. But as 
@srowen mentioned, we have to resolve conflicts with the master branch before 
we can merge any PR. I compared the performance of this PR with master on 
mnist-digits (6x784, sparse, 10 clusters) locally and found the master 
runs 2-3x faster. I guess this is mainly caused by two changes.

1. We replaced breeze operations by our own implementation. The latter is 
about 2-3x faster.
1. Running k-means++ distributively has noticeable overhead with small k 
and feature dimension.

I think it is still feasible to include features through separate PRs:

1. remember previously computed best distances in k-means++ initialization (a sketch follows this list)
1. allow fixing the random seed (addressed in #3610)
1. variable number of clusters. We should discuss whether we want to have 
less than k clusters or split the biggest one if there are more than k points. 
1. parallelize k-means++. I think whether we should replace local k-means++ 
or make it configurable requires some discussion and performance comparison.
1. support Bregman divergences
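
A single-machine sketch of item 1, for concreteness (illustrative only; the
PR's distributed version differs, and the structure and names here are
hypothetical):

~~~
import scala.util.Random

// k-means++ seeding that remembers each point's best distance so far:
// after a new center is added, only the distance to that one center is
// computed, and best(i) is updated if it improved.
def seedPlusPlus(points: Array[Array[Double]], k: Int, rnd: Random): Array[Array[Double]] = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  val centers = Array.newBuilder[Array[Double]]
  var last = points(rnd.nextInt(points.length)) // first center: uniform draw
  centers += last
  val best = Array.fill(points.length)(Double.PositiveInfinity)
  for (_ <- 1 until k) {
    var i = 0
    while (i < points.length) { // refresh cache against the newest center only
      best(i) = math.min(best(i), sqDist(points(i), last))
      i += 1
    }
    var r = rnd.nextDouble() * best.sum // sample proportional to best(i)
    var j = 0
    while (j < points.length - 1 && r > best(j)) { r -= best(j); j += 1 }
    last = points(j)
    centers += last
  }
  centers.result()
}
~~~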

Putting all of them together would certainly delay the review process and 
require resolving conflicts. I may have some time to prepare PRs for some of 
the features here, if you don't mind.

For Bregman divergences, I'm thinking we can alter the formulation to 
support sparse vectors:

~~~
d(x, y) = f(x) - f(y) - <x - y, g(y)> = f(x) - (f(y) - <y, g(y)>) - <x, g(y)>
~~~

where `f(x)`, `g(y)`, and `f(y) - <y, g(y)>` could be pre-computed and 
cached, and `<x, g(y)>` can take advantage of sparse `x`. But I'm not sure 
whether this formulation is really useful for any Bregman divergence other than 
the squared distance and the Mahalanobis distance. For KL-divergence and 
generalized I-divergence, the domain is R^d_+ and hence the points cannot be 
sparse.
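
Concretely, the caching could look like this (a sketch under the assumptions
above; `f` is the divergence's convex function and `g` its gradient, and all
names are hypothetical):

~~~
// d(x, y) = f(x) - f(y) - <x - y, g(y)>, rewritten as
// f(x) - cached(y) - <x, g(y)> with cached(y) = f(y) - <y, g(y)>,
// so each evaluation needs only a sparse dot product against g(y).
case class BregmanCenter(grad: Array[Double], cached: Double)

def makeCenter(y: Array[Double], f: Array[Double] => Double,
    g: Array[Double] => Array[Double]): BregmanCenter = {
  val gy = g(y)
  val yDotGy = y.zip(gy).map { case (a, b) => a * b }.sum
  BregmanCenter(gy, f(y) - yDotGy)
}

// x arrives in sparse form as parallel index/value arrays; f(x) is assumed
// to be pre-computed once per point.
def divergence(indices: Array[Int], values: Array[Double], fx: Double,
    c: BregmanCenter): Double = {
  var dot = 0.0
  var i = 0
  while (i < indices.length) { dot += values(i) * c.grad(indices(i)); i += 1 }
  fx - c.cached - dot
}
~~~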

Besides those comments, I'm going to make some minor comments inline.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693062
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
--- End diff --

Should it be `mllib.clustering` as the file is under `clustering/`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r22693266
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import breeze.linalg.{ Vector => BV }
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+package object base {
--- End diff --

Though private, having some doc would make the code easy to understand. For 
example,

1. What does `FP` mean?
1. `inh`?
1. ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-07 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68990204
  
(@derrickburns this PR can't be merged as-is. It contains merge conflicts 
with master now, as Github notes here. You would need to rebase it.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-06 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68985129
  
The pull request that you integrated on December 3 is redundant with this
one. Therefore you need not worry about merging in those changes.  Simply
select this version.

Are you willing to integrate this pull request as is?


On Mon, Jan 5, 2015 at 4:01 PM, Derrick Burns derrickrbu...@gmail.com
wrote:

 I've said this before, so please forgive me for being repetitive.

 The new implementation is a rewrite, not a patch, so it is not possible to
 parcel out the changes into distinct PRs.

 On Mon, Jan 5, 2015 at 1:09 PM, Xiangrui Meng notificati...@github.com
 wrote:

 @derrickburns https://github.com/derrickburns I tried to merge this PR
 with the current master. But there are many conflicts, mainly because we
 removed breeze from the implementation. To merge the changes, I think we
 should make separate PRs for the JIRAs mentioned here, from fixes to new
 features. Do you mind me splitting this PR? Thanks!

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-68781838.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-05 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68804845
  
I've said this before, so please forgive me for being repetitive.

The new implementation is a rewrite, not a patch, so it is not possible to
parcel out the changes into distinct PRs.

On Mon, Jan 5, 2015 at 1:09 PM, Xiangrui Meng notificati...@github.com
wrote:

 @derrickburns https://github.com/derrickburns I tried to merge this PR
 with the current master. But there are many conflicts, mainly because we
 removed breeze from the implementation. To merge the changes, I think we
 should make separate PRs for the JIRAs mentioned here, from fixes to new
 features. Do you mind me splitting this PR? Thanks!

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-68781838.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2015-01-05 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68781838
  
@derrickburns I tried to merge this PR with the current master. But there 
are many conflicts, mainly because we removed breeze from the implementation. 
To merge the changes, I think we should make separate PRs for the JIRAs 
mentioned here, from fixes to new features. Do you mind me splitting this PR? 
Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-12-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68409940
  
@derrickburns I'm going to check the implementation during the break. Since 
it is hard to locate the serialization issue, do you mind me making changes to 
your implementation and sending you a pull request after?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-12-30 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68414436
  
No problem. Thx!

Sent from my iPhone

 On Dec 30, 2014, at 3:23 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 @derrickburns I'm going to check the implementation during the break. 
Since it is hard to locate the serialization issue, do you mind me making 
changes to your implementation and sending you a pull request after?
 
 —
 Reply to this email directly or view it on GitHub.
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-12-30 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68425655
  
Thx for taking this on.

Sent from my iPhone

 On Dec 30, 2014, at 3:23 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 @derrickburns I'm going to check the implementation during the break. 
Since it is hard to locate the serialization issue, do you mind me making 
changes to your implementation and sending you a pull request after?
 
 —
 Reply to this email directly or view it on GitHub.
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-12-28 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68218597
  
That would be great!

On Sat, Dec 27, 2014 at 12:59 PM, Nicholas Chammas notificati...@github.com
 wrote:

 @mengxr https://github.com/mengxr Now that 1.2.0 is out, can we
 schedule a rough timeframe for reviewing this patch?

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-68190043.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-12-27 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-68190043
  
@mengxr Now that 1.2.0 is out, can we schedule a rough timeframe for 
reviewing this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-60351026
  
@derrickburns The features are useful, so please don't delete the PR. Since 
this is a major refactor of `KMeans`, I need to allocate a block of time to 
review the code and run performance tests. But I'm running out of bandwidth 
now. Do you mind me doing the review after the v1.2 feature freeze?

Btw, it would be really helpful if you can post some performance testing 
results against the current implementation in MLlib, for large-scale sparse and 
dense data.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-23 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-60336780
  
@mengxr  Is there interest in this pull request?  Should I delete it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-08 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58433653
  
@derrickburns Is it possible to split this PR into small ones, so we can 
track down the closure serialization issue?

Another question I have is on convergence with other distance measures, 
as discussed in #1964. If we use the weighted mean for new centers, is 
convergence guaranteed for distances other than L2?
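
(For what it's worth, there is a known answer for Bregman divergences: the
weighted mean is the exact minimizer of the within-cluster divergence, so
Lloyd-style updates cannot increase the total cost; see Banerjee et al.,
"Clustering with Bregman Divergences", JMLR 2005. Sketch, for strictly
convex f with gradient g and d_f(x, mu) = f(x) - f(mu) - <x - mu, g(mu)>:

~~~
grad_mu sum_i w_i d_f(x_i, mu)  =  Hess f(mu) * sum_i w_i (mu - x_i)  =  0
  <=>  mu = (sum_i w_i x_i) / (sum_i w_i)
~~~

For an arbitrary distance that is not a Bregman divergence, no such
guarantee holds.)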


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-08 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58449199
  
The closure data capture problem occurs at MultiKMeans.scala:105.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58104576
  
@derrickburns The style test doesn't catch everything, unfortunately. The Spark 
Code Style Guide is the first place to check. I will mark a few examples inline.

I think the best way to track down the problem is to split this PR into 
small ones and see whether we can find it. Is there a good way to split this? 
Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488327
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansParallel.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.mllib.base.{ PointOps, FP, Zero }
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+import scala.collection.mutable.ArrayBuffer
--- End diff --

move scala import before spark imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488318
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,429 +17,57 @@
 
 package org.apache.spark.mllib.clustering
 
-import scala.collection.mutable.ArrayBuffer
 
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.random.XORShiftRandom
-
-/**
- * K-means clustering with support for multiple parallel runs and a 
k-means++ like initialization
- * mode (the k-means|| algorithm by Bahmani et al). When multiple 
concurrent runs are requested,
- * they are executed together with joint passes over the data for 
efficiency.
- *
- * This is an iterative algorithm that will make multiple passes over the 
data, so any RDDs given
- * to it should be cached by the user.
- */
-class KMeans private (
-private var k: Int,
-private var maxIterations: Int,
-private var runs: Int,
-private var initializationMode: String,
-private var initializationSteps: Int,
-private var epsilon: Double) extends Serializable with Logging {
-
-  /**
-   * Constructs a KMeans instance with default parameters: {k: 2, 
maxIterations: 20, runs: 1,
-   * initializationMode: k-means||, initializationSteps: 5, epsilon: 
1e-4}.
-   */
-  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4)
-
-  /** Set the number of clusters to create (k). Default: 2. */
-  def setK(k: Int): this.type = {
-this.k = k
-this
-  }
-
-  /** Set maximum number of iterations to run. Default: 20. */
-  def setMaxIterations(maxIterations: Int): this.type = {
-this.maxIterations = maxIterations
-this
-  }
-
-  /**
-   * Set the initialization algorithm. This can be either random to 
choose random points as
-   * initial cluster centers, or k-means|| to use a parallel variant of 
k-means++
-   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
-   */
-  def setInitializationMode(initializationMode: String): this.type = {
-if (initializationMode != KMeans.RANDOM && initializationMode != KMeans.K_MEANS_PARALLEL) {
-  throw new IllegalArgumentException("Invalid initialization mode: " + initializationMode)
-}
-this.initializationMode = initializationMode
-this
-  }
-
-  /**
-   * :: Experimental ::
-   * Set the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
-   * this many times with random starting conditions (configured by the 
initialization mode), then
-   * return the best clustering found over any run. Default: 1.
-   */
-  @Experimental
-  def setRuns(runs: Int): this.type = {
-if (runs <= 0) {
-  throw new IllegalArgumentException("Number of runs must be positive")
-}
-this.runs = runs
-this
-  }
-
-  /**
-   * Set the number of steps for the k-means|| initialization mode. This 
is an advanced
-   * setting -- the default of 5 is almost always enough. Default: 5.
-   */
-  def setInitializationSteps(initializationSteps: Int): this.type = {
-if (initializationSteps <= 0) {
-  throw new IllegalArgumentException("Number of initialization steps must be positive")
-}
-this.initializationSteps = initializationSteps
-this
-  }
-
-  /**
-   * Set the distance threshold within which we've consider centers to 
have converged.
-   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
-   */
-  def setEpsilon(epsilon: Double): this.type = {
-this.epsilon = epsilon
-this
-  }
-
-  /** Whether a warning should be logged if the input RDD is uncached. */
-  private var warnOnUncachedInput = true
-
-  /** Disable warnings about uncached input. */
-  private[spark] def disableUncachedWarning(): this.type = {
-warnOnUncachedInput = false
-this
-  }  
-
-  /**
-   * Train a K-means model on the given set of points; `data` should be 
cached for high
-   * performance, because this is an iterative algorithm.
-   */
-  def run(data: RDD[Vector]): KMeansModel = {
-
-if (warnOnUncachedInput &&

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488296
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/GeneralizedKMeansModel.scala
 ---
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.base.{PointOps, FP}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for K-means. Each point belongs to the cluster with 
the closest center.
+ */
+private[mllib] class GeneralizedKMeansModel[P <: FP, C <: FP](
+  val pointOps: PointOps[P, C],
+  val centers: Array[C])
+  extends Serializable {
--- End diff --

merge this line to 31


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488309
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,429 +17,57 @@
 
 package org.apache.spark.mllib.clustering
 
-import scala.collection.mutable.ArrayBuffer
 
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.random.XORShiftRandom
-
-/**
- * K-means clustering with support for multiple parallel runs and a 
k-means++ like initialization
- * mode (the k-means|| algorithm by Bahmani et al). When multiple 
concurrent runs are requested,
- * they are executed together with joint passes over the data for 
efficiency.
- *
- * This is an iterative algorithm that will make multiple passes over the 
data, so any RDDs given
- * to it should be cached by the user.
- */
-class KMeans private (
-private var k: Int,
-private var maxIterations: Int,
-private var runs: Int,
-private var initializationMode: String,
-private var initializationSteps: Int,
-private var epsilon: Double) extends Serializable with Logging {
-
-  /**
-   * Constructs a KMeans instance with default parameters: {k: 2, 
maxIterations: 20, runs: 1,
-   * initializationMode: k-means||, initializationSteps: 5, epsilon: 
1e-4}.
-   */
-  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4)
-
-  /** Set the number of clusters to create (k). Default: 2. */
-  def setK(k: Int): this.type = {
-this.k = k
-this
-  }
-
-  /** Set maximum number of iterations to run. Default: 20. */
-  def setMaxIterations(maxIterations: Int): this.type = {
-this.maxIterations = maxIterations
-this
-  }
-
-  /**
-   * Set the initialization algorithm. This can be either random to 
choose random points as
-   * initial cluster centers, or k-means|| to use a parallel variant of 
k-means++
-   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
-   */
-  def setInitializationMode(initializationMode: String): this.type = {
-if (initializationMode != KMeans.RANDOM && initializationMode != KMeans.K_MEANS_PARALLEL) {
-  throw new IllegalArgumentException("Invalid initialization mode: " + initializationMode)
-}
-this.initializationMode = initializationMode
-this
-  }
-
-  /**
-   * :: Experimental ::
-   * Set the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
-   * this many times with random starting conditions (configured by the 
initialization mode), then
-   * return the best clustering found over any run. Default: 1.
-   */
-  @Experimental
-  def setRuns(runs: Int): this.type = {
-if (runs <= 0) {
-  throw new IllegalArgumentException("Number of runs must be positive")
-}
-this.runs = runs
-this
-  }
-
-  /**
-   * Set the number of steps for the k-means|| initialization mode. This 
is an advanced
-   * setting -- the default of 5 is almost always enough. Default: 5.
-   */
-  def setInitializationSteps(initializationSteps: Int): this.type = {
-if (initializationSteps <= 0) {
-  throw new IllegalArgumentException("Number of initialization steps must be positive")
-}
-this.initializationSteps = initializationSteps
-this
-  }
-
-  /**
-   * Set the distance threshold within which we've consider centers to 
have converged.
-   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
-   */
-  def setEpsilon(epsilon: Double): this.type = {
-this.epsilon = epsilon
-this
-  }
-
-  /** Whether a warning should be logged if the input RDD is uncached. */
-  private var warnOnUncachedInput = true
-
-  /** Disable warnings about uncached input. */
-  private[spark] def disableUncachedWarning(): this.type = {
-warnOnUncachedInput = false
-this
-  }  
-
-  /**
-   * Train a K-means model on the given set of points; `data` should be 
cached for high
-   * performance, because this is an iterative algorithm.
-   */
-  def run(data: RDD[Vector]): KMeansModel = {
-
-if (warnOnUncachedInput &&

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488331
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansPlusPlus.scala ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.mllib.base.{PointOps, FP, Infinity, One, Zero}
+import org.apache.spark.util.random.XORShiftRandom
+import org.apache.spark.{Logging, SparkContext}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/**
+ *
+ * The KMeans++ initialization algorithm
+ *
+ * @param pointOps distance function
+ * @tparam P point type
+ * @tparam C center type
+ */
+private[mllib] class KMeansPlusPlus[P <: FP : ClassTag, C <: FP : ClassTag](
+  pointOps: PointOps[P, C]) extends Serializable with Logging {
+
+  /**
+   * We will maintain for each point the distance to its closest cluster 
center.
+   * Since only one center is added on each iteration, recomputing the 
closest cluster center
+   * only requires computing the distance to the new cluster center if
+   * that distance is less than the closest cluster center.  
+   */
+  case class FatPoint(location: P, index: Int, weight: Double, distance: 
Double)
+
+  /**
+   * K-means++ on the weighted point set `points`. This first does the 
K-means++
+   * initialization procedure and then rounds of Lloyd's algorithm.
+   */
+
+  def cluster(
+   sc: SparkContext,
--- End diff --

4-space indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488320
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,429 +17,57 @@
 
 package org.apache.spark.mllib.clustering
 
-import scala.collection.mutable.ArrayBuffer
 
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.random.XORShiftRandom
-
-/**
- * K-means clustering with support for multiple parallel runs and a 
k-means++ like initialization
- * mode (the k-means|| algorithm by Bahmani et al). When multiple 
concurrent runs are requested,
- * they are executed together with joint passes over the data for 
efficiency.
- *
- * This is an iterative algorithm that will make multiple passes over the 
data, so any RDDs given
- * to it should be cached by the user.
- */
-class KMeans private (
-private var k: Int,
-private var maxIterations: Int,
-private var runs: Int,
-private var initializationMode: String,
-private var initializationSteps: Int,
-private var epsilon: Double) extends Serializable with Logging {
-
-  /**
-   * Constructs a KMeans instance with default parameters: {k: 2, 
maxIterations: 20, runs: 1,
-   * initializationMode: k-means||, initializationSteps: 5, epsilon: 
1e-4}.
-   */
-  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4)
-
-  /** Set the number of clusters to create (k). Default: 2. */
-  def setK(k: Int): this.type = {
-this.k = k
-this
-  }
-
-  /** Set maximum number of iterations to run. Default: 20. */
-  def setMaxIterations(maxIterations: Int): this.type = {
-this.maxIterations = maxIterations
-this
-  }
-
-  /**
-   * Set the initialization algorithm. This can be either random to 
choose random points as
-   * initial cluster centers, or k-means|| to use a parallel variant of 
k-means++
-   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
-   */
-  def setInitializationMode(initializationMode: String): this.type = {
-if (initializationMode != KMeans.RANDOM && initializationMode != KMeans.K_MEANS_PARALLEL) {
-  throw new IllegalArgumentException("Invalid initialization mode: " + initializationMode)
-}
-this.initializationMode = initializationMode
-this
-  }
-
-  /**
-   * :: Experimental ::
-   * Set the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
-   * this many times with random starting conditions (configured by the 
initialization mode), then
-   * return the best clustering found over any run. Default: 1.
-   */
-  @Experimental
-  def setRuns(runs: Int): this.type = {
-if (runs <= 0) {
-  throw new IllegalArgumentException("Number of runs must be positive")
-}
-this.runs = runs
-this
-  }
-
-  /**
-   * Set the number of steps for the k-means|| initialization mode. This 
is an advanced
-   * setting -- the default of 5 is almost always enough. Default: 5.
-   */
-  def setInitializationSteps(initializationSteps: Int): this.type = {
-if (initializationSteps <= 0) {
-  throw new IllegalArgumentException("Number of initialization steps must be positive")
-}
-this.initializationSteps = initializationSteps
-this
-  }
-
-  /**
-   * Set the distance threshold within which we've consider centers to 
have converged.
-   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
-   */
-  def setEpsilon(epsilon: Double): this.type = {
-this.epsilon = epsilon
-this
-  }
-
-  /** Whether a warning should be logged if the input RDD is uncached. */
-  private var warnOnUncachedInput = true
-
-  /** Disable warnings about uncached input. */
-  private[spark] def disableUncachedWarning(): this.type = {
-warnOnUncachedInput = false
-this
-  }  
-
-  /**
-   * Train a K-means model on the given set of points; `data` should be 
cached for high
-   * performance, because this is an iterative algorithm.
-   */
-  def run(data: RDD[Vector]): KMeansModel = {
-
-if (warnOnUncachedInput &&

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488293
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/GeneralizedKMeansModel.scala
 ---
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.base.{PointOps, FP}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for K-means. Each point belongs to the cluster with 
the closest center.
+ */
+private[mllib] class GeneralizedKMeansModel[P <: FP, C <: FP](
+  val pointOps: PointOps[P, C],
--- End diff --

4-space indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488325
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---
@@ -25,37 +25,28 @@ import org.apache.spark.mllib.linalg.Vector
 /**
  * A clustering model for K-means. Each point belongs to the cluster with 
the closest center.
  */
-class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable 
{
 
-  /** Total number of clusters. */
-  def k: Int = clusterCenters.length
+
+class KMeansModel(specific: GeneralizedKMeansModel[_,_]) {
--- End diff --

space after `,`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488297
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/GeneralizedKMeansModel.scala
 ---
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.base.{PointOps, FP}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for K-means. Each point belongs to the cluster with 
the closest center.
+ */
+private[mllib] class GeneralizedKMeansModel[P <: FP, C <: FP](
+  val pointOps: PointOps[P, C],
+  val centers: Array[C])
+  extends Serializable {
+
+  val k: Int = clusterCenters.length
+
+  def clusterCenters : Array[Vector] = centers.map{ c => pointOps.centerToVector(c) }
--- End diff --

use `.map(c => ...)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488364
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -255,4 +253,4 @@ class KMeansClusterSuite extends FunSuite with 
LocalClusterSparkContext {
   val cost = model.computeCost(points)
 }
   }
-}
+}
--- End diff --

newline at the end


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488334
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansPlusPlus.scala ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.mllib.base.{PointOps, FP, Infinity, One, Zero}
+import org.apache.spark.util.random.XORShiftRandom
+import org.apache.spark.{Logging, SparkContext}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+/**
+ *
+ * The KMeans++ initialization algorithm
+ *
+ * @param pointOps distance function
+ * @tparam P point type
+ * @tparam C center type
+ */
+private[mllib] class KMeansPlusPlus[P <: FP : ClassTag, C <: FP : ClassTag](
+  pointOps: PointOps[P, C]) extends Serializable with Logging {
+
+  /**
+   * We will maintain for each point the distance to its closest cluster 
center.
+   * Since only one center is added on each iteration, recomputing the 
closest cluster center
+   * only requires computing the distance to the new cluster center if
+   * that distance is less than the cached distance to the closest cluster center.
+   */
+  case class FatPoint(location: P, index: Int, weight: Double, distance: 
Double)
+
+  /**
+   * K-means++ on the weighted point set `points`. This first does the 
K-means++
+   * initialization procedure and then rounds of Lloyd's algorithm.
+   */
+
+  def cluster(
+   sc: SparkContext,
+   seed: Int,
+   points: Array[C],
+   weights: Array[Double],
+   k: Int,
+   maxIterations: Int,
+   numPartitions: Int): Array[C] = {
+val centers: Array[C] = getCenters(sc, seed, points, weights, k, 
numPartitions, 1)
+val pts =  sc.parallelize(points.map(pointOps.centerToPoint))
+new MultiKMeans(pointOps, maxIterations).cluster(pts, 
Array(centers))._2.centers
+  }
+
+  /**
+   * Select centers in rounds.  On each round, select 'perRound' centers, 
with probability of
+   * selection equal to the product of the given weights and distance to 
the closest cluster center
+   * of the previous round.
+   *
+   * @param sc the Spark context
+   * @param seed a random number seed
+   * @param points  the candidate centers
+   * @param weights  the weights on the candidate centers
+   * @param k  the total number of centers to select
+   * @param numPartitions the number of data partitions to use
+   * @param perRound the number of centers to add per round
+   * @return   an array of at most k cluster centers
+   */
+  def getCenters(sc: SparkContext, seed: Int, points: Array[C], weights: 
Array[Double], k: Int,
--- End diff --

chop parameters and 4-space indentation
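
For reference, the chopped-parameter layout being requested looks roughly like 
the sketch below (Scala; the signature and body are placeholders chosen to show 
the indentation, not the PR's actual code):

    object StyleSketch {
      // Parameters go one per line with a 4-space continuation indent;
      // the method body then returns to the usual 2-space indent.
      def getCenters(
          seed: Int,
          points: Array[Double],
          weights: Array[Double],
          k: Int,
          numPartitions: Int,
          perRound: Int): Array[Double] = {
        points.take(k) // placeholder body; only the layout matters here
      }
    }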





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488302
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,429 +17,57 @@
 
 package org.apache.spark.mllib.clustering
 
-import scala.collection.mutable.ArrayBuffer
 
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.random.XORShiftRandom
-
-/**
- * K-means clustering with support for multiple parallel runs and a 
k-means++ like initialization
- * mode (the k-means|| algorithm by Bahmani et al). When multiple 
concurrent runs are requested,
- * they are executed together with joint passes over the data for 
efficiency.
- *
- * This is an iterative algorithm that will make multiple passes over the 
data, so any RDDs given
- * to it should be cached by the user.
- */
-class KMeans private (
-private var k: Int,
-private var maxIterations: Int,
-private var runs: Int,
-private var initializationMode: String,
-private var initializationSteps: Int,
-private var epsilon: Double) extends Serializable with Logging {
-
-  /**
-   * Constructs a KMeans instance with default parameters: {k: 2, 
maxIterations: 20, runs: 1,
-   * initializationMode: "k-means||", initializationSteps: 5, epsilon: 
1e-4}.
-   */
-  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4)
-
-  /** Set the number of clusters to create (k). Default: 2. */
-  def setK(k: Int): this.type = {
--- End diff --

This is a breaking change.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488348
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeansClusterer.scala
 ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.base.FP
+import org.apache.spark.rdd.RDD
+
+
+private[mllib] trait MultiKMeansClusterer[P <: FP, C <: FP] extends 
Serializable with Logging {
+  def cluster(data: RDD[P], centers: Array[Array[C]]): (Double, 
GeneralizedKMeansModel[P, C])
--- End diff --

missing doc





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488358
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
 ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => 
BV}
+import org.apache.spark.mllib.base._
+import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector}
+
+import org.apache.spark.mllib.base.{Centroid, FPoint, PointOps, Infinity, 
Zero}
+
+
--- End diff --

remove extra empty lines





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488316
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,429 +17,57 @@
 
 package org.apache.spark.mllib.clustering
 
-import scala.collection.mutable.ArrayBuffer
 
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
-import org.apache.spark.storage.StorageLevel
-import org.apache.spark.util.random.XORShiftRandom
-
-/**
- * K-means clustering with support for multiple parallel runs and a 
k-means++ like initialization
- * mode (the k-means|| algorithm by Bahmani et al). When multiple 
concurrent runs are requested,
- * they are executed together with joint passes over the data for 
efficiency.
- *
- * This is an iterative algorithm that will make multiple passes over the 
data, so any RDDs given
- * to it should be cached by the user.
- */
-class KMeans private (
-private var k: Int,
-private var maxIterations: Int,
-private var runs: Int,
-private var initializationMode: String,
-private var initializationSteps: Int,
-private var epsilon: Double) extends Serializable with Logging {
-
-  /**
-   * Constructs a KMeans instance with default parameters: {k: 2, 
maxIterations: 20, runs: 1,
-   * initializationMode: "k-means||", initializationSteps: 5, epsilon: 
1e-4}.
-   */
-  def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4)
-
-  /** Set the number of clusters to create (k). Default: 2. */
-  def setK(k: Int): this.type = {
-this.k = k
-this
-  }
-
-  /** Set maximum number of iterations to run. Default: 20. */
-  def setMaxIterations(maxIterations: Int): this.type = {
-this.maxIterations = maxIterations
-this
-  }
-
-  /**
-   * Set the initialization algorithm. This can be either "random" to 
choose random points as
-   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
-   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
-   */
-  def setInitializationMode(initializationMode: String): this.type = {
-if (initializationMode != KMeans.RANDOM && initializationMode != 
KMeans.K_MEANS_PARALLEL) {
-  throw new IllegalArgumentException("Invalid initialization mode: " + 
initializationMode)
-}
-this.initializationMode = initializationMode
-this
-  }
-
-  /**
-   * :: Experimental ::
-   * Set the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
-   * this many times with random starting conditions (configured by the 
initialization mode), then
-   * return the best clustering found over any run. Default: 1.
-   */
-  @Experimental
-  def setRuns(runs: Int): this.type = {
-if (runs <= 0) {
-  throw new IllegalArgumentException("Number of runs must be positive")
-}
-this.runs = runs
-this
-  }
-
-  /**
-   * Set the number of steps for the k-means|| initialization mode. This 
is an advanced
-   * setting -- the default of 5 is almost always enough. Default: 5.
-   */
-  def setInitializationSteps(initializationSteps: Int): this.type = {
-if (initializationSteps <= 0) {
-  throw new IllegalArgumentException("Number of initialization steps 
must be positive")
-}
-this.initializationSteps = initializationSteps
-this
-  }
-
-  /**
-   * Set the distance threshold within which we consider centers to 
have converged.
-   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
-   */
-  def setEpsilon(epsilon: Double): this.type = {
-this.epsilon = epsilon
-this
-  }
-
-  /** Whether a warning should be logged if the input RDD is uncached. */
-  private var warnOnUncachedInput = true
-
-  /** Disable warnings about uncached input. */
-  private[spark] def disableUncachedWarning(): this.type = {
-warnOnUncachedInput = false
-this
-  }  
-
-  /**
-   * Train a K-means model on the given set of points; `data` should be 
cached for high
-   * performance, because this is an iterative algorithm.
-   */
-  def run(data: RDD[Vector]): KMeansModel = {
-
-if (warnOnUncachedInput && 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488342
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeans.scala ---
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.Accumulator
+import org.apache.spark.SparkContext._
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.mllib.base.{Centroid, FP, PointOps, Zero}
+import org.apache.spark.rdd.RDD
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+
+/**
+ * A K-Means clustering implementation that performs multiple K-means 
clusterings simultaneously,
+ * returning the one with the lowest cost.
+ *
+ * We compute whether a center has moved only when we need to.
+ *
+ * We use null to represent empty clusters instead of an Option type to 
save space.
+ *
+ * The resulting clustering may contain fewer than K clusters.
+ */
+
+private[mllib] class MultiKMeans[P <: FP : ClassTag, C <: FP : ClassTag](
+  pointOps: PointOps[P, C], maxIterations: Int) extends 
MultiKMeansClusterer[P, C] {
+
+  def cluster(data: RDD[P], centers: Array[Array[C]]): (Double, 
GeneralizedKMeansModel[P, C]) = {
+val runs = centers.length
+val active = Array.fill(runs)(true)
+val costs = Array.fill(runs)(Zero)
+var activeRuns = new ArrayBuffer[Int] ++ (0 until runs)
+var iteration = 0
+
+/*
+ * Execute iterations of Lloyd's algorithm until all runs have 
converged.
+ */
+
+while (iteration < maxIterations && activeRuns.nonEmpty) {
+  // remove the empty clusters
+  log.info("iteration {}", iteration)
+
+  val activeCenters = activeRuns.map(r => centers(r)).toArray
+
+  if (log.isInfoEnabled) {
+for (r <- 0 until activeCenters.length)
+  log.info("run {} has {} centers", activeRuns(r), 
activeCenters(r).length)
+  }
+
+  // Find the sum and count of points mapping to each center
+  val (centroids, runDistortion) = getCentroids(data, activeCenters)
+
+  if (log.isInfoEnabled) {
+for (run <- activeRuns) log.info("run {} distortion {}", run, 
runDistortion(run))
+  }
+
+  for (run <- activeRuns) active(run) = false
+
+  for (((runIndex: Int, clusterIndex: Int), cn: Centroid) <- 
centroids) {
+val run = activeRuns(runIndex)
+if (cn.isEmpty) {
+  active(run) = true
+  centers(run)(clusterIndex) = null.asInstanceOf[C]
+} else {
+  val centroid = pointOps.centroidToPoint(cn)
+  active(run) = active(run) || pointOps.centerMoved(centroid, 
centers(run)(clusterIndex))
+  centers(run)(clusterIndex) = pointOps.pointToCenter(centroid)
+}
+  }
+
+  // filter out null centers
+  for (r <- activeRuns) centers(r) = centers(r).filter(_ != null)
+
+  // update distortions and print log message if run completed during 
this iteration
+  for ((run, runIndex) <- activeRuns.zipWithIndex) {
+costs(run) = runDistortion(runIndex)
+if (!active(run)) log.info("run {} finished in {} iterations", 
run, iteration + 1)
+  }
+  activeRuns = activeRuns.filter(active(_))
+  iteration += 1
+}
+
+val best = costs.zipWithIndex.min._2
+(costs(best), new GeneralizedKMeansModel(pointOps, centers(best)))
+  }
+
+  def getCentroids(data: RDD[P], activeCenters: Array[Array[C]]) = {
+val runDistortion = activeCenters.map(_ => 
data.sparkContext.accumulator(Zero))
+val bcActiveCenters = data.sparkContext.broadcast(activeCenters)
+val ops = pointOps
+val result = data.mapPartitions { points =>
+  val bcCenters = bcActiveCenters.value
+  val centers = bcCenters.map {
+_.map { _ => 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488359
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import org.apache.spark.mllib.linalg.Vector
+import breeze.linalg.{Vector => BV}
+import org.apache.spark.rdd.RDD
+
+package object base {
+
+  val Zero = 0.0
+  val One = 1.0
+  val Infinity = Double.MaxValue
+  val Unknown = -1.0
+
+  private[mllib] trait FP extends Serializable {
+val weight: Double
+val raw: BV[Double]
+  }
+
+  private[mllib] class FPoint(val raw: BV[Double], val weight: Double) 
extends FP {
+override def toString: String = weight + "," + (raw.toArray mkString 
",")
+lazy val inh = (raw :* (1.0 / weight)).toArray
+  }
+
+  /**
+   * A mutable point in homogeneous coordinates that is lazily initialized.
+   */
+  private[mllib] class Centroid extends Serializable {
+override def toString: String = weight +
+  (if (raw != null) ("," + raw.toArray mkString ",") else "")
+
+def isEmpty = weight == Zero
+
+var raw: BV[Double] = null
+
+var weight: Double = Zero
+
+def add(p: Centroid): this.type = add(p.raw, p.weight)
+
+def add(p: FP): this.type = add(p.raw, p.weight)
+
+def sub(p: Centroid): this.type = sub(p.raw, p.weight)
+
+def sub(p: FP): this.type = sub(p.raw, p.weight)
+
+def sub(r: BV[Double], w: Double): this.type = {
+  if (r != null) {
+if (raw == null) {
+  raw = r.toVector :*= -1.0
+  weight = w * -1
+} else {
+  raw -= r
+  weight = weight - w
+}
+  }
+  this
+}
+
+def add(r: BV[Double], w: Double) : this.type = {
+  if (r != null) {
+if (raw == null) {
+  raw = r.toVector
+  weight = w
+} else {
+  raw += r
+  weight = weight + w
+}
+  }
+  this
+}
+  }
+
+  private[mllib] trait PointOps[P <: FP, C <: FP] {
+
+def distance(p: P, c: C, upperBound: Double): Double
--- End diff --

missing doc





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2634#discussion_r18488352
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/EuclideanOps.scala
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering.metrics
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => 
BV}
+import org.apache.spark.mllib.base.{Centroid, FPoint, PointOps, Infinity, 
Zero}
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}
+
+
+/**
+ * Euclidean distance measure
+ *
+ * This is the slow implementation of the squared Euclidean distance 
function,
+ * shown here simply for clarity.
+ */
+class EuclideanOps extends PointOps[FPoint, FPoint] with Serializable {
+
+  type C = FPoint
+  type P = FPoint
+
+  val epsilon = 1e-4
+
+  def distance(p: P, c: C, upperBound: Double = Infinity): Double = {
+val d = p.inh.zip(c.inh).foldLeft(Zero) {
+  case (d: Double, (a: Double, b: Double)) => d + (a - b) * (a - b)
+}
+if (d < Zero) Zero else d
--- End diff --

`if (d < Zero)` (space after `if` and no space after `(`)





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58106056
  
@derrickburns I marked a few style problems (not all of them). There are 
breaking changes in your PR, which we should avoid as much as possible. Even if 
we want to remove some methods, we should deprecate them first and remove them in 
a later release.

For the serialization problem, I'm not sure whether I'll have time to look 
into it this week. It would be nice to split this into small PRs, so we can 
trace down the problem faster.
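
As a sketch of that deprecation path (the `@deprecated` annotation is standard 
Scala; the class, method, message, and version string here are illustrative 
only, not the actual API):

    object DeprecationSketch {
      class ClustererConfig {
        private var k: Int = 2

        // Keep the old setter for at least one release so downstream callers
        // get a compiler warning instead of a hard break.
        @deprecated("use the replacement configuration API instead", "1.2.0")
        def setK(value: Int): this.type = {
          this.k = value
          this
        }
      }
    }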





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58106085
  
I know exactly which closure is exceeding the size limitation. The problem
is that I cannot see how to make the closure capture less data!

On Mon, Oct 6, 2014 at 2:42 PM, Xiangrui Meng notificati...@github.com
wrote:

 @derrickburns https://github.com/derrickburns The style test doesn't
 capture all, unfortunately. The Spark Code Style Guide is the first place
 to check. I will mark a few examples inline.

 I think the best way to trace down the problem is to split this PR into
 small ones and see whether we can find it. Is there a good way to split
 this? Thanks!

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-58104576.






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58106509
  
@mengxr

Is there an IntelliJ or Eclipse configuration that I can use to reformat
the code according to the guidelines?

The breaking change in the PR is for a private constructor that is,
therefore, not used outside of Spark.  Do these need to be deprecated first
as well?

On Mon, Oct 6, 2014 at 2:53 PM, Xiangrui Meng notificati...@github.com
wrote:

 @derrickburns https://github.com/derrickburns I marked a few style
 problems (not all of them). There are breaking changes in your PR, which 
we
 should avoid as much as possible. Even we want to remove some methods, we
 should deprecate them first and remove them in a later release.

 For the serialization problem, I'm not sure whether I have time looking
 into it this week. It would be nice to split this into small PRs, so we 
can
 trace down the problem faster.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2634#issuecomment-58106056.






[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58120070
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21355/consoleFull)
 for   PR 2634 at commit 
[`c603079`](https://github.com/apache/spark/commit/c6030791406238430548634677a4f9d4c59aa93a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58122649
  
@derrickburns I don't know any formatter that can do the job nicely. This 
has to be done by hand at this moment, unfortunately.

`KMeans` has a public constructor that takes no arguments besides the 
default constructor, and users can then use setters to set parameters. So they 
are public APIs.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58122976
  
Ah, I see. Will fix. 

Sent from my iPhone

 On Oct 6, 2014, at 5:55 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 @derrickburns I don't know any formatter that can do the job nicely. This 
has to be done by hand at this moment, unfortunately.
 
 KMeans has a public constructor that takes no argument besides the 
default constructor and then users can use setters to set parameters. So they 
are public APIs.
 
 —
 Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58123795
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21355/consoleFull)
 for   PR 2634 at commit 
[`c603079`](https://github.com/apache/spark/commit/c6030791406238430548634677a4f9d4c59aa93a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58123798
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21355/
Test FAILed.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58126789
  
@mengxr I restored the KMeans class and public methods on that class. I did 
not mark them for deprecation.

I also re-formatted the code to follow the Spark style guide. I hope that I 
caught all the cases.  I used the Scalariform plugin for IntelliJ, with some 
manual adjustments to handle long formal parameter lists. 

The closure data capture problem occurs at MultiKMeans.scala:105. Thx.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58127303
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21358/consoleFull)
 for   PR 2634 at commit 
[`e2b6c02`](https://github.com/apache/spark/commit/e2b6c02f0f6ed715c7c72a1bd2db217a93deba3c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58127496
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21357/
Test FAILed.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58130332
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21358/
Test FAILed.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-58130318
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21358/consoleFull)
 for   PR 2634 at commit 
[`e2b6c02`](https://github.com/apache/spark/commit/e2b6c02f0f6ed715c7c72a1bd2db217a93deba3c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57841035
  
@derrickburns The `*ClusterSuite` was created to prevent pulling 
unnecessary objects into the task closure. You can try to remove `Serializable` 
from algorithms. While the models are serializable, the algorithm instances 
should stay on the driver node. If you want to use a member method in a task 
closure, either make it static or define it as a local method. If you want to 
use a member variable, assign it to a `val` first.
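
For example, a minimal sketch of that pattern (the class and method names are 
illustrative, not from the PR):

    import org.apache.spark.rdd.RDD

    class Scaler(val factor: Double) {
      // Risky: using the member directly captures `this`, so the whole
      // Scaler instance is serialized into the task closure.
      def scaleCapturingThis(data: RDD[Double]): RDD[Double] =
        data.map(x => x * factor)

      // Safer: copy the member to a local val first; only the Double is
      // captured and shipped to the executors.
      def scaleWithLocalVal(data: RDD[Double]): RDD[Double] = {
        val f = factor
        data.map(x => x * f)
      }
    }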

This is something we can try. Avoiding serializing unnecessary objects is a 
good practice, but I'm not sure whether it is worth the effort.

Btw, could you update your PR following the 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide ? 
Thanks!





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-03 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57894117
  
I ran the style tests. They pass. Is there something else in the style guide 
that is not captured in the tests?

I have expended much effort to avoid serializing unnecessary objects. I'm 
still perplexed as to why so much data is being captured in the closure and the test 
fails. 

Anyway, what are the next steps? Omit the test and approve the PR? Ask 
someone to help fix the code to avoid the unit test failure?

Thx !

Sent from my iPhone

 On Oct 3, 2014, at 12:17 PM, Xiangrui Meng notificati...@github.com 
wrote:
 
 @derrickburns The *ClusterSuite was created to prevent referencing 
unnecessary objects into the task closure. You can try to remove Serializable 
from algorithms. While the models are serializable, the algorithm instances 
should stay on the driver node. If you want to use a member method in a task 
closure, either make it static or define it as a local method. If you want to 
use a member variable, assign it to a val first.
 
 This is something we can try. Avoiding serializing unnecessary objects is 
a good practice, but I'm not sure whether it is worth the effort.
 
 Btw, could you update your PR following the 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide ? 
Thanks!
 
 —
 Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57679766
  
@derrickburns Side note: It looks like a lot of [extraneous 
commits](https://github.com/apache/spark/pull/2419#commits-pushed-471d5f2) got 
pulled into this PR by mistake. Did you rebase your PR correctly? As it stands, 
this PR does not merge cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57691032
  
Uh oh.

Sent from my iPhone

 On Oct 2, 2014, at 11:35 AM, Nicholas Chammas notificati...@github.com 
wrote:
 
 @derrickburns Side note: It looks like a lot of extraneous commits got 
pulled into this PR by mistake. Did you rebase your PR correctly? As it stands, 
this PR does not merge cleanly.
 
 —
 Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread derrickburns
Github user derrickburns closed the pull request at:

https://github.com/apache/spark/pull/2419





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57696626
  
I will close this pull request and create another.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread derrickburns
GitHub user derrickburns opened a pull request:

https://github.com/apache/spark/pull/2634

[SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means 
Clusterer

This commit introduces a general distance function trait, `PointOps`, for 
the Spark K-Means clusterer. There are no public API changes*.  

### Issue - Data Capture Test Fails - NEED HELP

The `org.apache.spark.mllib.clustering.KMeansClusterSuite` test "task size 
should be small in both training and prediction" fails, suggesting that the RDD 
data is being captured in a closure.  This is quite puzzling. My efforts to 
solve this problem have failed. I *need help* to solve this problem. 

### Distance Function Trait

The `PointOps` trait defines the distance function.  The `PointOps` trait 
is more than a simple distance function.  It also defines the types of Points 
and Centers for the clusterer.  Standard MLLIB `Vector`s are converted into 
Points and Centers.  In the case of the `FastEuclideanOps` implementation of 
`PointOps`, the Point and Center types include vector norm members. In other 
distance functions such as the [Kullback-Leibler distance 
function](http://en.wikipedia.org/wiki/Bregman_divergence), the Point and 
Center types include different values that speed up the distance calculation in 
a similar way that caching vector norms speeds up the Euclidean distance 
function.  This addresses SPARK-3219.
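
A simplified sketch of the norm-caching idea (not the PR's actual trait; it 
relies on the identity ||p - c||^2 = ||p||^2 + ||c||^2 - 2 * (p dot c), so each 
vector's squared norm is computed once and reused in every distance call):

    // Each vector's squared norm is computed once, when it is wrapped.
    case class NormedPoint(v: Array[Double], norm2: Double)

    object NormedPoint {
      def apply(v: Array[Double]): NormedPoint =
        NormedPoint(v, v.map(x => x * x).sum)
    }

    object FastSquaredDistance {
      // Only the dot product is computed per call; the norms come from the
      // cache. The result is clamped at zero to absorb floating-point error.
      def apply(p: NormedPoint, c: NormedPoint): Double = {
        var dot = 0.0
        var i = 0
        while (i < p.v.length) { dot += p.v(i) * c.v(i); i += 1 }
        math.max(0.0, p.norm2 + c.norm2 - 2.0 * dot)
      }
    }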

### Refactoring

To understand this original code, I found it useful to refactor the 
original implementation into components.  You may find it helpful to understand 
this pull request by looking at the new components and comparing them to their 
original implementation.  Unfortunately, GitHub diff does not help very much 
with this.

This commit splits up the clusterer into a number of components which 
behave (largely) like their predecessors. `KMeansParallel` implements the 
K-Means || initialization algorithm.  `KMeansRandom` implements the K-Means 
Random initialization algorithm.  `MultiKMeans` implements the K-Means 
algorithm on multiple sets of cluster centers using a given distance function.  
Traits for the initializer, `KMeansInitializer`, and the general K-Means 
clusterer, `MultiKMeansClusterer`, are provided to highlight the salient 
interfaces with the intent that alternative implementations of these interfaces 
may be provided in the future.

### Performance

This pull request is not focused on performance. Nevertheless,  the 
performance of the KMeans++ implementation was *dramatically* improved by NOT 
recomputing distances to cluster centers that were present in previous steps.  
This turns a quadratic implementation into a linear one. 
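
A sketch of the incremental update (names illustrative): each point keeps its 
best distance so far, and when a new center is chosen only one comparison per 
point is needed, so each round costs O(n) distance evaluations rather than 
O(n * centersChosenSoFar):

    object SeedingSketch {
      def updateDistances(
          points: Array[Array[Double]],
          bestDist: Array[Double],
          newCenter: Array[Double],
          dist: (Array[Double], Array[Double]) => Double): Unit = {
        var i = 0
        while (i < points.length) {
          // Only the newly added center can improve a point's best distance.
          val d = dist(points(i), newCenter)
          if (d < bestDist(i)) bestDist(i) = d
          i += 1
        }
      }
    }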

Second, the KMeans++ implementation uses the general K-Means clusterer in 
the final step.  This parallelizes a step that was sequential.

Together, these changes address SPARK-3424.

### Next Steps

This pull request does not introduce new user-visible changes.  The next 
step is to make different distance functions available through a user-visible 
API.   I will provide other distance functions after this pull request has been 
accepted. Then, we can decide on an appropriate user-level API to access those 
functions.

### Compatibility

While there are no user-level API changes, the behavior of the clusterer is 
*different* on some tests.  Specifically, the handling of empty clusters has 
changed.  Empty clusters are not filled with random points in this 
implementation.  The former behavior is undesirable for a number a reasons, not 
the least of which is that there is no reasonable use for duplicate cluster 
centers. To accommodate the change in behavior, the test cases were changed 
accordingly. This addresses SPARK-3261.

The private K-Means constructor which was used by some test Java code and 
one example was replaced with a Scala constructor that is not Java friendly.  
Since the constructor was not user visible, I simply changed the Java test code 
and the example to use the higher level interface.

### Testing

This code has been tested (albeit while packaged outside of Spark) and 
performance measured on data sets of millions of features each with hundreds of 
dimensions and on tens of thousands of clusters.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/derrickburns/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2634.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2634


commit fbfdcd8946d8681c062ba08d71eac961db50417d
Author: Derrick Burns derrickrbu...@gmail.com
Date:   2014-10-02T20:01:18Z

This commit introduces a general distance function trait, PointOps, 

[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57698333
  
@mengxr  I created a new clean pull request.  I *still need help* to 
understand/fix a closure that is capturing too much data.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57698619
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21220/consoleFull)
 for   PR 2634 at commit 
[`fbfdcd8`](https://github.com/apache/spark/commit/fbfdcd8946d8681c062ba08d71eac961db50417d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57708661
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21220/consoleFull)
 for   PR 2634 at commit 
[`fbfdcd8`](https://github.com/apache/spark/commit/fbfdcd8946d8681c062ba08d71eac961db50417d).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57708679
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21220/





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-10-02 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2634#issuecomment-57714029
  
@mengxr

This is as expected. I need help in solving the closure data capture 
problem.

Sent from my iPhone

 On Oct 2, 2014, at 2:10 PM, Apache Spark QA notificati...@github.com 
wrote:
 
 QA tests have finished for PR 2634 at commit fbfdcd8.
 
 This patch fails unit tests.
 This patch merges cleanly.
 This patch adds no public classes.
 —
 Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57403972
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21076/consoleFull)
 for   PR 2419 at commit 
[`48a31dd`](https://github.com/apache/spark/commit/48a31dd36e1defa7c4279548579bc1874820a5ae).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404043
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21076/consoleFull)
 for   PR 2419 at commit 
[`48a31dd`](https://github.com/apache/spark/commit/48a31dd36e1defa7c4279548579bc1874820a5ae).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404045
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21076/





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404291
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21077/consoleFull)
 for   PR 2419 at commit 
[`ebdc853`](https://github.com/apache/spark/commit/ebdc853de33373cdd92f5401288d89ab6e879741).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404357
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21077/





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404356
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21077/consoleFull) for PR 2419 at commit [`ebdc853`](https://github.com/apache/spark/commit/ebdc853de33373cdd92f5401288d89ab6e879741).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404968
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21078/consoleFull) for PR 2419 at commit [`d6f7c66`](https://github.com/apache/spark/commit/d6f7c66dae95a0b6967637d1132bb7327ab41d8c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57404998
  
@mengxr
1. I fixed the merge issue and also remerged to pick up more recent changes.
2. I did as you suggested and introduced a local variable to hold the value used in the closure. *Unfortunately, the test still fails. Since every value used in the closure is now a local variable, I do not know what to try next.*
3. The best fix for the test that selects two random centers is the one I implemented: it passes the test deterministically while still working with any random seed.
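
A minimal sketch of the local-variable pattern described in point 2, using hypothetical names (`KMeansLike`, `assignCosts`): when a closure reads a field through `this`, Spark tries to serialize the whole enclosing object, so the field is copied into a local val first.

```scala
import org.apache.spark.rdd.RDD

class KMeansLike(val k: Int) {
  // Referencing `k` directly inside the closure would capture `this`,
  // and Spark would then try to serialize the entire KMeansLike instance.
  def assignCosts(points: RDD[Array[Double]]): RDD[Double] = {
    val localK = k // local copy: the closure captures only this val
    points.map(p => p.take(localK).sum) // placeholder per-point computation
  }
}
```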





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57408391
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21078/consoleFull) for PR 2419 at commit [`d6f7c66`](https://github.com/apache/spark/commit/d6f7c66dae95a0b6967637d1132bb7327ab41d8c).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57408395
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21078/





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2419#discussion_r18184806
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -17,16 +17,13 @@
 
 package org.apache.spark.mllib.clustering
 
-import scala.collection.mutable.ArrayBuffer
 
-import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
-
-import org.apache.spark.annotation.Experimental
-import org.apache.spark.Logging
-import org.apache.spark.SparkContext._
-import org.apache.spark.mllib.linalg.{Vector, Vectors}
-import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.mllib.base.{FP, PointOps}
+import org.apache.spark.mllib.clustering.metrics.FastEuclideanOps
 import org.apache.spark.rdd.RDD
+<<<<<<< HEAD
--- End diff --

not merged cleanly





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57233504
  
@derrickburns Could you merge the PR cleanly? For the closure serialization, just go with the spores style: assign the class member to a val and reference the val in the closure. Creating a new class with a single object is not necessary.

For randomization, we should try to make the algorithm deterministic given a random seed. Then we can fix the random seed in tests.
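
A sketch of that seeding convention under assumed names (`SeededInit` and `initialCenters` are hypothetical): all per-run randomness is derived from one caller-supplied seed, so a test that pins the seed selects the same centers every time.

```scala
import scala.util.Random

object SeededInit {
  // Hypothetical helper: derive one RNG per run from a single seed so that
  // center selection is reproducible whenever the caller fixes the seed.
  def initialCenters(data: Seq[Array[Double]], k: Int, runs: Int,
      seed: Long): Seq[Seq[Array[Double]]] =
    Seq.tabulate(runs) { run =>
      val rng = new Random(seed + run) // deterministic per-run stream
      rng.shuffle(data).take(k)
    }
}

// In a test, pin the seed and the selected centers repeat exactly:
// val centers = SeededInit.initialCenters(points, k = 2, runs = 1, seed = 42L)
```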





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-29 Thread derrickburns
Github user derrickburns commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57238377
  
Will do

Sent from my iPhone

 On Sep 29, 2014, at 2:32 PM, Xiangrui Meng notificati...@github.com wrote:
 
 @derrickburns Could you merge the PR cleanly? For the closure serialization, just go with the spores style: assign the class member to a val and reference the val in the closure. Creating a new class with a single object is not necessary.
 
 For randomization, we should try to make the algorithm deterministic given a random seed. Then we can fix the random seed in tests.
 
 —
 Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-3218, SPARK-3219, SPARK-3261, SPARK-342...

2014-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2419#issuecomment-57012356
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20871/




