[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-03-19 Thread yu-iskw
Github user yu-iskw closed the pull request at:

https://github.com/apache/spark/pull/2906


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-03-19 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-83757019
  
I've spoken with @freeman-lab. I am going to send a new PR after replacing 
the algorithm with the new one and adding wrapper classes for the ml package. 





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-03-11 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-78214203
  
@freeman-lab, @srowen, I apologize for the delay in replying. I will modify 
the code ASAP.
I also have a question about the implementation. I think the current 
implementation is very slow, and it is difficult for it to handle a large 
number of clusters as an argument. So I tried to implement a new one that is 
more scalable and faster. The new one is 1000 times faster than the current one.

https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark

Should we continue with this PR, or replace the current implementation with 
the new one? Thanks!





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-69192847
  
@freeman-lab @srowen @mengxr many thanks!





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22641414
  
--- Diff: 
mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering;
+
+import com.google.common.collect.Lists;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.Serializable;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+
+public class JavaHierarchicalClusteringSuite implements Serializable {
+  private transient JavaSparkContext sc;
+
+  @Before
+  public void setUp() {
+    sc = new JavaSparkContext("local", "JavaHierarchicalClustering");
+  }
+
+  @After
+  public void tearDown() {
+    sc.stop();
+    sc = null;
+  }
+
+  @Test
+  public void runHierarchicalClusteringConstructor() {
+    List<Vector> points = Lists.newArrayList(
+      Vectors.dense(1.0, 2.0, 6.0),
+      Vectors.dense(1.0, 3.0, 0.0),
+      Vectors.dense(1.0, 4.0, 6.0)
+    );
+    Vector expectedCenter = Vectors.dense(1.0, 3.0, 4.0);
+
+    JavaRDD<Vector> data = sc.parallelize(points, 2);
+    HierarchicalClusteringModel model = HierarchicalClustering.train(data.rdd(), 1);
+    assertEquals(1, model.getCenters().length);
+    assertEquals(expectedCenter, model.getCenters()[0]);
+  }
+
+  @Test
+  public void predictJavaRDD() {
+    List<Vector> points = Lists.newArrayList(
+      Vectors.dense(1.0, 2.0, 6.0),
+      Vectors.dense(1.0, 3.0, 0.0),
+      Vectors.dense(1.0, 4.0, 6.0)
+    );
+    JavaRDD<Vector> data = sc.parallelize(points, 2);
+    HierarchicalClustering algo = new HierarchicalClustering().setNumClusters(1);
+    HierarchicalClusteringModel model = algo.run(data.rdd());
+    JavaRDD<Integer> predictions = model.predict(data);
+    // Should be able to get the first prediction.
+    predictions.first();
--- End diff --

Could this assert what the first prediction is? Or is it not stable enough to 
test for reliably?
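
One way to read this suggestion: the test calls `setNumClusters(1)`, so there is only one cluster and every prediction should be the same index, which would make the assertion deterministic. The plain-Python sketch below illustrates that reasoning only. `predict` is a hypothetical nearest-center helper, not the MLlib method, and the sketch assumes cluster indices start at 0.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(centers, point):
    """Return the index of the center closest to the point (hypothetical helper)."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))


# Same three points as in the Java test above.
points = [(1.0, 2.0, 6.0), (1.0, 3.0, 0.0), (1.0, 4.0, 6.0)]

# With a single cluster, the only center is the mean of all points.
center = tuple(sum(column) / len(points) for column in zip(*points))
centers = [center]

predictions = [predict(centers, p) for p in points]

# With one cluster there is only one possible index, so this cannot be flaky.
assert predictions[0] == 0
```

With more than one cluster the index assigned to a given point may depend on initialization, so an exact assertion would probably need a fixed random seed.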





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22641364
  
--- Diff: 
mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java
 ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering;
+
+import com.google.common.collect.Lists;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.Serializable;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+
+public class JavaHierarchicalClusteringSuite implements Serializable {
+  private transient JavaSparkContext sc;
--- End diff --

This is not a comment on this PR per se, but this whole `implements 
Serializable` and `transient JavaSparkContext` thing is an anti-pattern I wish 
weren't used, even in the tests.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22641250
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/mllib/JavaHierarchicalClustering.java
 ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib;
+
+import org.apache.spark.SparkConf;
--- End diff --

These look correctly ordered in the sense that package `a.b.c` sorts 
entirely before `a.b.c.d`
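
The ordering rule described here can be sketched as a sort key that compares the package path component by component before the class name. This is only an illustration of the rule as stated in the comment, not Spark's style checker; `import_sort_key` is a hypothetical helper, and the import names are taken from the Java example in this PR.

```python
def import_sort_key(qualified_name):
    """Sort by package components first, then by class name."""
    *package, class_name = qualified_name.split(".")
    return (package, class_name)


imports = [
    "org.apache.spark.api.java.JavaRDD",
    "org.apache.spark.SparkConf",
    "org.apache.spark.api.java.JavaSparkContext",
]

# A list that is a prefix of another compares as smaller, so everything in
# package org.apache.spark sorts before anything in org.apache.spark.api.java.
ordered = sorted(imports, key=import_sort_key)
for name in ordered:
    print(name)
```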





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22641223
  
--- Diff: docs/mllib-clustering.md ---
@@ -154,6 +156,175 @@ section of the Spark
 Quick Start guide. Be sure to also include *spark-mllib* to your build file as
 a dependency.
 
+
+### Hierarchical Clustering
+
+MLlib supports
+[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types.
+One is agglomerative clustering, a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+The other is divisive clustering, a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+The MLlib implementation only includes a divisive hierarchical clustering algorithm.
+
+The implementation in MLlib has the following parameters:
+
+* *k* is the maximum number of desired clusters.
+* *subIterations* is the maximum number of iterations to split a cluster into its 2 sub clusters.
+* *numRetries* is the maximum number of retries if a split doesn't work as expected.
+* *epsilon* is the saturation threshold at which a split is considered to have converged.
+
+
+
+### Hierarchical Clustering Example
+
+
+
+
+The following code snippets can be executed in `spark-shell`.
+
+In the following example, after loading and parsing data,
+we use the hierarchical clustering object to cluster the sample data into three clusters.
+The number of desired clusters is passed to the algorithm.
+However, the clustering stops, even if the number of clusters is still less than *k*,
+when the clusters cannot be split any further.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.HierarchicalClustering
+import org.apache.spark.mllib.linalg.Vectors
+
+// Load and parse the data
+val data = sc.textFile("data/mllib/sample_hierarchical_data.csv")
+val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()
+
+// Cluster the data into three classes using HierarchicalClustering object
+val numClusters = 10
+val model = HierarchicalClustering.train(parsedData, numClusters)
+println(s"# Clusters: ${model.getClusters().size}")
+
+// Show the cluster centers
+model.getCenters.foreach(println)
+
+// Evaluate clustering by computing the sum of variance of the clusters
+val variance = model.getClusters.map(_.getVariance.get).sum
+println(s"Sum of Variance of the Clusters = ${variance}")
+
+// Cut the cluster tree by height
+val cut_model = model.cut(4.0)
+println(s"# Clusters: ${cut_model.getClusters().size}")
+val variance = cut_model.getClusters.map(_.getVariance.get).sum
+println(s"Sum of Variance of the Clusters = ${variance}")
+{% endhighlight %}
+
+
+
+All of MLlib's methods use Java-friendly types, so you can import and call them there the same
+way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
+Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
+calling `.rdd()` on your `JavaRDD` object. A self-contained application example
+that is equivalent to the provided example in Scala is given below:
+
+{% highlight java %}
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.HierarchicalClustering;
+import org.apache.spark.mllib.clustering.HierarchicalClusteringModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+public class JavaHierarchicalClustering {
--- End diff --

The other example code I see forgoes a lot of the boilerplate here, such as 
declaring a class, a main method, System.out calls, etc. The indentation here 
is also significantly deeper than the 2-space indent in the code. Addressing 
these might make it easier to scan as an example on the web page.



[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22640591
  
--- Diff: data/mllib/sample_hierarchical_data.csv ---
@@ -0,0 +1,150 @@
+5.1,3.5,1.4,0.2
--- End diff --

Good point =) Leave as is then. Maybe at some point we should give all the 
vector-valued example data sets the same format / file type just for 
consistency, but that can be a separate PR.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-08 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22640249
  
--- Diff: data/mllib/sample_hierarchical_data.csv ---
@@ -0,0 +1,150 @@
+5.1,3.5,1.4,0.2
--- End diff --

Minor point - this wouldn't really be CSV though. I imagine the example 
shows parsing a common encoding like this on purpose.
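
For reference, the parsing step that the documentation example relies on amounts to splitting each line on commas and converting the fields to numbers. A minimal Python sketch of that step, using the first line of the sample file shown in the diff above; `parse_point` is a hypothetical helper name, not something from the PR.

```python
def parse_point(line):
    """Split a comma-separated line into a list of floats."""
    return [float(field) for field in line.strip().split(",")]


# First line of data/mllib/sample_hierarchical_data.csv as quoted in the diff.
point = parse_point("5.1,3.5,1.4,0.2")
print(point)
```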





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22635337
  
--- Diff: examples/src/main/python/mllib/hierarchical_clustering.py ---
@@ -0,0 +1,84 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+A hierarchical clustering program using MLlib.
+
+This example requires NumPy, SciPy and matplotlib.
+"""
+
+import os
+import sys
+
+from numpy import array
+import matplotlib.pyplot as plt
--- End diff --

We should be careful about adding any dependency, even in an example. Here, 
I'd like to make it optional and, if matplotlib and scipy are not installed, 
tell the user to install them to get a better experience.
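
A common way to make such a dependency optional is to attempt the import and fall back to a hint when it fails. The sketch below shows that pattern under stated assumptions: `optional_import` and `describe_plot_support` are hypothetical helpers, not code from the PR, and the example script may have handled this differently.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def describe_plot_support(plotting_module):
    """Tell the user whether the optional plotting step will run."""
    if plotting_module is None:
        return "matplotlib is not installed; install it to get a plot of the clusters"
    return "matplotlib found; the clusters will be plotted"


# The clustering itself would run either way; only the plot is optional.
plt = optional_import("matplotlib.pyplot")
print(describe_plot_support(plt))
```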





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-69134341
  
Hi @yu-iskw and @rnowling, I've spent time reviewing the code and using it 
in both Python and Scala. Overall great work, terrific to see my little gist 
turned into something so refined and performant! =) I left lots of comments, 
most minor, though documenting the caching behavior seems quite important.

The one significant addition I'd suggest is exposing another model output: 
a list of the centers at all nodes in the learned tree. This would be in 
addition to just the centers of the leaves, which are currently returned by 
`getCenters` (or `clusterCenters` in Python). Maybe call it `getTreeCenters`. 
It's basically given by `model.clusterTree.toSeq().map(_.center)`. But we 
should make sure it's sorted so that it can be indexed using the values from 
the merge list. In other words, if `Z` is the merge list, and row i indicates 
that `Z[i,0]` and `Z[i,1]` were merged, we want to be able to get the centers 
associated with those nodes by calling, for example, 
`model.treeCenters[Z[i,0]]` and `model.treeCenters[Z[i,1]]`. What do you think?
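
To make the indexing proposal concrete, here is a minimal Python sketch. It assumes a SciPy-style merge list, in which each row names the two node indices that were merged and row i creates node `n_leaves + i`, and a center list ordered so those indices apply directly. The names `tree_centers`, `merge_list`, and `centers_for_merge` and all the data are hypothetical; none of this is part of the PR.

```python
# Hypothetical tree: three leaves (indices 0-2) and two internal nodes (3-4).
# tree_centers[i] is the center of node i, ordered to match the merge list.
tree_centers = [
    [0.0, 0.0],   # leaf 0
    [2.0, 0.0],   # leaf 1
    [10.0, 0.0],  # leaf 2
    [1.0, 0.0],   # node 3, formed by merging nodes 0 and 1
    [4.0, 0.0],   # node 4 (the root), formed by merging nodes 3 and 2
]

# Each row (a, b) says that nodes a and b were merged.
merge_list = [(0, 1), (3, 2)]


def centers_for_merge(row):
    """Return the centers of the two nodes merged in the given row."""
    a, b = merge_list[row]
    return tree_centers[a], tree_centers[b]


left, right = centers_for_merge(1)
print(left, right)
```

The point of sorting the centers this way is that no separate lookup table is needed: any index that appears in the merge list is also a valid index into the center list.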





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634895
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+    this.numClusters = numClusters
+    this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+    this.numRetries = numRetries
+    this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+    this.subIterations = subIterations
+    this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+    this.epsilon = epsilon
+    this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+    this.randomSeed = seed
+    this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+    this.randomRange = range
+    this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+    Array(
+      s"numClusters:${numClusters}",
+      s"subIterations:${subIterations}",
+      s"numRetries:${numRetries}",
+      s"epsilon:${epsilon}",
+      s"randomSeed:${randomSeed}",
+      s"randomRange:${randomRange}"
+    ).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+    validateData(data)
+    logInfo(s"Run with ${this}")
+
+    val startTime = System.currentTimeMillis() // to measure the execution time
+    val clusterTree = ClusterTree.fromRDD(data) // make the root node
+    val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634887
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634890
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => 
BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the maximum number of desired clusters
+ * @param subIterations the maximum number of iterations used to split a cluster into its two sub-clusters
+ * @param numRetries the maximum number of retries if a split does not work as expected
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed the seed used when sampling data to initialize the centers in each sub-iteration
+ * @param randomRange the range coefficient used to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634865
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the maximum number of desired clusters
+ * @param subIterations the maximum number of iterations used to split a cluster into its two sub-clusters
+ * @param numRetries the maximum number of retries if a split does not work as expected
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed the seed used when sampling data to initialize the centers in each sub-iteration
+ * @param randomRange the range coefficient used to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634802
  
--- Diff: examples/src/main/python/mllib/hierarchical_clustering.py ---
@@ -0,0 +1,84 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+A hierarchical clustering program using MLlib.
+
+This example requires NumPy, SciPy and matplotlib.
+"""
+
+import os
+import sys
+
+from numpy import array
+import matplotlib.pyplot as plt
--- End diff --

I love that you've made it so easy to visualize the output, but this now
adds a "dependency" on matplotlib, which isn't used anywhere else in PySpark
AFAIK. Strictly speaking, because PySpark doesn't currently use formal package
management (e.g. through PyPI), this isn't really adding a dependency, and it's
just an example. But it might be safer to just include a one-line note showing
how the output can be visualized with matplotlib. Curious what others think.
cc @davies
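
A minimal sketch of what such a note could look like, assuming the merge list has already been collected to the driver as rows of (node1, node2, distance, size). The sample rows, the function name `save_dendrogram`, and the output path are made up for illustration. SciPy and matplotlib are imported lazily, so nothing else in the example depends on them.

```python
import os
import tempfile

# Hypothetical merge list for three leaf clusters (ids 0..2).
# Row i creates node 3 + i: (child, child, distance, leaves under the node).
SAMPLE_MERGE_LIST = [
    (0, 1, 1.0, 2),
    (2, 3, 4.0, 3),
]


def save_dendrogram(merge_list, path):
    """Draw the merge list as a dendrogram and save it to `path`.

    Returns False when the optional plotting libraries are not installed.
    """
    try:
        import numpy as np
        import matplotlib
        matplotlib.use("Agg")  # render to a file, no display needed
        import matplotlib.pyplot as plt
        from scipy.cluster.hierarchy import dendrogram
    except ImportError:
        return False

    linkage = np.array(merge_list, dtype=float)
    fig, ax = plt.subplots()
    dendrogram(linkage, ax=ax)
    fig.savefig(path)
    plt.close(fig)
    return True


if __name__ == "__main__":
    out = os.path.join(tempfile.gettempdir(), "dendrogram.png")
    print(save_dendrogram(SAMPLE_MERGE_LIST, out))
```

Keeping the plotting behind a guarded import means the example itself would still run on a PySpark installation without matplotlib.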





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634674
  
--- Diff: docs/mllib-clustering.md ---
@@ -154,6 +156,175 @@ section of the Spark
 Quick Start guide. Be sure to also include *spark-mllib* to your build file as
 a dependency.
 
+
+### Hierarchical Clustering
+
+MLlib supports
+[hierarchical clustering](http://en.wikipedia.org/wiki/Hierarchical_clustering), one of the most commonly used clustering algorithms, which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types.
+One is agglomerative clustering, a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+The other is divisive clustering, a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+The MLlib implementation only includes a divisive hierarchical clustering algorithm.
+
+The implementation in MLlib has the following parameters:
+
+* *k* is the maximum number of desired clusters.
+* *subIterations* is the maximum number of iterations used to split a cluster into its two sub-clusters.
+* *numRetries* is the maximum number of retries if a split doesn't work as expected.
+* *epsilon* is the threshold below which a split is considered to have converged.
+
+
+
+### Hierarchical Clustering Example
+
+
+
+
+The following code snippets can be executed in `spark-shell`.
+
+In the following example, after loading and parsing data,
+we use the hierarchical clustering object to cluster the sample data into three clusters.
--- End diff --

Clarify that this means three clusters at the bottom-most levels of a 
hierarchical tree.
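
To make the distinction concrete: in a binary cluster tree, asking for three clusters yields a tree with five nodes, and only its three leaves are the resulting clusters. A small sketch using a hypothetical `Node` class (a stand-in for illustration, not the PR's `ClusterTree`):

```python
class Node(object):
    """Hypothetical stand-in for a node of a cluster tree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def is_leaf(self):
        return not self.children


def leaves(node):
    """Return the bottom-most nodes (the actual clusters), left to right."""
    if node.is_leaf():
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def count_nodes(node):
    """Count every node in the tree, internal nodes included."""
    return 1 + sum(count_nodes(child) for child in node.children)


# root is split into a and b; b is then split into b1 and b2
root = Node("root", [Node("a"), Node("b", [Node("b1"), Node("b2")])])

leaf_names = [n.name for n in leaves(root)]
print(leaf_names)         # ['a', 'b1', 'b2']
print(count_nodes(root))  # 5
```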





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22634203
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -88,6 +92,162 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
 return KMeansModel([c.toArray() for c in centers])
 
 
+class HierarchicalClusteringModel(object):
+
+"""A clustering model derived from the hierarchical clustering method.
+
+>>> from numpy import array
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4,2)
+>>> train_rdd = sc.parallelize(data)
+>>> model = HierarchicalClustering.train(train_rdd, 2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+True
+>>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
+True
+>>> x = model.predict(data[0])
+>>> type(x)
+
+>>> predicted_rdd = model.predict(train_rdd)
+>>> type(predicted_rdd)
+
+>>> predicted_rdd.collect() == [0, 0, 1, 1]
+True
+>>> sparse_data = [
+... SparseVector(3, {1: 1.0}),
+... SparseVector(3, {1: 1.1}),
+... SparseVector(3, {2: 1.0}),
+... SparseVector(3, {2: 1.1})
+... ]
+>>> train_rdd = sc.parallelize(sparse_data)
+>>> model = HierarchicalClustering.train(train_rdd, 2, numRetries=100)
+>>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
+True
+>>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
+True
+>>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
+True
+>>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
+True
+>>> x = model.predict(array([0., 1., 0.]))
+>>> type(x)
+
+>>> predicted_rdd = model.predict(train_rdd)
+>>> type(predicted_rdd)
+
+>>> (predicted_rdd.collect() == [0, 0, 1, 1]
+... or predicted_rdd.collect() == [1, 1, 0, 0] )
+True
+>>> type(model.clusterCenters)
+
+"""
+
+def __init__(self, sc, java_model, centers):
+"""
+:param sc:  Spark context
+:param java_model:  Handle to Java model object
+:param centers: the cluster centers
+"""
+self._sc = sc
+self._java_model = java_model
+self.centers = centers
+
+def __del__(self):
+self._sc._gateway.detach(self._java_model)
+
+@property
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy arrays."""
+return self.centers
+
+def predict(self, x):
+"""Predict the closest cluster index
+
+:param x: an ndarray, a SparseVector, or an RDD of vectors
+:return: the closest cluster index, or an RDD of ints holding the closest cluster index of each point
+"""
+if isinstance(x, ndarray) or isinstance(x, Vector):
+return self.__predict_by_array(x)
+elif isinstance(x, RDD):
+return self.__predict_by_rdd(x)
+else:
+# type(x) is not a string, so format it instead of concatenating
+raise TypeError('Invalid input data type x: %s' % type(x))
+
+def __predict_by_array(self, x):
+"""Predict the closest cluster index with an ndarray or a SparseVector
+
+:param x: a vector
+:return: the closest cluster index
+"""
+ser = PickleSerializer()
+bytes = bytearray(ser.dumps(_convert_to_vector(x)))
+vec = self._sc._jvm.SerDe.loads(bytes)
+result = self._java_model.predict(vec)
+return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
+
+def __predict_by_rdd(self, x):
+"""Predict the closest cluster index with an RDD
+:param x: an RDD of vectors
+:return: an RDD of ints
+"""
+ser = PickleSerializer()
+cached = x.map(_convert_to_vector)._reserialize(AutoBatchedSerializer(ser)).cache()
+rdd = _to_java_object_rdd(cached)
+jrdd = self._java_model.predict(rdd)
+jpyrdd = self._sc._jvm.SerDe.javaToPython(jrdd)
+return RDD(jpyrdd, self._sc, AutoBatchedSerializer(PickleSerializer()))
+
+def cut(self, height):
+"""Cut nodes and leaves in a cluster tree by a dendrogram height.
+:param height: a threshold to cut a cluster tree
+"""
+ser = PickleSerializer()
+model = self._java_model.cut(height)
+bytes = self._sc._jvm.SerDe.dumps(model.getCenters())
+centers = ser.loads(str(bytes))
+return HierarchicalClusteringModel(se

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22633997
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * this class is used for the model of the hierarchical clustering
+ *
+ * @param clusterTree a cluster as a tree node
+ * @param isTrained if the model has been trained, the flag is true
+ */
+class HierarchicalClusteringModel private (
+  val clusterTree: ClusterTree,
+  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
+
+  def this(clusterTree: ClusterTree) = this(clusterTree, false)
+
+  override def clone(): HierarchicalClusteringModel = {
+new HierarchicalClusteringModel(this.clusterTree.clone(), true)
+  }
+
+  /**
+   * Cuts a cluster tree by given threshold of dendrogram height
+   *
+   * @param height a threshold to cut a cluster tree
+   * @return a hierarchical clustering model
+   */
+  def cut(height: Double): HierarchicalClusteringModel = {
+val cloned = this.clone()
+cloned.clusterTree.cut(height)
+cloned
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(vector: Vector): Int = {
+// TODO: support distance metrics other than the Euclidean distance
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+this.clusterTree.assignClusterIndex(metric)(vector)
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
+val startTime = System.currentTimeMillis() // to measure the execution time
+val sc = data.sparkContext
+
+// TODO: support distance metrics other than the Euclidean distance
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+// keep the broadcast handles and read them inside the closure;
+// otherwise the tree is still serialized with every task
+val bcMetric = sc.broadcast(metric)
+val bcTreeRoot = sc.broadcast(this.clusterTree)
+val predicted = data.map(point => (bcTreeRoot.value.assignClusterIndex(bcMetric.value)(point), point))
+
+val predictTime = System.currentTimeMillis() - startTime
+logInfo(s"Predicting Time: ${predictTime.toDouble / 1000} [sec]")
+
+predicted
+  }
+
+  /** Maps given points to their cluster indices. */
+  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
+predict(points.rdd).map(_._1).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]
+
+  /**
+   * Computes the sum of the variances of all clusters
+   */
+  def getSumOfVariance(): Double = this.getClusters().map(_.getVariance().get).sum
+
+  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
+
+  def getCenters(): Array[Vector] = getClusters().map(_.center)
+
+  /**
+   * Converts a clustering merging list
+   * Returned data format is fit for scipy's dendrogram function
+   * SEE ALSO: scipy.cluster.hierarchy.dendrogram
+   *
+   * @return List[(node1, node2, distance, tree size)]
+   */
+  def toMergeList(): List[(Int, Int, Double, Int)] = {
--- End diff --

Consider renaming -> `toLinkageMatrix`? I think that's a more general term 
for this data structure. Would require renaming here and elsewhere (e.g. in the 
Python code).



[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22633951
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * this class is used for the model of the hierarchical clustering
+ *
+ * @param clusterTree a cluster as a tree node
+ * @param isTrained if the model has been trained, the flag is true
+ */
+class HierarchicalClusteringModel private (
+  val clusterTree: ClusterTree,
+  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
+
+  def this(clusterTree: ClusterTree) = this(clusterTree, false)
+
+  override def clone(): HierarchicalClusteringModel = {
+new HierarchicalClusteringModel(this.clusterTree.clone(), true)
+  }
+
+  /**
+   * Cuts a cluster tree by given threshold of dendrogram height
+   *
+   * @param height a threshold to cut a cluster tree
+   * @return a hierarchical clustering model
+   */
+  def cut(height: Double): HierarchicalClusteringModel = {
+val cloned = this.clone()
+cloned.clusterTree.cut(height)
+cloned
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(vector: Vector): Int = {
+// TODO: support distance metrics other than the Euclidean distance
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+this.clusterTree.assignClusterIndex(metric)(vector)
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
+val startTime = System.currentTimeMillis() // to measure the execution time
+val sc = data.sparkContext
+
+// TODO: support distance metrics other than the Euclidean distance
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+// keep the broadcast handles and read them inside the closure;
+// otherwise the tree is still serialized with every task
+val bcMetric = sc.broadcast(metric)
+val bcTreeRoot = sc.broadcast(this.clusterTree)
+val predicted = data.map(point => (bcTreeRoot.value.assignClusterIndex(bcMetric.value)(point), point))
+
+val predictTime = System.currentTimeMillis() - startTime
+logInfo(s"Predicting Time: ${predictTime.toDouble / 1000} [sec]")
+
+predicted
+  }
+
+  /** Maps given points to their cluster indices. */
+  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
+predict(points.rdd).map(_._1).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]
+
+  /**
+   * Computes the sum of the variances of all clusters
+   */
+  def getSumOfVariance(): Double = this.getClusters().map(_.getVariance().get).sum
+
+  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
+
+  def getCenters(): Array[Vector] = getClusters().map(_.center)
+
+  /**
+   * Converts a clustering merging list
+   * Returned data format is fit for scipy's dendrogram function
--- End diff --

I think it's a little weird to justify this based on a connection to SciPy,
and to reference that code so explicitly. This is primarily Scala code, after
all =) More importantly, the basic logic of this data structure is quite
general, and is used in at least SciPy and MATLAB (and possibly also R?). I'd
instead give a longer description of how the list is organized here in the doc,
and maybe mention that it is used by other libraries.
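
For reference, a sketch of how such a list is usually organized, following the convention SciPy and MATLAB use for linkage matrices: the n leaves are numbered 0 to n-1, the i-th merge creates node n+i, and each row is (node1, node2, distance, size). The nested-tuple tree encoding and the helper name `to_linkage` are assumptions for illustration and are not taken from the PR. The sketch also assumes that every parent sits higher than its children.

```python
def to_linkage(tree):
    """Convert a binary tree into linkage rows (node1, node2, distance, size).

    A leaf is a string; an internal node is a tuple (height, left, right).
    This encoding is hypothetical and only serves the example.
    """
    leaf_ids = {}
    internal = []

    def visit(node):
        if isinstance(node, str):
            leaf_ids[node] = len(leaf_ids)  # leaves get ids 0 .. n-1
        else:
            _, left, right = node
            visit(left)
            visit(right)
            internal.append(node)  # children are recorded before parents

    visit(tree)
    n = len(leaf_ids)
    # merges are listed from the lowest node to the highest one
    internal.sort(key=lambda node: node[0])

    ids = dict(leaf_ids)
    sizes = {name: 1 for name in leaf_ids}
    rows = []
    for i, node in enumerate(internal):
        height, left, right = node
        size = sizes[left] + sizes[right]
        rows.append((ids[left], ids[right], height, size))
        ids[node] = n + i  # the i-th merge creates node n + i
        sizes[node] = size
    return rows


# "a" and "b" merge at height 1.0; that pair merges with "c" at height 4.0
example = (4.0, "c", (1.0, "a", "b"))
rows = to_linkage(example)
for row in rows:
    print(row)
```

Here "c", "a" and "b" receive ids 0, 1 and 2 in visiting order, the first row merges leaves 1 and 2 into node 3, and the second row merges leaf 0 with node 3.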



[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22633847
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * this class is used for the model of the hierarchical clustering
+ *
+ * @param clusterTree a cluster as a tree node
+ * @param isTrained if the model has been trained, the flag is true
+ */
+class HierarchicalClusteringModel private (
+  val clusterTree: ClusterTree,
+  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
+
+  def this(clusterTree: ClusterTree) = this(clusterTree, false)
+
+  override def clone(): HierarchicalClusteringModel = {
+new HierarchicalClusteringModel(this.clusterTree.clone(), true)
+  }
+
+  /**
+   * Cuts a cluster tree by given threshold of dendrogram height
+   *
+   * @param height a threshold to cut a cluster tree
+   * @return a hierarchical clustering model
+   */
+  def cut(height: Double): HierarchicalClusteringModel = {
+val cloned = this.clone()
+cloned.clusterTree.cut(height)
+cloned
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(vector: Vector): Int = {
+// TODO: support distance metrics other than the Euclidean distance
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+this.clusterTree.assignClusterIndex(metric)(vector)
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
+val startTime = System.currentTimeMillis() // to measure the execution time
+val sc = data.sparkContext
+
+// TODO: support distance metrics other than the Euclidean distance
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+// keep the broadcast handles and read them inside the closure;
+// otherwise the tree is still serialized with every task
+val bcMetric = sc.broadcast(metric)
+val bcTreeRoot = sc.broadcast(this.clusterTree)
+val predicted = data.map(point => (bcTreeRoot.value.assignClusterIndex(bcMetric.value)(point), point))
+
+val predictTime = System.currentTimeMillis() - startTime
+logInfo(s"Predicting Time: ${predictTime.toDouble / 1000} [sec]")
+
+predicted
+  }
+
+  /** Maps given points to their cluster indices. */
+  def predict(points: JavaRDD[Vector]): JavaRDD[java.lang.Integer] =
+predict(points.rdd).map(_._1).toJavaRDD().asInstanceOf[JavaRDD[java.lang.Integer]]
+
+  /**
+   * Computes the sum of the variances of all clusters
+   */
+  def getSumOfVariance(): Double = this.getClusters().map(_.getVariance().get).sum
+
+  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
+
+  def getCenters(): Array[Vector] = getClusters().map(_.center)
+
+  /**
+   * Converts a clustering merging list
+   * Returned data format is fit for scipy's dendrogram function
+   * SEE ALSO: scipy.cluster.hierarchy.dendrogram
+   *
+   * @return List[(node1, node2, distance, tree size)]
+   */
+  def toMergeList(): List[(Int, Int, Double, Int)] = {
+val seq = this.clusterTree.toSeq().sortWith{ case (a, b) => a.getHeight() < b.getHeight()}
+val leaves = seq.filter(_.isLeaf())
+val nodes = seq.filter(!_.isLeaf()).filter(_.children.size > 1)
+val clusters = leaves ++ nodes
+val treeMap = clusters.zipWithIndex.map { case (tree, idx) => (tree -> idx)}.toMap
+
+// If a node only has one-child, the child is regarded as the cluster of the child.
  

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22633778
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
--- End diff --

Give a slightly longer overview of how the algorithm works?
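
For context on what such an overview would cover, here is a minimal local sketch of bisecting k-means in the spirit of the Steinbach et al. paper cited in the doc comment. It is not the code in this PR: it works on in-memory arrays rather than RDDs, and `BisectingKMeansSketch`, `bisect`, `sse`, and the rule of always splitting the cluster with the largest sum of squared errors are assumptions made for illustration (the paper also discusses splitting the largest cluster).

```scala
import scala.util.Random

// Hypothetical, local-only illustration of bisecting k-means (no Spark involved).
object BisectingKMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty set of points
  def mean(points: Seq[Point]): Point = {
    val sums = new Array[Double](points.head.length)
    for (p <- points; i <- sums.indices) sums(i) += p(i)
    sums.map(_ / points.size)
  }

  // Sum of squared distances to the cluster mean, used to pick the next cluster to split
  def sse(points: Seq[Point]): Double = {
    val center = mean(points)
    points.map(sqDist(_, center)).sum
  }

  // Plain 2-means on one cluster, seeded with two randomly chosen members
  def bisect(points: Seq[Point], iterations: Int, rng: Random): (Seq[Point], Seq[Point]) = {
    val seeds = rng.shuffle(points).take(2)
    var c1 = seeds(0)
    var c2 = seeds(1)
    var split = points.partition(p => sqDist(p, c1) <= sqDist(p, c2))
    var i = 0
    while (i < iterations && split._1.nonEmpty && split._2.nonEmpty) {
      c1 = mean(split._1)
      c2 = mean(split._2)
      split = points.partition(p => sqDist(p, c1) <= sqDist(p, c2))
      i += 1
    }
    split
  }

  // Start with one cluster holding everything and keep splitting until
  // numClusters is reached or no cluster can be split any further
  def run(data: Seq[Point], numClusters: Int, iterations: Int = 20, seed: Long = 1L): Seq[Seq[Point]] = {
    val rng = new Random(seed)
    var clusters = Vector(data)
    var splittable = true
    while (clusters.size < numClusters && splittable) {
      val candidates = clusters.filter(_.size >= 2)
      if (candidates.isEmpty) {
        splittable = false
      } else {
        val target = candidates.maxBy(sse)
        val (left, right) = bisect(target, iterations, rng)
        if (left.isEmpty || right.isEmpty) splittable = false
        else clusters = clusters.filterNot(_ eq target) ++ Vector(left, right)
      }
    }
    clusters
  }
}
```

The order in which clusters are split is what gives the result its tree structure: each split adds two children under the cluster that was divided.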





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22633758
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22633425
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632919
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632804
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632678
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+Array(
+  s"numClusters:${numClusters}",
+  s"subIterations:${subIterations}",
+  s"numRetries:${numRetries}",
+  s"epsilon:${epsilon}",
+  s"randomSeed:${randomSeed}",
+  s"randomRange:${randomRange}"
+).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${this}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632686
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * this class is used for the model of the hierarchical clustering
+ *
+ * @param clusterTree a cluster as a tree node
+ * @param isTrained if the model has been trained, the flag is true
+ */
+class HierarchicalClusteringModel private (
+  val clusterTree: ClusterTree,
+  private[mllib] var isTrained: Boolean) extends Serializable with Logging with Cloneable {
+
+  def this(clusterTree: ClusterTree) = this(clusterTree, false)
+
+  override def clone(): HierarchicalClusteringModel = {
+new HierarchicalClusteringModel(this.clusterTree.clone(), true)
+  }
+
+  /**
+   * Cuts a cluster tree by given threshold of dendrogram height
+   *
+   * @param height a threshold to cut a cluster tree
+   * @return a hierarchical clustering model
+   */
+  def cut(height: Double): HierarchicalClusteringModel = {
+val cloned = this.clone()
+cloned.clusterTree.cut(height)
+cloned
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(vector: Vector): Int = {
+// TODO Supports distance metrics other Euclidean distance metric
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+this.clusterTree.assignClusterIndex(metric)(vector)
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
+val startTime = System.currentTimeMillis() // to measure the execution time
+val sc = data.sparkContext
+
+// TODO Supports distance metrics other Euclidean distance metric
+val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+val treeRoot = this.clusterTree
+sc.broadcast(metric)
--- End diff --

Not output, see my other note about `sc.broadcast`.
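
The other note is not part of this message, so the following is only one plausible reading: `sc.broadcast(metric)` returns a `Broadcast` handle, and calling it without keeping that handle has no useful effect. A minimal sketch of the usual pattern, assuming Spark is on the classpath, keeps the handle and reads it through `.value` inside the closure. `BroadcastSketch`, `assign`, and `euclidean` are hypothetical names, and plain center vectors stand in for the cluster tree used in the PR.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical illustration; not the predict method from the pull request.
object BroadcastSketch {
  // Euclidean distance between two MLlib vectors of equal size
  def euclidean(a: Vector, b: Vector): Double =
    math.sqrt(a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Pairs each point with the index of its closest center
  def assign(centers: Array[Vector], data: RDD[Vector]): RDD[(Int, Vector)] = {
    // Keep the handle returned by broadcast; discarding it would make the call pointless
    val bcCenters = data.sparkContext.broadcast(centers)
    data.map { point =>
      val cs = bcCenters.value // read the broadcast value on the executor
      val closest = cs.indices.minBy(i => euclidean(cs(i), point))
      (closest, point)
    }
  }
}
```

A small function such as the distance metric is already shipped with the task closure, so broadcasting mainly pays off for larger read-only data such as the centers or the tree.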





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632654
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+this.subIterations = subIterations
+this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
--- End diff --

Indent by 2 spaces.
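
The quoted diff has lost part of its indentation in this archive, so the exact line the comment targets is an assumption; it most likely refers to the `extends` line. A minimal sketch of the layout commonly used in Spark code for a class header that spans several lines (parameters indented by 4 spaces, `extends` by 2), with a hypothetical stand-in class rather than the one from the PR:

```scala
// Hypothetical stand-in class, shown only to illustrate the indentation
class IndentationExample(
    private var numClusters: Int,
    private var epsilon: Double)
  extends Serializable {

  override def toString: String = s"numClusters:$numClusters, epsilon:$epsilon"
}
```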





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632647
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+    this.numClusters = numClusters
+    this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+    this.numRetries = numRetries
+    this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+    this.subIterations = subIterations
+    this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+    this.epsilon = epsilon
+    this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+    this.randomSeed = seed
+    this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+    this.randomRange = range
+    this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
--- End diff --

Indent these variable definition lines by 4 spaces.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632512
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * This trait is used for the configuration of the hierarchical clustering
+ */
+sealed
+trait HierarchicalClusteringConf extends Serializable {
+  this: HierarchicalClustering =>
+
+  def setNumClusters(numClusters: Int): this.type = {
+    this.numClusters = numClusters
+    this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setNumRetries(numRetries: Int): this.type = {
+    this.numRetries = numRetries
+    this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def setSubIterations(subIterations: Int): this.type = {
+    this.subIterations = subIterations
+    this
+  }
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+    this.epsilon = epsilon
+    this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+    this.randomSeed = seed
+    this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+    this.randomRange = range
+    this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * The main idea of this algorithm is derived from:
+ * "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClustering(
+  private[mllib] var numClusters: Int,
+  private[mllib] var subIterations: Int,
+  private[mllib] var numRetries: Int,
+  private[mllib] var epsilon: Double,
+  private[mllib] var randomSeed: Int,
+  private[mllib] var randomRange: Double)
+extends Serializable with Logging with HierarchicalClusteringConf {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(20, 20, 10, 10E-4, 1, 0.1)
+
+  /** Shows the parameters */
+  override def toString(): String = {
+    Array(
+      s"numClusters:${numClusters}",
+      s"subIterations:${subIterations}",
+      s"numRetries:${numRetries}",
+      s"epsilon:${epsilon}",
+      s"randomSeed:${randomSeed}",
+      s"randomRange:${randomRange}"
+    ).mkString(", ")
+  }
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+    validateData(data)
+    logInfo(s"Run with ${this}")
+
+    val startTime = System.currentTimeMillis() // to measure the execution time
+    val clusterTree = ClusterTree.fromRDD(data) // make the root node
+    val model

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632265
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -88,6 +92,162 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
         return KMeansModel([c.toArray() for c in centers])
 
 
+class HierarchicalClusteringModel(object):
+
+    """A clustering model derived from the hierarchical clustering method.
+
+    >>> from numpy import array
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4,2)
+    >>> train_rdd = sc.parallelize(data)
+    >>> model = HierarchicalClustering.train(train_rdd, 2)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
+    True
+    >>> x = model.predict(data[0])
+    >>> type(x)
+
+    >>> predicted_rdd = model.predict(train_rdd)
+    >>> type(predicted_rdd)
+
+    >>> predicted_rdd.collect() == [0, 0, 1, 1]
+    True
+    >>> sparse_data = [
+    ... SparseVector(3, {1: 1.0}),
+    ... SparseVector(3, {1: 1.1}),
+    ... SparseVector(3, {2: 1.0}),
+    ... SparseVector(3, {2: 1.1})
+    ... ]
+    >>> train_rdd = sc.parallelize(sparse_data)
+    >>> model = HierarchicalClustering.train(train_rdd, 2, numRetries=100)
+    >>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
+    True
+    >>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
+    True
+    >>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
+    True
+    >>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
+    True
+    >>> x = model.predict(array([0., 1., 0.]))
+    >>> type(x)
+
+    >>> predicted_rdd = model.predict(train_rdd)
+    >>> type(predicted_rdd)
+
+    >>> (predicted_rdd.collect() == [0, 0, 1, 1]
+    ... or predicted_rdd.collect() == [1, 1, 0, 0] )
+    True
+    >>> type(model.clusterCenters)
+
+    """
+
+    def __init__(self, sc, java_model, centers):
+        """
+        :param sc:  Spark context
+        :param java_model:  Handle to Java model object
+        :param centers: the cluster centers
+        """
+        self._sc = sc
+        self._java_model = java_model
+        self.centers = centers
+
+    def __del__(self):
+        self._sc._gateway.detach(self._java_model)
+
+    @property
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return self.centers
+
+    def predict(self, x):
+        """Predict the closest cluster index
+
+        :param x: a ndarray of list, a SparseVector or RDD[SparseVector]
+        :return: the closest index or a RDD of int which means the closest index
+        """
+        if isinstance(x, ndarray) or isinstance(x, Vector):
+            return self.__predict_by_array(x)
+        elif isinstance(x, RDD):
+            return self.__predict_by_rdd(x)
+        else:
+            print 'Invalid input data type x:' + type(x)
+
+    def __predict_by_array(self, x):
+        """Predict the closest cluster index with an ndarray or an SparseVector
+
+        :param x: a vector
+        :return: the closest cluster index
+        """
+        ser = PickleSerializer()
+        bytes = bytearray(ser.dumps(_convert_to_vector(x)))
+        vec = self._sc._jvm.SerDe.loads(bytes)
+        result = self._java_model.predict(vec)
+        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
+
+    def __predict_by_rdd(self, x):
+        """Predict the closest cluster index with a RDD
+        :param x: a RDD of vector
+        :return: a RDD of int
+        """
+        ser = PickleSerializer()
+        cached = x.map(_convert_to_vector)._reserialize(AutoBatchedSerializer(ser)).cache()
+        rdd = _to_java_object_rdd(cached)
+        jrdd = self._java_model.predict(rdd)
+        jpyrdd = self._sc._jvm.SerDe.javaToPython(jrdd)
+        return RDD(jpyrdd, self._sc, AutoBatchedSerializer(PickleSerializer()))
+
+    def cut(self, height):
--- End diff --

This currently breaks if an integer is passed as `height` (which is likely 
to be common). For example, after creating the model from the example, I got an 
error when calling `model.cut(4)` but not `model.cut(4.0)`. The simplest fix is 
probably to cast the input to a float here.
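
A minimal sketch of the suggested cast, assuming `cut` only needs a numeric height. `normalize_height` is a hypothetical helper name used for illustration and is not part of the PR:

```python
def normalize_height(height):
    """Coerce a dendrogram cut height to float.

    Hypothetical helper: accepts ints and floats and rejects everything
    else, including bool (which is a subclass of int in Python).
    """
    if isinstance(height, bool) or not isinstance(height, (int, float)):
        raise TypeError("height must be a number, got %s" % type(height).__name__)
    return float(height)


# An integer and a float describing the same height end up identical.
print(normalize_height(4) == normalize_height(4.0))
```

Calling a helper like this at the top of `cut` would presumably make `model.cut(4)` and `model.cut(4.0)` behave the same way.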



[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632220
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/mllib/JavaHierarchicalClustering.java
 ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.clustering.HierarchicalClustering;
+import org.apache.spark.mllib.clustering.HierarchicalClusteringModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+public class JavaHierarchicalClustering {
--- End diff --

Would it be possible to also add a similar example in Scala? At least for 
MLlib, there are examples for almost all algorithms in Scala, and then a subset 
of examples in Java.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632194
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringSuite.scala
 ---
@@ -0,0 +1,330 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
--- End diff --

Import formatting, see other comment.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632182
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
--- End diff --

Import formatting, see other comment.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632172
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,627 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV, norm => breezeNorm}
--- End diff --

There should be a line separating third-party imports (e.g. breeze) from 
spark imports. And within each group, imports should be ordered alphabetically.
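
As a rough illustration of that layout rule, here is a sketch in Python (not a Spark tool). `group_imports` and `SPARK_PREFIX` are hypothetical names, and plain case-sensitive string ordering is an assumption about what "alphabetically" means here:

```python
SPARK_PREFIX = "import org.apache.spark"


def group_imports(import_lines):
    """Split import lines into a third-party group and a Spark group.

    Each group is sorted with plain (case-sensitive) string ordering, and a
    single blank line separates the groups when both are non-empty.
    """
    third_party = sorted(l for l in import_lines if not l.startswith(SPARK_PREFIX))
    spark = sorted(l for l in import_lines if l.startswith(SPARK_PREFIX))
    separator = [""] if third_party and spark else []
    return third_party + separator + spark


lines = [
    "import org.apache.spark.rdd.RDD",
    "import breeze.linalg.{DenseVector => BDV}",
    "import org.apache.spark.Logging",
]
print("\n".join(group_imports(lines)))
```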





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632146
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/mllib/JavaHierarchicalClustering.java
 ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib;
+
+import org.apache.spark.SparkConf;
--- End diff --

Imports should be ordered alphabetically.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-07 Thread freeman-lab
Github user freeman-lab commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r22632101
  
--- Diff: data/mllib/sample_hierarchical_data.csv ---
@@ -0,0 +1,150 @@
+5.1,3.5,1.4,0.2
--- End diff --

It might be nice if this could be parsed directly by `Vectors.parse`; that 
would only require adding a `[` at the start and a `]` at the end of each line.
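
A small sketch of that conversion, assuming each row of the sample file is a plain comma-separated list of numbers. `to_bracketed` is a hypothetical helper name, and the code only rewrites the text; it does not call Spark:

```python
def to_bracketed(csv_line):
    """Wrap one comma-separated row in square brackets."""
    return "[" + csv_line.strip() + "]"


# First row of the sample data file quoted in the diff above.
rows = ["5.1,3.5,1.4,0.2\n"]
for row in rows:
    print(to_bracketed(row))
```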





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-06 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-68870971
  
Thanks @mengxr @freeman-lab! :)





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-05 Thread freeman-lab
Github user freeman-lab commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-68794407
  
Hey all, thanks for the nudge =) I've been going through it, will get you 
feedback ASAP.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-05 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-68787775
  
@yu-iskw @rnowling, I asked @freeman-lab to make one pass on this PR. Let's 
ping him :)





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2015-01-05 Thread rnowling
Github user rnowling commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-68746596
  
@mengxr This PR has been lingering for a while.  What can we do to get it a 
little more attention?  Thanks!





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-6214
  
@srowen and @rnowling,
Sorry for the complicated commits. I have updated my source code. Could you 
review my PR?

- I fixed the issues you pointed out.
- I added a function that cuts the cluster tree of a trained hierarchical 
clustering model at a given dendrogram height.
- I rebased my PR onto the latest master branch and force-pushed my branch, 
because there were a few conflicts with master.

Thanks,
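
For readers unfamiliar with cutting a dendrogram, the idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `ClusterNode` and `cut_tree` are hypothetical names, and each node is assumed to carry the height at which it is split.

```python
class ClusterNode(object):
    """Hypothetical tree node: a name, a split height, and child nodes."""

    def __init__(self, name, height, children=()):
        self.name = name
        self.height = height
        self.children = list(children)


def cut_tree(node, height):
    """Return the nodes that act as clusters when the tree is cut at `height`.

    A node is kept whole if it is a leaf or if its split height does not
    exceed the cut height; otherwise the cut descends into its children.
    """
    if not node.children or node.height <= height:
        return [node]
    clusters = []
    for child in node.children:
        clusters.extend(cut_tree(child, height))
    return clusters


# Small example tree: root splits at 10.0, one child splits again at 3.0.
a, b, c = ClusterNode("a", 0.0), ClusterNode("b", 0.0), ClusterNode("c", 0.0)
ab = ClusterNode("ab", 3.0, [a, b])
root = ClusterNode("root", 10.0, [ab, c])
print([n.name for n in cut_tree(root, 5.0)])
```

A lower cut height yields more, smaller clusters; a cut at or above the root's height returns the root alone.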





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62332985
  
  [Test build #23125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23125/consoleFull) for PR 2906 at commit [`b0b061e`](https://github.com/apache/spark/commit/b0b061edc4c2ad42deda00bb664534e1334b50e5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62332987
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23125/
Test PASSed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62328415
  
  [Test build #23125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23125/consoleFull) for PR 2906 at commit [`b0b061e`](https://github.com/apache/spark/commit/b0b061edc4c2ad42deda00bb664534e1334b50e5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62325997
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23124/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62325994
  
  [Test build #23124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23124/consoleFull) for PR 2906 at commit [`cfdf842`](https://github.com/apache/spark/commit/cfdf8429bf4afb3e7a6a329dd285fe48429aec46).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62323346
  
  [Test build #23124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23124/consoleFull) for PR 2906 at commit [`cfdf842`](https://github.com/apache/spark/commit/cfdf8429bf4afb3e7a6a329dd285fe48429aec46).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62310162
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23121/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62310159
  
  [Test build #23121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23121/consoleFull) for PR 2906 at commit [`691c49a`](https://github.com/apache/spark/commit/691c49adf9751193f3b8928211e77d307ef44c37).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62307500
  
  [Test build #23121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23121/consoleFull) for PR 2906 at commit [`691c49a`](https://github.com/apache/spark/commit/691c49adf9751193f3b8928211e77d307ef44c37).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-09 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62302445
  
There are a few conflicts with the master branch. I will rebase my PR branch and 
then force-push it. 





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62147990
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23052/
Test PASSed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62147985
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23052/consoleFull) for PR 2906 at commit [`8355f95`](https://github.com/apache/spark/commit/8355f959f02ca67454c9cb070912480db0a44671).
 * This patch **passes all tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-11-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-62135443
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23052/consoleFull) for PR 2906 at commit [`8355f95`](https://github.com/apache/spark/commit/8355f959f02ca67454c9cb070912480db0a44671).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60931967
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22451/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60931955
  
  [Test build #22451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22451/consoleFull) for PR 2906 at commit [`825fbfb`](https://github.com/apache/spark/commit/825fbfbe62de7787d7b343f84036a4933b53e0ff).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60923278
  
  [Test build #22451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22451/consoleFull) for PR 2906 at commit [`825fbfb`](https://github.com/apache/spark/commit/825fbfbe62de7787d7b343f84036a4933b53e0ff).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60922265
  
@srowen I have finished modifying the source code you pointed out. Could 
you review it?





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60922004
  
  [Test build #22450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22450/consoleFull) for PR 2906 at commit [`e772fdf`](https://github.com/apache/spark/commit/e772fdf0318b87ae4c2c4cc728d82752036a67db).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60922006
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22450/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60921831
  
  [Test build #22450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22450/consoleFull) for PR 2906 at commit [`e772fdf`](https://github.com/apache/spark/commit/e772fdf0318b87ae4c2c4cc728d82752036a67db).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60921575
  
@mengxr I added a performance test for vector sparsity under "Experiment 
5: The Effects of Vector Sparsity". You can download the new results at the 
link below; please take a look.


https://issues.apache.org/jira/secure/attachment/12677880/benchmark-result.2014-10-29.html





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-29 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19535947
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -91,6 +99,58 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means||"
 return KMeansModel([c.toArray() for c in centers])
 
 
+class HierarchicalClusteringModel(ClusteringModel):
--- End diff --

I changed how `predict` is called in the Python code; it now uses the Java API.


https://github.com/yu-iskw/spark/commit/8aa6a00f0dd53a5913be668a64332fc050314040





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60575130
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22291/
Test PASSed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60575123
  
  [Test build #22291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22291/consoleFull) for PR 2906 at commit [`8be11da`](https://github.com/apache/spark/commit/8be11da1f045e9ffc8c56886eea7c133aefe3eaf).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class ClusteringModel(object):`
  * `class KMeansModel(ClusteringModel):`
  * `class HierarchicalClusteringModel(ClusteringModel):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19396916
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,549 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * the configuration for a hierarchical clustering algorithm
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed the seed used in sampling data for initializing centers in each sub-iteration
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClusteringConf(
+  private var numClusters: Int,
+  private var subIterations: Int,
+  private var numRetries: Int,
+  private var epsilon: Double,
+  private var randomSeed: Int,
+  private[mllib] var randomRange: Double) extends Serializable {
+
+  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setSubIterations(iterations: Int): this.type = {
+this.subIterations = iterations
+this
+  }
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * @param conf the configuration class for the hierarchical clustering
+ */
+class HierarchicalClustering(val conf: HierarchicalClusteringConf)
+extends Serializable with Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(new HierarchicalClusteringConf())
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${conf.toString}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model = new HierarchicalClusteringModel(clusterTree)
+val statsUpdater = new ClusterTreeStatsUpdater()
+
+var node: Option[ClusterTree] = Some(model.clusterTree)
+statsUpdater(node.get)
+
+// If the following conditions are satisfied, then stop the training.
+//   1. There is no splittable cluster
+//   2. The number of the split clusters is greater than that of given clusters
+//   3. The total variance of all clusters increases when a cluster is split
+var totalVariance = Double.MaxValue
+var newTotalVariance = model.clusterTree.getVa
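
The quoted diff is cut off here, so the loop that applies the three stopping conditions listed in the code comment above is not visible. A minimal sketch of how such a check might look follows. It is written in Python for brevity, although the PR code is Scala. The function name `should_stop` and all of its parameters are hypothetical and do not come from the PR. The sketch also assumes that any one condition is enough to stop, and that reaching the requested number of clusters counts as satisfying condition 2 (the comment itself says "greater than").

```python
def should_stop(num_splittable, num_leaf_clusters, num_clusters,
                prev_total_variance, new_total_variance):
    """Return True if any of the three stopping conditions holds.

    All names here are illustrative assumptions, not the PR's API.
    """
    # 1. There is no cluster left that can be split.
    if num_splittable == 0:
        return True
    # 2. Enough clusters have been produced (assumption: ">=" rather than ">").
    if num_leaf_clusters >= num_clusters:
        return True
    # 3. Splitting made the total variance of all clusters go up.
    if new_total_variance > prev_total_variance:
        return True
    return False


# The diff initializes the previous total variance to Double.MaxValue,
# so condition 3 cannot fire on the first pass; float("inf") plays that role here.
print(should_stop(3, 5, 20, float("inf"), 9.5))
```

Starting the previous variance at the largest representable value means the first split is always attempted, and only later splits are compared against a real measurement.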

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19396869
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,549 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * the configuration for a hierarchical clustering algorithm
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed the seed used in sampling data for initializing centers in each sub-iteration
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClusteringConf(
+  private var numClusters: Int,
+  private var subIterations: Int,
+  private var numRetries: Int,
+  private var epsilon: Double,
+  private var randomSeed: Int,
+  private[mllib] var randomRange: Double) extends Serializable {
+
+  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setSubIterations(iterations: Int): this.type = {
+this.subIterations = iterations
+this
+  }
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * @param conf the configuration class for the hierarchical clustering
+ */
+class HierarchicalClustering(val conf: HierarchicalClusteringConf)
+extends Serializable with Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(new HierarchicalClusteringConf())
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${conf.toString}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model = new HierarchicalClusteringModel(clusterTree)
+val statsUpdater = new ClusterTreeStatsUpdater()
+
+var node: Option[ClusterTree] = Some(model.clusterTree)
+statsUpdater(node.get)
+
+// If the following conditions are satisfied, then stop the training.
+//   1. There is no splittable cluster
+//   2. The number of the split clusters is greater than that of given clusters
+//   3. The total variance of all clusters increases when a cluster is split
+var totalVariance = Double.MaxValue
+var newTotalVariance = model.clusterTree.getVa

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19396833
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,549 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * the configuration for a hierarchical clustering algorithm
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed the seed used in sampling data for initializing centers in each sub-iteration
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClusteringConf(
+  private var numClusters: Int,
+  private var subIterations: Int,
+  private var numRetries: Int,
+  private var epsilon: Double,
+  private var randomSeed: Int,
+  private[mllib] var randomRange: Double) extends Serializable {
+
+  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setSubIterations(iterations: Int): this.type = {
+this.subIterations = iterations
+this
+  }
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * @param conf the configuration class for the hierarchical clustering
+ */
+class HierarchicalClustering(val conf: HierarchicalClusteringConf)
+extends Serializable with Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(new HierarchicalClusteringConf())
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${conf.toString}")
--- End diff --

I added a `toString` method to `HierarchicalClustering`.

https://github.com/yu-iskw/spark/commit/2898c3fb0b99697f5600f584f7051b12830a75e0





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19396796
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala
 ---
@@ -0,0 +1,549 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * the configuration for a hierarchical clustering algorithm
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClusteringConf(
+  private var numClusters: Int,
+  private var subIterations: Int,
+  private var numRetries: Int,
+  private var epsilon: Double,
+  private var randomSeed: Int,
+  private[mllib] var randomRange: Double) extends Serializable {
+
+  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
+
+  def setNumClusters(numClusters: Int): this.type = {
--- End diff --

I changed `HierarchicalClusteringConf` from a class to a trait for 
`HierarchicalClustering`, and moved the class parameters, such as 
`numClusters`, to `HierarchicalClustering`.
I think that if the accessor methods for the algorithm were included in 
`HierarchicalClustering`, the class would get too large. So I delegated 
those methods to the trait.
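
As a rough illustration of the refactoring described above (not the code in the linked commit), the accessors can live in a trait while the class keeps the parameters. The names `ClusteringConfSketch` and `ClusteringSketch` are hypothetical, and only two of the parameters are shown:

```scala
// Hypothetical sketch: the trait holds only the accessor methods and
// declares the state it needs as abstract members.
trait ClusteringConfSketch extends Serializable {
  protected var numClusters: Int
  protected var epsilon: Double

  def setNumClusters(numClusters: Int): this.type = {
    this.numClusters = numClusters
    this
  }

  def getNumClusters(): Int = this.numClusters

  def setEpsilon(epsilon: Double): this.type = {
    this.epsilon = epsilon
    this
  }

  def getEpsilon(): Double = this.epsilon
}

// The algorithm class owns the parameters and mixes in the accessors,
// so its own body stays focused on the clustering logic.
class ClusteringSketch(
    protected var numClusters: Int,
    protected var epsilon: Double) extends ClusteringConfSketch {

  // Defaults copied from the quoted diff (20 clusters, epsilon 10E-6).
  def this() = this(20, 10E-6)
}
```

Because the setters return `this.type`, calls can still be chained on the concrete class, for example `new ClusteringSketch().setNumClusters(5).setEpsilon(1e-3)`.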


https://github.com/yu-iskw/spark/commit/2879b00c39880a4ffc29cefaaffde26df655e63f





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19396451
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModel.scala
 ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * this class is used for the model of the hierarchical clustering
+ *
+ * @param clusterTree a cluster as a tree node
+ * @param trainTime the milliseconds for executing a training
+ * @param predictTime the milliseconds for executing a prediction
+ * @param isTrained if the model has been trained, the flag is true
+ */
+class HierarchicalClusteringModel private (
+  val clusterTree: ClusterTree,
+  var trainTime: Int,
+  var predictTime: Int,
+  var isTrained: Boolean) extends Serializable {
+
+  def this(clusterTree: ClusterTree) = this(clusterTree, 0, 0, false)
+
+  def getClusters(): Array[ClusterTree] = clusterTree.getClusters().toArray
+
+  def getCenters(): Array[Vector] = getClusters().map(_.center)
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(vector: Vector): Int = {
+    // TODO Supports distance metrics other Euclidean distance metric
+    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+    this.clusterTree.assignClusterIndex(metric)(vector)
+  }
+
+  /**
+   * Predicts the closest cluster of each point
+   */
+  def predict(data: RDD[Vector]): RDD[(Int, Vector)] = {
+    val startTime = System.currentTimeMillis() // to measure the execution time
+
+    // TODO Supports distance metrics other Euclidean distance metric
+    val metric = (bv1: BV[Double], bv2: BV[Double]) => breezeNorm(bv1 - bv2, 2.0)
+    val centers = getClusters().map(_.center.toBreeze)
+    val treeRoot = this.clusterTree
+    val closestClusterIndexFinder = treeRoot.assignClusterIndex(metric) _
+    data.sparkContext.broadcast(closestClusterIndexFinder)
+    val predicted = data.map(point => (closestClusterIndexFinder(point), point))
--- End diff --

I modified the way `broadcast` is used.

https://github.com/yu-iskw/spark/commit/290d492c1c2d193ddf399b623fbdd97186bc1e75
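
For context, in the quoted diff the value returned by `broadcast` is discarded, so the `map` closure still captures `closestClusterIndexFinder` directly. A minimal sketch of the usual pattern follows. It is not the code from the commit above: `BroadcastSketch`, `predictWithBroadcast`, and the `assign` function are hypothetical stand-ins, and points are plain `Array[Double]` to keep the example small.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastSketch {
  // `assign` stands in for the closest-cluster finder built from the tree
  // (hypothetical; it must be serializable to be broadcast).
  def predictWithBroadcast(
      sc: SparkContext,
      data: RDD[Array[Double]],
      assign: Array[Double] => Int): RDD[(Int, Array[Double])] = {
    // Keep the handle returned by broadcast() ...
    val bcAssign = sc.broadcast(assign)
    // ... and read it through .value inside the closure, so tasks reference
    // the broadcast copy instead of capturing `assign` themselves.
    data.map(point => (bcAssign.value(point), point))
  }
}
```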





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60567740
  
  [Test build #22291 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22291/consoleFull)
 for   PR 2906 at commit 
[`8be11da`](https://github.com/apache/spark/commit/8be11da1f045e9ffc8c56886eea7c133aefe3eaf).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60566674
  
@mengxr thank you for your feedback.

> Is there a paper that you used as reference? If so, please cite it in the 
doc.

Yes. I added a comment about it to the doc.

https://github.com/yu-iskw/spark/commit/6b22f0752d5d692912c1e8a5e3390326e5d8ebc6

> Could you send some performance testing results on dense and sparse 
datasets?

So far I have only tested the performance on dense datasets. You can download 
the benchmark result from the URL below. However, because I changed the 
algorithm, I will test it again and send you the results.
https://issues.apache.org/jira/secure/attachment/12675783/benchmark2.html





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60566364
  
  [Test build #22290 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22290/consoleFull)
 for   PR 2906 at commit 
[`2676166`](https://github.com/apache/spark/commit/2676166ba6f307b4605ea1e7ecf6ece5b9e200b3).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class ClusteringModel(object):`
  * `class KMeansModel(ClusteringModel):`
  * `class HierarchicalClusteringModel(ClusteringModel):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60566367
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22290/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60566259
  
  [Test build #22290 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22290/consoleFull)
 for   PR 2906 at commit 
[`2676166`](https://github.com/apache/spark/commit/2676166ba6f307b4605ea1e7ecf6ece5b9e200b3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60547114
  
  [Test build #22270 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22270/consoleFull)
 for   PR 2906 at commit 
[`8dbbacd`](https://github.com/apache/spark/commit/8dbbacd2e7f27e111b7237006fde73d1cf3eb5e7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `class HierarchicalClusteringConf(`
  * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
  * `class ClusterTree(`
  * `class ClusteringModel(object):`
  * `class KMeansModel(ClusteringModel):`
  * `class HierarchicalClusteringModel(ClusteringModel):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60547118
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22270/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60546582
  
  [Test build #22267 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22267/consoleFull)
 for   PR 2906 at commit 
[`1a08510`](https://github.com/apache/spark/commit/1a0851079bf145939e665aa78f0e77b3995e6e66).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60546587
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22267/
Test PASSed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60546524
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22268/
Test PASSed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60546522
  
  [Test build #22268 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22268/consoleFull)
 for   PR 2906 at commit 
[`b014f50`](https://github.com/apache/spark/commit/b014f500112df597edfbe1a5cef8c02e06b1bbb0).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `class HierarchicalClusteringConf(`
  * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
  * `class ClusterTree(`
  * `class ClusteringModel(object):`
  * `class KMeansModel(ClusteringModel):`
  * `class HierarchicalClusteringModel(ClusteringModel):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60544649
  
  [Test build #22270 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22270/consoleFull)
 for   PR 2906 at commit 
[`8dbbacd`](https://github.com/apache/spark/commit/8dbbacd2e7f27e111b7237006fde73d1cf3eb5e7).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60543195
  
  [Test build #22268 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22268/consoleFull)
 for   PR 2906 at commit 
[`b014f50`](https://github.com/apache/spark/commit/b014f500112df597edfbe1a5cef8c02e06b1bbb0).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60542976
  
  [Test build #22267 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22267/consoleFull)
 for   PR 2906 at commit 
[`1a08510`](https://github.com/apache/spark/commit/1a0851079bf145939e665aa78f0e77b3995e6e66).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60464016
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22179/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60464011
  
  [Test build #22179 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22179/consoleFull)
 for   PR 2906 at commit 
[`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `class HierarchicalClusteringConf(`
  * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
  * `class ClusterTree(`
  * `class ClusteringModel(object):`
  * `class KMeansModel(ClusteringModel):`
  * `class HierarchicalClusteringModel(ClusteringModel):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60463744
  
@yu-iskw I added you to the whitelist. Future commits from you should 
trigger Jenkins automatically. Just took a very brief scan over the code and 
really appreciate the fact that more than half of the code is doc/test/example. 
I will check the implementation after the feature freeze. Some high-level 
questions for now:

1. Is there a paper that you used as reference? If so, please cite it in 
the doc.
2. Could you send some performance testing results on dense and sparse 
datasets?





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60463499
  
  [Test build #22177 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22177/consoleFull)
 for   PR 2906 at commit 
[`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `class HierarchicalClusteringConf(`
  * `class HierarchicalClustering(val conf: HierarchicalClusteringConf)`
  * `class ClusterTree(`
  * `class ClusteringModel(object):`
  * `class KMeansModel(ClusteringModel):`
  * `class HierarchicalClusteringModel(ClusteringModel):`
  * `class HierarchicalClustering(object):`






[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60463501
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22177/
Test FAILed.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60460891
  
  [Test build #22179 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22179/consoleFull)
 for   PR 2906 at commit 
[`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60460562
  
  [Test build #22177 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22177/consoleFull)
 for   PR 2906 at commit 
[`91a38e3`](https://github.com/apache/spark/commit/91a38e361ac89933cb6e774cd05624f20e7b0344).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60460354
  
ok to test





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60460281
  
Jenkins, add to whitelist.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2906#issuecomment-60268305
  
I just gave this a quick read-through, and the structure makes sense. I 
left several small comments. I see the chunks of logic I would expect, but did 
not evaluate it in detail. The existence of some tests suggests this probably 
basically works :) I am wondering about performance too as this relies on Scala 
idioms in many places; it might be worth a quick look with jprofiler if you can 
to see if there are any easy-win optimizations.
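
The comment does not say which idioms are meant, so the following is only an illustration of the kind of hot-path change a profiler might point to, not a claim about this PR's code. It compares a collection-based Euclidean distance with an equivalent `while` loop; `DistanceSketch` and both method names are hypothetical.

```scala
object DistanceSketch {
  // Collection-based version: concise, but zip and map each allocate an
  // intermediate collection on every call.
  def euclideanIdiomatic(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Loop version: same result, no intermediate collections.
  def euclideanLoop(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    math.sqrt(sum)
  }
}
```

Whether such a rewrite pays off depends on how often the function is called per point, which is what a profiler run would show.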





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19289245
  
--- Diff: 
mllib/src/test/java/org/apache/spark/mllib/clustering/JavaHierarchicalClusteringSuite.java
 ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering;
+
+import com.google.common.collect.Lists;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.Serializable;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+
+public class JavaHierarchicalClusteringSuite implements Serializable {
+private transient JavaSparkContext sc;
--- End diff --

Looks like this is using 4-space indent but should be 2.





[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19289138
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,549 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * the configuration for a hierarchical clustering algorithm
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClusteringConf(
+  private var numClusters: Int,
+  private var subIterations: Int,
+  private var numRetries: Int,
+  private var epsilon: Double,
+  private var randomSeed: Int,
+  private[mllib] var randomRange: Double) extends Serializable {
+
+  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setSubIterations(iterations: Int): this.type = {
+this.subIterations = iterations
+this
+  }
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * @param conf the configuration class for the hierarchical clustering
+ */
+class HierarchicalClustering(val conf: HierarchicalClusteringConf)
+extends Serializable with Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(new HierarchicalClusteringConf())
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${conf.toString}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model = new HierarchicalClusteringModel(clusterTree)
+val statsUpdater = new ClusterTreeStatsUpdater()
+
+var node: Option[ClusterTree] = Some(model.clusterTree)
+statsUpdater(node.get)
+
+// If the followed conditions are satisfied, and then stop the training.
+//   1. There is no splittable cluster
+//   2. The number of the splitted clusters is greater than that of given clusters
+//   3. The total variance of all clusters increases, when a cluster is splitted
+var totalVariance = Double.MaxValue
+var newTotalVariance = model.clusterTree.getVar
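
The quoted diff is cut off before the training loop, so the real check is not visible here. As a rough sketch only, the three stopping conditions listed in the code comment could be expressed as below. The assumptions: training stops when any one condition holds, condition 2 is read literally as "greater than", and `TrainingState`, `shouldStop`, and all field names are hypothetical and do not come from the PR.

```scala
object StopConditionSketch {
  // Hypothetical snapshot of the training state; none of these names are from the PR.
  final case class TrainingState(
      hasSplittableCluster: Boolean,
      numLeafClusters: Int,
      totalVariance: Double,
      newTotalVariance: Double)

  // Returns true when any of the three conditions from the code comment holds.
  def shouldStop(state: TrainingState, numClusters: Int): Boolean = {
    val noSplittable = !state.hasSplittableCluster                        // condition 1
    val tooManyClusters = state.numLeafClusters > numClusters             // condition 2
    val varianceIncreased = state.newTotalVariance > state.totalVariance  // condition 3
    noSplittable || tooManyClusters || varianceIncreased
  }
}
```

With the initial values shown in the diff (`totalVariance` set to `Double.MaxValue` and `newTotalVariance` set to the root variance), condition 3 is false on the first pass, so the first split is always attempted if the root is splittable.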

[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...

2014-10-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2906#discussion_r19288871
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/HierarchicalClustering.scala ---
@@ -0,0 +1,549 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV, Vector => BV, norm => breezeNorm}
+import org.apache.spark.Logging
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.random.XORShiftRandom
+
+/**
+ * the configuration for a hierarchical clustering algorithm
+ *
+ * @param numClusters the number of clusters you want
+ * @param subIterations the number of iterations at digging
+ * @param epsilon the threshold to stop the sub-iterations
+ * @param randomSeed uses in sampling data for initializing centers in each sub iterations
+ * @param randomRange the range coefficient to generate random points in each clustering step
+ */
+class HierarchicalClusteringConf(
+  private var numClusters: Int,
+  private var subIterations: Int,
+  private var numRetries: Int,
+  private var epsilon: Double,
+  private var randomSeed: Int,
+  private[mllib] var randomRange: Double) extends Serializable {
+
+  def this() = this(20, 5, 20, 10E-6, 1, 0.1)
+
+  def setNumClusters(numClusters: Int): this.type = {
+this.numClusters = numClusters
+this
+  }
+
+  def getNumClusters(): Int = this.numClusters
+
+  def setSubIterations(iterations: Int): this.type = {
+this.subIterations = iterations
+this
+  }
+
+  def setNumRetries(numRetries: Int): this.type = {
+this.numRetries = numRetries
+this
+  }
+
+  def getNumRetries(): Int = this.numRetries
+
+  def getSubIterations(): Int = this.subIterations
+
+  def setEpsilon(epsilon: Double): this.type = {
+this.epsilon = epsilon
+this
+  }
+
+  def getEpsilon(): Double = this.epsilon
+
+  def setRandomSeed(seed: Int): this.type = {
+this.randomSeed = seed
+this
+  }
+
+  def getRandomSeed(): Int = this.randomSeed
+
+  def setRandomRange(range: Double): this.type = {
+this.randomRange = range
+this
+  }
+}
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bi-sect k-means algorithm.
+ *
+ * @param conf the configuration class for the hierarchical clustering
+ */
+class HierarchicalClustering(val conf: HierarchicalClusteringConf)
+extends Serializable with Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  def this() = this(new HierarchicalClusteringConf())
+
+  /**
+   * Trains a hierarchical clustering model with the given configuration
+   *
+   * @param data training points
+   * @return a model for hierarchical clustering
+   */
+  def run(data: RDD[Vector]): HierarchicalClusteringModel = {
+validateData(data)
+logInfo(s"Run with ${conf.toString}")
+
+val startTime = System.currentTimeMillis() // to measure the execution time
+val clusterTree = ClusterTree.fromRDD(data) // make the root node
+val model = new HierarchicalClusteringModel(clusterTree)
+val statsUpdater = new ClusterTreeStatsUpdater()
+
+var node: Option[ClusterTree] = Some(model.clusterTree)
+statsUpdater(node.get)
+
+// If the followed conditions are satisfied, and then stop the training.
+//   1. There is no splittable cluster
+//   2. The number of the splitted clusters is greater than that of given clusters
+//   3. The total variance of all clusters increases, when a cluster is splitted
+var totalVariance = Double.MaxValue
+var newTotalVariance = model.clusterTree.getVar
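
The setters in the quoted configuration class return `this.type`, which is what lets calls be chained. A minimal sketch of that pattern follows, using a trimmed-down stand-in class (`ConfSketch` is a hypothetical name with only two of the fields, not the PR's class) so that it runs without Spark on the classpath.

```scala
object ConfChainingSketch {
  // Trimmed-down stand-in for the configuration class in the diff (hypothetical).
  class ConfSketch(private var numClusters: Int, private var epsilon: Double) {
    // Same literals as the corresponding defaults in the quoted diff.
    def this() = this(20, 10E-6)

    def setNumClusters(numClusters: Int): this.type = {
      this.numClusters = numClusters
      this
    }

    def getNumClusters: Int = numClusters

    def setEpsilon(epsilon: Double): this.type = {
      this.epsilon = epsilon
      this
    }

    def getEpsilon: Double = epsilon
  }

  // Each setter returns the receiver itself, so the calls chain.
  val conf: ConfSketch = new ConfSketch().setNumClusters(10).setEpsilon(1e-4)
}
```

One detail worth keeping in mind when reading the defaults: the literal `10E-6` evaluates to 1.0E-5, not 1.0E-6.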
