[ https://issues.apache.org/jira/browse/SPARK-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290634#comment-14290634 ]
Muhammad-Ali A'rabi edited comment on SPARK-3439 at 1/24/15 2:41 PM:
---------------------------------------------------------------------

Possible implementation:

{code:java}
import org.apache.spark.mllib.linalg._
import java.util.HashMap

// Sample data: three groups of points near the x, y and z unit vectors.
val vas = Array(
  Array(1.0, 0, 0), Array(1.1, 0, 0), Array(0.9, 0, 0),
  Array(0, 1.0, 0), Array(0.01, 1.01, 0),
  Array(0, 0, 1.0), Array(0, 0, 1.1))
val vs = vas.map(Vectors.dense(_))

// Loose (t1) and tight (t2) thresholds. Vectors.sqdist returns the squared
// Euclidean distance, so both are compared against squared distances below.
val t1 = 1.0
val t2 = 0.5

// starting canopy
val map = new HashMap[Vector, Vector]  // map from data to clusters
val set = new HashMap[Vector, Boolean] // the set: true while a point may still become a center
for (v <- vs) set.put(v, true)
for (v <- vs) {
  if (set.get(v)) {
    val dists = vs.map { x => (x, Vectors.sqdist(x, v)) }
    dists.foreach { case (x, d) =>
      if (d < t1) map.put(x, v)     // x joins the canopy centered at v
      if (d < t2) set.put(x, false) // x is too close to v to become a center
    }
  }
}
{code}

The algorithm currently works on arrays and lists, but all of them could be converted to RDDs.

was (Author: angellandros): the same comment, except that the code block was tagged {code:scala} instead of {code:java}.
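The loop in the comment above can also be sketched without mutable Java maps. The following is a minimal illustration only, not the proposed MLlib implementation: it drops the Spark dependency and uses plain Scala collections, and the names `CanopySketch`, `sqdist` and `canopies` are hypothetical. It also differs from the comment's code in one respect: it records every member of each canopy, whereas the comment's `map` keeps a single canopy per point. As in the comment, both thresholds are applied to squared distances.

```scala
import scala.annotation.tailrec

// Hypothetical sketch: canopy assignment on plain Scala collections.
object CanopySketch {
  type Point = Vector[Double]

  /** Squared Euclidean distance between two points of equal dimension. */
  def sqdist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /**
   * Maps each canopy center to all points whose squared distance to it is
   * below the loose threshold t1. Points whose squared distance to a center
   * is below the tight threshold t2 can no longer become centers themselves.
   */
  def canopies(points: Seq[Point], t1: Double, t2: Double): Map[Point, Seq[Point]] = {
    require(t1 > t2, "the loose threshold t1 must exceed the tight threshold t2")

    @tailrec
    def loop(candidates: List[Point], acc: Map[Point, Seq[Point]]): Map[Point, Seq[Point]] =
      candidates match {
        case Nil => acc
        case center :: rest =>
          // Every point near the center joins its canopy.
          val members = points.filter(p => sqdist(p, center) < t1)
          // Points very close to the center are dropped as future centers.
          val remaining = rest.filter(p => sqdist(p, center) >= t2)
          loop(remaining, acc + (center -> members))
      }

    loop(points.toList, Map.empty)
  }
}
```

Called with the seven sample points from the comment and `t1 = 1.0`, `t2 = 0.5`, `CanopySketch.canopies` yields three canopies, centered on the first point of each group.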
> Add Canopy Clustering Algorithm
> -------------------------------
>
>                 Key: SPARK-3439
>                 URL: https://issues.apache.org/jira/browse/SPARK-3439
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Yu Ishikawa
>            Assignee: Muhammad-Ali A'rabi
>            Priority: Minor
>
> The canopy clustering algorithm is an unsupervised pre-clustering algorithm.
> It is often used as a preprocessing step for the K-means algorithm or the
> Hierarchical clustering algorithm. It is intended to speed up clustering
> operations on large data sets, where using another algorithm directly may be
> impractical due to the size of the data set.