Re: K-nearest neighbors search in Spark
Any suggestion is very much appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: K-nearest neighbors search in Spark
Hi Carter,

In Spark 1.0 there will be an implementation of k-means available as part of MLlib. You can see the documentation for it below (until 1.0 is fully released):
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html

Maybe diving into the source here will help get you started?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

Cheers,
Andrew

On Tue, May 27, 2014 at 4:10 AM, Carter <gyz...@hotmail.com> wrote:
> Any suggestion is very much appreciated.
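[Editorial note: MLlib's `KMeans` needs a Spark build to run, but the clustering step it performs can be sketched in plain Scala. Below is a minimal single-iteration sketch of Lloyd's algorithm, the procedure `KMeans` implements; all names in it are mine, not MLlib's API.]

```scala
// Minimal sketch of one k-means (Lloyd's) step on 2-D points.
// Hypothetical helper names; this is not MLlib's API.
type Point = (Double, Double)

// Squared Euclidean distance (enough for nearest-centroid comparisons).
def dist2(a: Point, b: Point): Double =
  math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2)

// Assign each point to its nearest centroid, then recompute each
// centroid as the mean of its assigned cluster.
def lloydStep(points: Seq[Point], centroids: Seq[Point]): Seq[Point] =
  points
    .groupBy(p => centroids.minBy(c => dist2(p, c))) // cluster assignment
    .values
    .map { cluster =>
      (cluster.map(_._1).sum / cluster.size,
       cluster.map(_._2).sum / cluster.size)
    }
    .toSeq

val pts = Seq((1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5))
val updated = lloydStep(pts, Seq((0.0, 0.0), (10.0, 10.0)))
```

In MLlib itself you would not write this loop by hand; per the linked docs, the entry point is `KMeans.train` on an RDD of points, which runs many such iterations in parallel.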
Re: K-nearest neighbors search in Spark
Carter,

Just as a quick simple starting point for Spark (caveats: lots of improvements required for scaling, graceful and efficient handling of RDDs, et al.):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import scala.collection.immutable.SortedMap

object TopK {
  // def getCurrentDirectory = new java.io.File(".").getCanonicalPath

  def distance(x1: List[Int], x2: List[Int]): Double = {
    val dist: Double = math.sqrt(math.pow(x1(1) - x2(1), 2) + math.pow(x1(2) - x2(2), 2))
    dist
  }

  def main(args: Array[String]): Unit = {
    // println(getCurrentDirectory)
    val sc = new SparkContext("local", "TopK")  // or e.g. "spark://USS-Defiant.local:7077"
    println(s"Running Spark Version ${sc.version}")
    val file = sc.textFile("data01.csv")

    val data = file
      .map(line => line.split(","))
      .map(x1 => List(x1(0).toInt, x1(1).toInt, x1(2).toInt))
    // val data1 = data.collect
    println(data)
    for (d <- data) {
      println(d)
      println(d(0))
    }

    val distList = for (d <- data) yield { d(0) }
    // for (d <- distList) println(d)
    val zipList = for (a <- distList.collect; b <- distList.collect) yield { List(a, b) }
    zipList.foreach(println(_))

    val dist = for (l <- zipList) yield {
      println(s"${l(0)} -> ${l(1)}")
      val x1a: Array[List[Int]] = data.filter(d => d(0) == l(0)).collect
      val x2a: Array[List[Int]] = data.filter(d => d(0) == l(1)).collect
      val x1: List[Int] = x1a(0)
      val x2: List[Int] = x2a(0)
      val dist = distance(x1, x2)
      Map(dist -> l)
    }
    dist.foreach(println(_))
    // sort this for topK
  }
}

data01.csv:

1,68,93
2,12,90
3,45,76
4,86,54

HTH.

Cheers,
k/

On Tue, May 27, 2014 at 4:10 AM, Carter <gyz...@hotmail.com> wrote:
> Any suggestion is very much appreciated.
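[Editorial note: the distance-and-top-K logic in the message above can be sanity-checked locally without a Spark cluster. A plain-Scala version over the same data01.csv rows, with helper names of my choosing:]

```scala
// Plain-Scala check of the distance / top-K step; no Spark required.
// Rows mirror data01.csv: (id, x, y).
val rows = Seq((1, 68, 93), (2, 12, 90), (3, 45, 76), (4, 86, 54))

// Euclidean distance on the (x, y) coordinates.
def euclid(a: (Int, Int, Int), b: (Int, Int, Int)): Double =
  math.sqrt(math.pow(a._2 - b._2, 2) + math.pow(a._3 - b._3, 2))

// For one query id, sort the other points by distance to it and
// keep the ids of the k closest ("sort this for topK" made concrete).
def topK(id: Int, k: Int): Seq[Int] = {
  val q = rows.find(_._1 == id).get
  rows.filterNot(_._1 == id)
      .sortBy(r => euclid(q, r))
      .take(k)
      .map(_._1)
}
```

This sorts all pairwise distances per query, which is fine for a handful of rows but is exactly the part that needs a smarter approach (bounded heaps, partition-local top-K) at the millions-of-points scale the original question describes.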
K-nearest neighbors search in Spark
Hi all,

I want to implement a basic K-nearest-neighbors search in Spark, but I am totally new to Scala so I don't know where to start. My data consists of millions of points. For each point, I need to compute its Euclidean distance to the other points, and return the top-K points that are closest to it. The data.txt is in a comma-separated format like this:

ID, X, Y
1, 68, 93
2, 12, 90
3, 45, 76
...
100, 86, 54

Could you please tell me what data structure I should use, and how to implement this algorithm in Scala (*some sample code is greatly appreciated*)?

Thank you very much.

Regards,
Carter

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393.html
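[Editorial note: on the "what data structure" part of the question, a common choice for the top-K step is a bounded max-heap per query point: keep at most K candidates and evict the farthest. A rough plain-Scala sketch of that per-point logic (names are mine; distributing it across an RDD is a separate step):]

```scala
import scala.collection.mutable

// Keep the k closest neighbors seen so far using a size-bounded max-heap:
// the head is the *farthest* current candidate, so it is the one to evict
// when something closer arrives. Runs in O(n log k) per query point.
def kNearest(query: (Double, Double),
             others: Seq[(Int, Double, Double)],
             k: Int): Seq[Int] = {
  // Order by distance so dequeue removes the farthest candidate.
  val heap = mutable.PriorityQueue.empty[(Double, Int)](
    Ordering.by((p: (Double, Int)) => p._1))
  for ((id, x, y) <- others) {
    val d = math.hypot(x - query._1, y - query._2)
    heap.enqueue((d, id))
    if (heap.size > k) heap.dequeue() // drop the farthest of k+1
  }
  heap.dequeueAll.map(_._2).reverse   // closest ids first
}

// Point 1 from the sample data against the other sample rows.
val demo = kNearest((68.0, 93.0),
                    Seq((2, 12.0, 90.0), (3, 45.0, 76.0), (100, 86.0, 54.0)),
                    2)
```

The bounded heap avoids sorting all n distances per point; at millions of points the remaining cost is the all-pairs distance computation itself, which is where spatial partitioning or approximate methods would come in.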