Hi, I have the following problem, which is a kind of special case of k nearest neighbours. I have an Array of Vectors (v1) and an RDD[(Long, Vector)] of pairs of vectors with indexes (v2). The array v1 easily fits into a single node's memory (~100 entries), but v2 is very large (millions of entries).
My goal is to find for each vector in v1 the entries in v2 with least distance. The naive solution would be to define a helper function that computes all the distances between a vector from v1 and all vectors in v2, sorts them, and returns the top n results: def computeDistances(vector: Vector, vectors: RDD[(Long, Vector)], n:Int=10): Seq[Long] = { vectors.map { emb => (emb._1, Vectors.sqdist(emb._2, centroid)) } .sortBy(_._2) // sort by value .map(_._1) // retain indexes only .take(n) } So I can map the entries (after getting the indexes to keep track of the mappings) in v1 to the distances: v1.zipWithIndexes.map{ v => (computeDistances(v._1, v2), v._2) } This gives me for each entry in v1 the indexes of the n closest entries in v2. However, as v1 is an array, the computeDistances() calls are all done sequentially (on the driver, if I understand correctly) rather than distributed. The problem is that I must not convert v1 into an RDD because that will result in an error due to nested RDD actions in computeDistance(). To conclude, what I would like to do (if it were possible) is this: val v1: Seq[Vector] = ... val v2: RDD[(Long, Vector)] = ... sc.parallelize(v1).zipWithIndexes .map{ v => (computeDistances(v._1, v2), v._2) } Is there any good practice to approach problems like this? Thanks! Carsten -- Carsten Schnober Doctoral Researcher Ubiquitous Knowledge Processing (UKP) Lab FB 20 / Computer Science Department Technische Universität Darmstadt Hochschulstr. 10, D-64289 Darmstadt, Germany phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111 schno...@ukp.informatik.tu-darmstadt.de www.ukp.tu-darmstadt.de Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources (AIPHES): www.aiphes.tu-darmstadt.de PhD program: Knowledge Discovery in Scientific Literature (KDSL) www.kdsl.tu-darmstadt.de -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-Nearest-Neighbours-tp23759.html Sent from the Apache Spark User List mailing list archive at Nabble.com.