Hi,
I have the following problem, which is a kind of special case of k
nearest neighbours.
I have an Array of Vectors (v1) and an RDD[(Long, Vector)] of pairs of
vectors with indexes (v2). The array v1 easily fits into a single node's
memory (~100 entries), but v2 is very large (millions of entries).

My goal is to find for each vector in v1 the entries in v2 with least
distance. The naive solution would be to define a helper function that
computes all the distances between a vector from v1 and all vectors in
v2, sorts them, and returns the top n results:

def computeDistances(vector: Vector, vectors: RDD[(Long, Vector)],
n:Int=10): Seq[Long] =  {
    vectors.map { emb => (emb._1, Vectors.sqdist(emb._2, centroid)) }
      .sortBy(_._2) // sort by value
      .map(_._1) // retain indexes only
      .take(n)
}

So I can map the entries (after getting the indexes to keep track of the
mappings) in v1 to the distances:

v1.zipWithIndexes.map{ v => (computeDistances(v._1, v2), v._2) }

This gives me for each entry in v1 the indexes of the n closest entries
in v2.
However, as v1 is an array, the computeDistances() calls are all done
sequentially (on the driver, if I understand correctly) rather than
distributed.

The problem is that I must not convert v1 into an RDD because that will
result in an error due to nested RDD actions in computeDistance().

To conclude, what I would like to do (if it were possible) is this:

val v1: Seq[Vector] = ...
val v2: RDD[(Long, Vector)] = ...
sc.parallelize(v1).zipWithIndexes
  .map{ v => (computeDistances(v._1, v2), v._2) }


Is there any good practice to approach problems like this?
Thanks!
Carsten


-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/K-Nearest-Neighbours-tp23759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Reply via email to