Hi, I'm new to Spark and Hadoop, and I'd like to know if the following problem is solvable in terms of Spark's primitives.
To compute the K nearest neighbours of each point in a dataset of N points, I can multiply my very large, row-normalized sparse matrix by its transpose. This yields all N x N pairwise similarity values (cosine similarities, since the rows are normalized), so I can then sort each row and keep only its K highest elements, resulting in an N x K dense matrix. As this Quora answer suggests (http://qr.ae/v03lY), rather than computing the row-wise dot products, which is O(N^2), it's better to compute the sum of the column outer products, which is O(N x K^2) (where this K is the number of non-zero elements per column, a different K from the one in KNN).

However, given the number of non-zero elements in the resulting matrix, it seems I can't afford to first perform the full N x N multiplication and only prune it down to N x K afterward, so I need a way to prune it on the fly. The original algorithm I came up with is roughly this, for an input matrix M:

    for each row i:
        outer_i = [0] * N
        for j in nonzero elements of row i:
            for k in nonzero elements of col j:
                outer_i[k] += M[i][j] * M[k][j]
        nearest_i = {sort outer_i and keep best K}

This can be parallelized in an "embarrassing" way, i.e. each compute node can simply process a slice of the rows (see the P.S. below for a concrete sketch of what I have in mind). Would there be a way to do something similar (or related) with Spark?

Christian
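
P.S. To make the question more concrete, here is a toy setup and the brute-force version I can't afford at scale (just a sketch using scipy.sparse; the sizes and all the names below are only illustrative):

    import numpy as np
    import scipy.sparse as sp

    N, D, K = 1000, 500, 10  # points, dimensions, neighbours to keep
    M = sp.random(N, D, density=0.01, format='csr', random_state=0)

    # L2-normalize the rows so that dot products are cosine similarities.
    row_norms = np.sqrt(M.multiply(M).sum(axis=1)).A.ravel()
    row_norms[row_norms == 0] = 1.0
    M = sp.diags(1.0 / row_norms).dot(M).tocsr()

    # The unaffordable version: materialize all N x N similarities, then prune.
    sims = M.dot(M.T).toarray()
    np.fill_diagonal(sims, -np.inf)                      # exclude each point itself
    nearest = np.argsort(sims, axis=1)[:, ::-1][:, :K]   # N x K, best first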
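
The row-at-a-time, pruned-on-the-fly version of my pseudocode above would then look like this (again a sketch; nearest_k is my own name, and I keep a CSC copy of M for the column access in the inner loop):

    Mc = M.tocsc()   # column-oriented copy, for "nonzero elements of col j"

    def nearest_k(i):
        # Accumulate row i of M * M^T from the column outer products,
        # without ever materializing the full N x N matrix.
        sims = np.zeros(N)
        for j, v in zip(M.indices[M.indptr[i]:M.indptr[i+1]],
                        M.data[M.indptr[i]:M.indptr[i+1]]):
            rows = Mc.indices[Mc.indptr[j]:Mc.indptr[j+1]]
            sims[rows] += v * Mc.data[Mc.indptr[j]:Mc.indptr[j+1]]
        sims[i] = -np.inf                        # exclude the point itself
        top = np.argpartition(sims, -K)[-K:]     # K largest, unordered
        return top[np.argsort(sims[top])[::-1]]  # best first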
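
And this is how I naively imagine the "embarrassingly parallel" version in Spark's Python API, shipping the two matrices to every worker as broadcast variables; I have no idea whether this is idiomatic Spark, or what to do when M is too large to broadcast, hence my question:

    from pyspark import SparkContext

    sc = SparkContext(appName="knn-sketch")

    # Assumes the M / Mc built above fit in memory on every worker.
    M_b, Mc_b = sc.broadcast(M), sc.broadcast(Mc)

    def nearest_k_bc(i):
        # Same accumulation as nearest_k above, reading from the broadcasts.
        m, mc = M_b.value, Mc_b.value
        sims = np.zeros(N)
        for j, v in zip(m.indices[m.indptr[i]:m.indptr[i+1]],
                        m.data[m.indptr[i]:m.indptr[i+1]]):
            rows = mc.indices[mc.indptr[j]:mc.indptr[j+1]]
            sims[rows] += v * mc.data[mc.indptr[j]:mc.indptr[j+1]]
        sims[i] = -np.inf
        top = np.argpartition(sims, -K)[-K:]
        return (i, top[np.argsort(sims[top])[::-1]].tolist())

    # Each worker processes a slice of the rows.
    nearest = sc.parallelize(range(N), 64).map(nearest_k_bc).collectAsMap()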