Yes, you need to use dimensionality reduction and/or locality-sensitive
hashing to reduce the number of pairs to compare. There is also an LSH
implementation for collections of vectors that I have just published here:
https://github.com/marufaytekin/lsh-spark. The implementation is based on this
paper:
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
I hope it will help.
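
For reference, here is a minimal sketch of the random-hyperplane
(sign-of-random-projection) scheme described in that paper; it is not the
lsh-spark API, and names such as "numHyperplanes" are illustrative only:

import scala.util.Random
import org.apache.spark.rdd.RDD

// One bit per hyperplane: the sign of the dot product with a random Gaussian vector.
def signature(v: Array[Double], hyperplanes: Array[Array[Double]]): String =
  hyperplanes.map { h =>
    val dot = v.zip(h).map { case (a, b) => a * b }.sum
    if (dot >= 0) '1' else '0'
  }.mkString

// Vectors sharing a signature fall into the same bucket, so only vectors inside
// a bucket need to be compared instead of all N^2 pairs.
def lshBuckets(points: RDD[(Long, Array[Double])],
               dim: Int,
               numHyperplanes: Int,
               seed: Long = 42L): RDD[(String, Iterable[(Long, Array[Double])])] = {
  val rnd = new Random(seed)
  val hyperplanes = Array.fill(numHyperplanes)(Array.fill(dim)(rnd.nextGaussian()))
  points.map { case (id, v) => (signature(v, hyperplanes), (id, v)) }
        .groupByKey()
}

Similar vectors agree on most hyperplane signs, so they tend to collide in the
same bucket; using several independent hash tables raises the recall.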

Maruf


On Thu, Aug 27, 2015 at 9:16 AM, Jaonary Rabarisoa <jaon...@gmail.com>
wrote:

> Thank you all for these links. I'll check them.
>
> On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack <charles.t.h...@gmail.com>
> wrote:
>
>> +1 to all of the above, especially dimensionality reduction and
>> locality-sensitive hashing / min hashing.
>>
>> There's also an algorithm implemented in MLlib called DIMSUM, which was
>> developed at Twitter for this purpose. I've been meaning to try it and
>> would be interested to hear about the results you get.
>>
>> https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
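>>
>> A minimal sketch, assuming a spark-shell session with "sc" available, of how
>> DIMSUM is exposed in MLlib via RowMatrix.columnSimilarities. One caveat: it
>> computes cosine similarities between columns, so the N vectors to compare
>> must be laid out as columns (each row of the matrix holds one dimension of
>> every point). The threshold value 0.1 below is just an example:
>>
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>
>> // Three toy points in 3 dimensions, stored column-wise: row i = dimension i.
>> val rows = sc.parallelize(Seq(
>>   Vectors.dense(1.0, 0.0, 2.0),
>>   Vectors.dense(0.0, 1.0, 1.0),
>>   Vectors.dense(3.0, 1.0, 0.0)))
>> val mat = new RowMatrix(rows)
>>
>> val exact  = mat.columnSimilarities()      // all-pairs, exact
>> val approx = mat.columnSimilarities(0.1)   // DIMSUM sampling, approximate but faster
>>
>> // CoordinateMatrix of (i, j, cosine similarity) entries with j > i.
>> approx.entries.collect().foreach(println)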
>>
>> Charlie
>>
>>
>>
>> On Wednesday, Aug 26, 2015 at 09:57, Michael Malak
>> <michaelma...@yahoo.com.invalid> wrote:
>>
>>> Yes. And a paper that describes using grids (actually varying grids) is
>>> http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
>>> In the Spark GraphX In Action book that Robin East and I are writing, we
>>> implement a drastically simplified version of this in chapter 7, which
>>> should become available in the MEAP in mid-September:
>>> http://www.manning.com/books/spark-graphx-in-action
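>>>
>>> As a rough sketch (not the book's code), once the (source, destination,
>>> similarity) triples of the k-NN relation have been computed, assembling the
>>> graph in GraphX is the easy part; knnEdges below is an assumed RDD of such
>>> triples:
>>>
>>> import org.apache.spark.graphx.{Edge, Graph}
>>>
>>> // knnEdges: RDD[(Long, Long, Double)] = (vertexId, neighbourId, weight)
>>> val edges = knnEdges.map { case (src, dst, w) => Edge(src, dst, w) }
>>>
>>> // Vertices not mentioned in any edge get the default attribute 0.0.
>>> val knnGraph: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 0.0)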
>>>
>>>
>>> ------------------------------
>>>
>>> If you don't want to compute all N^2 similarities, you need to implement
>>> some kind of blocking first, for example LSH (locality-sensitive hashing).
>>> A quick search gave this link to a Spark implementation:
>>>
>>>
>>> http://stackoverflow.com/questions/27718888/spark-implementation-for-locality-sensitive-hashing
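>>>
>>> The general blocking pattern, independent of which hash is used, looks
>>> roughly like this (not the code behind the link above; bucketOf is an
>>> assumed blocking function, e.g. an LSH signature, that maps a vector to a
>>> key):
>>>
>>> import org.apache.spark.rdd.RDD
>>>
>>> def candidatePairs(points: RDD[(Long, Array[Double])],
>>>                    bucketOf: Array[Double] => String)
>>>     : RDD[((Long, Array[Double]), (Long, Array[Double]))] =
>>>   points.map { case (id, v) => (bucketOf(v), (id, v)) }
>>>         .groupByKey()
>>>         .flatMap { case (_, members) =>
>>>           // only pairs that share a bucket are ever scored
>>>           val arr = members.toArray
>>>           for (i <- arr.indices; j <- (i + 1) until arr.length)
>>>             yield (arr(i), arr(j))
>>>         }
>>>
>>> The exact similarity function is then applied only to these candidate
>>> pairs, which is usually a small fraction of the full N^2.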
>>>
>>>
>>>
>>> On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>>> wrote:
>>>
>>> Dear all,
>>>
>>> I'm trying to find an efficient way to build a k-NN graph for a large
>>> dataset. Precisely, I have a large set of high-dimensional vectors (say
>>> d > 10000) and I want to build a graph where those high-dimensional points
>>> are the vertices and each one is linked to its k nearest neighbors based
>>> on some kind of similarity defined on the vertex space.
>>> My problem is to implement an efficient algorithm to compute the weight
>>> matrix of the graph. I need to compute N*N similarities, and the only way
>>> I know is to use the "cartesian" operation followed by a "map" operation
>>> on the RDD. But this is very slow when N is large. Is there a cleverer
>>> way to do this for an arbitrary similarity function?
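>>>
>>> Schematically, the current approach is roughly the following, where
>>> "similarity" stands for the arbitrary user-supplied function:
>>>
>>> import org.apache.spark.rdd.RDD
>>>
>>> def knnBruteForce(points: RDD[(Long, Array[Double])],
>>>                   k: Int,
>>>                   similarity: (Array[Double], Array[Double]) => Double)
>>>     : RDD[(Long, Array[(Long, Double)])] =
>>>   points.cartesian(points)
>>>         .filter { case ((i, _), (j, _)) => i != j }    // drop self-pairs
>>>         .map { case ((i, vi), (j, vj)) => (i, (j, similarity(vi, vj))) }
>>>         .groupByKey()                                  // N-1 scores per vertex
>>>         .mapValues(_.toArray.sortBy(-_._2).take(k))    // keep the k most similar
>>>
>>> This materialises all N^2 pairs, which is exactly the cost I am trying to
>>> avoid.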
>>>
>>> Cheers,
>>>
>>> Jao
>>>
>
