Document Similarity - Spark MLlib

satyajit vegesna Fri, 09 Dec 2016 11:54:02 -0800

Hi all,

I am trying to implement a Spark MLlib job to find the similarity between
documents (in my case, each document is basically a home address).


I believe I cannot use DIMSUM for my use case, as DIMSUM works well only
with tall, thin matrices, i.e. few columns and many rows.

Example matrix format for my use case:

                         doc1(address1)  doc2(address2)  ..........
      san mateo          0.73462         0
      san francisco      ..              ..
      san bruno          ..              ..
      .
      .
      .
      .

Here m (the number of columns, one per address) is going to be huge, as I
have many addresses, and n (the number of rows) is going to be thin compared
to m.
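
To make the layout above concrete, here is a minimal sketch in plain Python
(not Spark) of the computation I am after: each address is one column of the
matrix, and the goal is the cosine similarity between every pair of columns.
The sample addresses, the whitespace tokenization, and the use of raw term
counts as weights are all assumptions for illustration only; the real matrix
would hold weights like the 0.73462 shown above.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical sample addresses; the real data set is not shown in this post.
addresses = {
    "doc1": "12 main st san mateo",
    "doc2": "98 oak ave san bruno",
    "doc3": "12 main street san mateo",
}


def term_vector(text):
    """Sparse column of the matrix: term -> raw count for one address."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# One sparse column per document (address).
columns = {doc: term_vector(text) for doc, text in addresses.items()}

# All-pairs comparison: m columns give m * (m - 1) / 2 pairs.
similarities = {
    (a, b): cosine(columns[a], columns[b])
    for a, b in combinations(sorted(columns), 2)
}

for pair, score in sorted(similarities.items()):
    print(pair, round(score, 3))
```

With m columns there are m * (m - 1) / 2 pairs to compare, which is why a
huge m is the problem in my case.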

I would like to know if there is a way to leverage DIMSUM for my use case,
and if not, what other algorithm available in Spark MLlib I could try.

Regards,
Satyajit.
