Hi all, I am trying to implement a Spark MLlib job to find the similarity between documents (in my case, each document is basically a home address).
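To make the goal concrete, here is a rough sketch of the kind of pairwise similarity I have in mind. It is plain Python rather than Spark, it uses raw token counts instead of real term weights, and the sample addresses and the helper names `term_vector` and `cosine_similarity` are made up for illustration only.

```python
import math
from collections import Counter


def term_vector(address):
    """Lowercase the address and count its whitespace-separated tokens."""
    return Counter(address.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample addresses, not real data.
doc1 = term_vector("12 Main St San Mateo CA")
doc2 = term_vector("12 Main Street San Mateo CA")
doc3 = term_vector("99 Oak Ave San Bruno CA")

sim_12 = cosine_similarity(doc1, doc2)  # near-duplicate addresses
sim_13 = cosine_similarity(doc1, doc3)  # different street and city
```

With a huge number of addresses, doing this for every pair is exactly the part I need Spark to scale.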
I believe I cannot use DIMSUM for my use case, because DIMSUM works well only for matrices with few columns and many rows.

Example matrix format for my use case:

                 doc1(address1)   doc2(address2)   ..........
  san mateo      0.73462          0
  san francisco  ..               ..
  san bruno      ..               ..
  .
  .

Here m, the number of document columns, is going to be huge because I have many addresses, and n, the number of rows, is going to be small compared to m.

I would like to know if there is a way to leverage DIMSUM for my use case and, if not, what other algorithms available in Spark MLlib I can try.

Regards,
Satyajit.