Re: Document Similarity -Spark Mllib

Liang-Chi Hsieh Sat, 10 Dec 2016 03:45:08 -0800

Hi Satyajit,

I am not sure why you think DIMSUM cannot be applied to your use case. Or have
you tried it and run into problems?


In the paper [1] the authors do say that they concentrate on the regime
where the number of rows is very large and the number of columns is not too
large, but I don't think that prevents you from applying it to a dataset with
many columns. By the way, in another paper [2] they ran experiments on a
dataset with 10^7 columns.

Even if the number of columns is very large, DIMSUM should still work well as
long as your dataset is very sparse and you use SparseVector. You can also
adjust the threshold when using DIMSUM.
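
To make the sparsity and threshold points a bit more concrete, here is a rough
single-machine sketch in Python of the sampling idea as I understand it. In
MLlib itself the entry point is RowMatrix.columnSimilarities(threshold). The
function name column_similarities, the dict-based row format, and the
gamma = 10 * log(n) / threshold formula below are assumptions made for
illustration, not the library's code.

```python
import math
import random
from collections import defaultdict


def column_similarities(rows, num_cols, threshold=0.0, seed=0):
    """Estimate cosine similarities between columns of a sparse matrix.

    rows: list of dicts {column_index: value}, holding only non-zero entries.
    threshold: 0.0 gives the exact result; larger values sample more heavily.
    Returns a dict {(i, j): similarity} with i < j.
    """
    rng = random.Random(seed)

    # Column L2 norms, computed from the non-zero entries only.
    sq = [0.0] * num_cols
    for row in rows:
        for j, v in row.items():
            sq[j] += v * v
    mags = [math.sqrt(s) for s in sq]

    # Oversampling parameter (assumed form, see the note above).
    if threshold < 1e-6:
        gamma = math.inf
    else:
        gamma = 10.0 * math.log(num_cols) / threshold
    sg = math.sqrt(gamma)

    # p[j]: probability of keeping an entry of column j.
    # q[j]: scaling factor that keeps the estimate unbiased.
    p = [1.0 if m == 0.0 else min(1.0, sg / m) for m in mags]
    q = [min(sg, m) for m in mags]

    sims = defaultdict(float)
    for row in rows:
        entries = sorted(row.items())
        # Work per row depends on its non-zeros, not on num_cols.
        for a, (i, vi) in enumerate(entries):
            if p[i] >= 1.0 or rng.random() < p[i]:
                for j, vj in entries[a + 1:]:
                    if p[j] >= 1.0 or rng.random() < p[j]:
                        sims[(i, j)] += (vi / q[i]) * (vj / q[j])
    return dict(sims)


# Tiny example: 2 rows, 3 columns, stored sparsely.
rows = [{0: 1.0, 1: 1.0}, {0: 1.0, 2: 2.0}]
exact = column_similarities(rows, num_cols=3)
approx = column_similarities(rows, num_cols=3, threshold=0.5)
```

Columns that never share a row produce no output pair at all, which is why a
very sparse dataset stays cheap even with a huge column count. Raising the
threshold lowers gamma, so entries of high-magnitude columns are sampled less
often, trading accuracy on low-similarity pairs for less work.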


[1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix
Square using MapReduce (DIMSUM)"
[2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity
Computation"




-----
Liang-Chi Hsieh | @viirya
Spark Technology Center