Hi Satyajit,

I am not sure why you think DIMSUM cannot be applied to your use case. Or have you tried it and run into some problems?
It is true that in the paper [1] the authors say they concentrate on the regime where the number of rows is very large and the number of columns is not too large. But I don't think that prevents you from applying it to a dataset with a large number of columns. By the way, in another paper [2] they ran experiments on a dataset with 10^7 columns.

Even if the number of columns is very large, DIMSUM should still work well if your dataset is very sparse and you use SparseVector. You can also adjust the threshold when using DIMSUM.

[1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix Square using MapReduce (DIMSUM)"
[2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity Computation"

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
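
P.S. In case it helps, here is a rough sketch in plain Python of the sampling idea as I read it in [1]. It is not Spark code and not MLlib's actual implementation. The names column_similarities, gamma and seed are made up for illustration. In MLlib the entry point is RowMatrix.columnSimilarities(threshold), and as far as I understand a higher threshold means more aggressive sampling, which plays the role of a smaller gamma here. The sketch assumes each row stores only its nonzero entries, which is what using SparseVector gives you.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def column_similarities(rows, gamma=math.inf, seed=0):
    """Estimate cosine similarities between the columns of a sparse matrix.

    rows  : list of dicts {column_index: value}, nonzero entries only
    gamma : oversampling parameter (hypothetical name); math.inf turns
            sampling off, so the result is the exact cosine similarity
    seed  : seed for the sampler, so that runs are repeatable
    """
    rng = random.Random(seed)

    # Column norms ||c_j||, computed from the stored nonzeros only.
    squared = defaultdict(float)
    for row in rows:
        for j, v in row.items():
            squared[j] += v * v
    norm = {j: math.sqrt(s) for j, s in squared.items()}

    sims = defaultdict(float)
    for row in rows:
        # The work per row depends on the nonzeros in that row,
        # not on the total number of columns.
        for (j, vj), (k, vk) in combinations(sorted(row.items()), 2):
            nn = norm[j] * norm[k]
            p = min(1.0, gamma / nn)  # probability of keeping this pair
            if rng.random() < p:
                # p * nn equals nn when p == 1 and gamma when p < 1,
                # so the sampled sum is rescaled to stay unbiased.
                sims[(j, k)] += vj * vk / (p * nn)
    return dict(sims)


if __name__ == "__main__":
    data = [{0: 1.0, 1: 2.0}, {1: 1.0, 2: 3.0}, {0: 2.0, 2: 1.0}]
    print(column_similarities(data))             # exact, no sampling
    print(column_similarities(data, gamma=1.0))  # sampled estimate
```

The point is that the cost is driven by the number of nonzeros per row, and that pairs of columns with large norms are sampled rather than always counted. That is why a very wide but very sparse matrix should still be manageable.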