Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/19340

@mgaido91 @srowen I have the same concern as @Kevin-Ferret and @viirya. I don't find any normalization of the vectors before training, and the update of the centers seems incorrect. The arithmetic mean of all points in a cluster is not automatically the new cluster center:

For EUCLIDEAN distance, we update the center to minimize the squared loss, and the arithmetic mean is the closed-form solution;

For COSINE similarity, we update the center to *maximize the cosine similarity*, and the solution is the arithmetic mean only if all vectors are of unit length.

MATLAB's doc for kmeans says: "One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after *normalizing those points to unit Euclidean length*."

I think RapidMiner's implementation of KMeans with cosine similarity is wrong if it just assigns the new center to the arithmetic mean.

Some references:
[Spherical k-Means Clustering](https://www.jstatsoft.org/article/view/v050i10/v50i10.pdf)
[Scikit-Learn's example: Clustering text documents using k-means](http://scikit-learn.org/dev/auto_examples/text/plot_document_clustering.html)
https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-in-kmeans
https://www.quora.com/How-can-I-use-cosine-similarity-in-clustering-For-example-K-means-clustering
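To illustrate the difference between the two center updates, here is a minimal NumPy sketch (not Spark code; the function names are mine, for illustration only). It shows that the plain arithmetic mean is pulled toward large-magnitude points, while the spherical-k-means update normalizes each point to unit Euclidean length before averaging, so every point contributes only its direction:

```python
import numpy as np

def update_center_euclidean(points):
    """Euclidean k-means: the arithmetic mean minimizes the squared loss."""
    return points.mean(axis=0)

def update_center_cosine(points):
    """Cosine (spherical) k-means: normalize each point to unit Euclidean
    length *before* averaging, so the update reflects directions only."""
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    return unit.mean(axis=0)

# Two points in orthogonal directions with very different magnitudes.
pts = np.array([[10.0, 0.0],
                [0.0, 1.0]])

# The plain mean is dominated by the large-magnitude point ...
plain = update_center_euclidean(pts)   # -> [5.0, 0.5]
# ... while the normalized mean treats both directions equally.
spherical = update_center_cosine(pts)  # -> [0.5, 0.5]
```

If the implementation under review uses the plain mean for the COSINE case, its centers will drift toward high-norm points even when their directions are unrepresentative of the cluster.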