Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/19340
  
    @mgaido91 @srowen   I have the same concern as @Kevin-Ferret and @viirya.
    I don't see any normalization of the vectors before training, and the center update looks incorrect.
    The arithmetic mean of all points in a cluster is not automatically the new cluster center:
    For EUCLIDEAN distance, we update the center to minimize the squared loss, and the arithmetic mean is the closed-form solution;
    For COSINE similarity, we update the center to *maximize the cosine similarity*, and the solution is the arithmetic mean only if all vectors have unit length.
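    To make the distinction concrete (my notation, not from the PR): with a unit-length center $c$, maximizing the summed cosine similarity gives
    
    $$\max_{\|c\|=1} \sum_i \frac{x_i^\top c}{\|x_i\|} \;\Rightarrow\; c^* = \frac{\sum_i x_i/\|x_i\|}{\left\|\sum_i x_i/\|x_i\|\right\|},$$
    
    i.e. the normalized mean of the *unit-scaled* points, which coincides with the direction of the plain arithmetic mean only when all $x_i$ already have unit length.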
    
    MATLAB's documentation for kmeans describes the cosine distance as "One minus the cosine of the included angle between points (treated as vectors). Each centroid is the mean of the points in that cluster, after *normalizing those points to unit Euclidean length*."
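    As a minimal sketch of that update rule (plain Scala, not the actual code in this PR; the helper names are mine):
    
    ```scala
    // Spherical k-means style centroid update: scale each point to unit
    // Euclidean length, take the arithmetic mean, then renormalize the result.
    def l2Norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)
    
    def toUnit(v: Array[Double]): Array[Double] = {
      val n = l2Norm(v)
      v.map(_ / n)
    }
    
    def cosineCentroid(points: Seq[Array[Double]]): Array[Double] = {
      val dim = points.head.length
      // mean of the unit-length points
      val sum = points.map(toUnit).foldLeft(Array.fill(dim)(0.0)) { (acc, p) =>
        acc.zip(p).map { case (a, b) => a + b }
      }
      toUnit(sum.map(_ / points.size))
    }
    ```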
    
    I think RapidMiner's implementation of KMeans with cosine similarity is wrong if it just assigns the new center to be the arithmetic mean.
    
    Some references:
    [Spherical k-Means Clustering](https://www.jstatsoft.org/article/view/v050i10/v50i10.pdf)
    
    [Scikit-Learn's example: Clustering text documents using k-means](http://scikit-learn.org/dev/auto_examples/text/plot_document_clustering.html)
    
    https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-in-kmeans
    
    https://www.quora.com/How-can-I-use-cosine-similarity-in-clustering-For-example-K-means-clustering
    
    
    


